=Paper=
{{Paper
|id=Vol-1718/paper5
|storemode=property
|title=Design of an Extraction System for Definitional Contexts from Biomedical Corpora
|pdfUrl=https://ceur-ws.org/Vol-1718/paper5.pdf
|volume=Vol-1718
|authors=César Aguilar,Olga Acosta
|dblpUrl=https://dblp.org/rec/conf/ijcai/AguilarA16
}}
==Design of an Extraction System for Definitional Contexts from Biomedical Corpora==
César Aguilar and Olga Acosta
Pontificia Universidad Católica de Chile, Santiago de Chile
caguilara@uc.cl
Cognitiva Latinoamérica, Santiago de Chile
oacosta@cognitiva.la
Abstract

In this paper we present a general overview of the design of a methodology for extracting definitional contexts from biomedical corpora in Spanish, taking into account a set of processes performed by the following modules: (i) a term extractor based on a hybrid method; (ii) a set of verbs that configure the syntactic structure of a definitional context; and (iii) a chunker able to recognize those noun phrases that introduce a definition, considering the lexical relation of hyponymy/hypernymy, where the hyponym is the term defined and the hypernym is the Genus Term, which represents a conceptual category associated with that term.

1 Introduction

It is not surprising that, given the overwhelming amount of biomedical knowledge recorded in physical and electronic texts, there is currently an interest in developing semantic resources and tools oriented to improving the search and classification of biomedical concepts. Projects such as Gene Ontology [Smith et al., 2005] or the BioText Search Engine [Hearst et al., 2007] are good examples of systems capable of extracting and organizing concepts, taking into account lexical-semantic relationships expressed in natural language.

Most of these projects have been developed for English, given the large number of documents produced in that language. A paradigmatic example is PubMed, a search engine that primarily accesses the MEDLINE database of references and abstracts on biomedical topics. PubMed has been used in experiments oriented to the automatic classification of concepts extracted from large corpora [Smith et al., 2005].

However, in Latin America, including Chile, there are no such NLP projects. In order to fill this gap, we sketch here a method for extracting definitional contexts (abbreviated DCs), which are discursive structures that contain relevant information for defining a term. A DC has at least three constituents: a term, a definition, and a verbal phrase that links the two. Additionally, we can identify other linguistic or metalinguistic units whose function is to highlight the presence of a DC in a text, e.g. discursive and typographical patterns [Sierra et al., 2008; Acosta, Sierra and Aguilar, 2011]. An example is:

[In general Discursive Pattern], the [paraprofessional workers Term + Typographical Pattern] [are defined as Verbal Phrase] [those persons who are engaged in the provision of social care or social services, but who do not have professional training or qualifications Definition]

According to this example, the term paraprofessional workers is emphasized by the use of bold font; the verbal phrase are defined as links the term paraprofessional workers to the actual definition those persons who are engaged... The term, the verbal phrase and the definition are discursive units introduced by the pragmatic pattern in general.

We conceive our method around three central tasks:

- A term extraction that recognizes candidate terms using a hybrid method based on grammatical rules and stochastic techniques [Acosta, Aguilar and Infante, 2015].
- The use of a set of verbs that configure a specific kind of verbal phrase, called predicative phrases [Rothstein, 1983; Bowers, 1993; 2001], whose function is to link terms and definitions in a DC.
- The identification of lexical relations, particularly hyponymy/hypernymy relations, in order to detect candidate analytical (or Aristotelian) definitions, following the methods proposed by Hearst [1992], Wilks, Slator and Guthrie [1996], as well as Acosta, Sierra and Aguilar [2011; 2015].

Our paper is organized as follows: in Section 2 we describe in more detail the extraction of DCs from specialized corpora, attending to the role of predicative phrases (henceforth, PrPs) as grammatical linkers between terms and definitions. In Section 3 we briefly explain our term extractor and show some results generated by searching for biomedical terms in Spanish. In Section 4 we describe a set of verbs that syntactically work as heads of PrPs and introduce analytical definitions in a DC. In Section 5 we expose the methodology employed for identifying hyponyms and hypernyms expressed in biomedical Spanish documents, specifically those situated in DCs.

2 Extraction of DCs

The development of methods and electronic tools for extracting conceptual information from texts has become an important task in NLP, mainly related to computational lexicography [Wilks, Slator and Guthrie, 1996], terminology [Malaisé, Zweigenbaum and Bachimont, 2005] and, in recent years, the building of ontologies [Navigli and Velardi, 2004; Velardi, Faralli and Navigli, 2013]. Reviewing in detail the criteria used to perform this type of extraction, we can recognize three ideas in common:

- Concepts are represented, in a natural language, by words, phrases or sentences. Thus, a definition is a linguistic structure useful for expressing this conceptual information [Sierra et al., 2008].
- If definitions are linguistic representations of concepts, then it is possible to recognize regular patterns at the lexical, syntactic, semantic and discursive levels [Wilks, Slator and Guthrie, 1996].
- Statistical methods and computational tools can be used for searching and extracting these regular patterns in large corpora. The results are then evaluated in order to determine whether such patterns represent good or bad candidate definitions [Malaisé, Zweigenbaum and Bachimont, 2005].

In line with these works and ideas, Sierra et al. [2008] delineate a method for recognizing and extracting terms and definitions expressed in DCs. As we have mentioned before, terms, PrPs and definitions configure the core of a DC, because these units show a recurrent use in specialized documents. Additionally, discursive and typographical patterns can be seen as optional units whose function is to introduce or indicate a potential DC in a text. We can represent the relation between all these units in this scheme:

Figure 1, constitutive units of a DC structure

Having in mind this scheme, our proposal for extracting DCs in biomedical texts considers the identification of the main units, that is: terms, PrPs and definitions. Each unit is analyzed by a particular module, and the integration of all modules configures the architecture of our extraction system.

3 Term Extraction

We have developed a methodology for extracting single-word and multi-word terms from text corpora, reported in Acosta, Aguilar and Infante (2015). This methodology is supported by a hybrid approach, which includes both a linguistic and a statistical phase.

In the linguistic phase, the most frequent syntactic patterns are used to filter out candidate terms while, at the same time, removing non-relevant words from these candidates. In the statistical phase, a corpus comparison approach is used to rank domain words [Kit and Liu, 2008]. A word occurring in both the reference and the domain corpus is ranked using the relative frequency ratio [Manning and Schütze, 1999]. Given that words closely related to a domain should have a higher occurrence probability in that domain than in a reference corpus, we view a large reference corpus as an effective means for assigning relevance to domain words occurring in both corpora. If this ranking process is effective, the domain words will have higher weights than words not related to the domain.

For determining whether a word is a good term candidate, we consider the notions of termhood and unithood proposed by Kageura and Umino [1996]. Termhood is described as the degree to which a linguistic unit is related to domain-specific concepts. In contrast, unithood refers to the strength of the syntagmatic combinations and collocations which can be recognized as potential term candidates. Thus, in the final stage, the word ranking can be used to extract multi-word candidate terms, so that words with high weights will contribute to increasing the ranking of the noun phrases in which they are present (multi-word termhood). In the case of unithood, we consider this to be assured in part by a syntactic filter [Vivaldi and Rodríguez, 2007] and by the occurrence frequency of the noun phrase as a whole. Additionally, we propose implementing linguistic heuristics for automatically building a stopword list of non-relevant adjectives from the domain corpus. This is relevant since adjectives (primarily relational adjectives) have a compositional interpretation, so that traditional measures (e.g., mutual information) fail in the task of showing the unithood of multi-word candidates.

We pay attention to the terms represented by noun phrases (NPs) whose modifier is a relational adjective, because these adjectives assign a set of properties derived from an entity. In biomedical terminology, relational adjectives represent an important element for building specialized terms, e.g.: inguinal hernia, venereal disease, psychological disorder and others. For extracting these NPs with relational adjectives, we build a chunker that distinguishes the following patterns:
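The chunking pattern itself appears as an image in the original; a minimal stdlib sketch of such a noun-plus-adjective filter, assuming FreeLing-style tags (NC = common noun, AQ = qualifying adjective) and illustrative function names, might look like this:

```python
# Sketch of an NP chunker over (word, tag) pairs, assuming FreeLing-style
# tags (NC = common noun, AQ = adjective; other tags are skipped).
# This is an illustrative reconstruction, not the authors' grammar.
def extract_np_candidates(tagged):
    """Collect Noun + Adjective(s) chunks, e.g. 'hernia inguinal'."""
    chunks, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "NC":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "AQ":
                j += 1
            if j > i + 1:  # noun with at least one adjective modifier
                chunks.append(" ".join(word for word, _ in tagged[i:j]))
            i = j
        else:
            i += 1
    return chunks

# "la hernia inguinal es una protrusión" -> ['hernia inguinal']
sentence = [("la", "DA"), ("hernia", "NC"), ("inguinal", "AQ"),
            ("es", "VS"), ("una", "DI"), ("protrusión", "NC")]
print(extract_np_candidates(sentence))
```

A real implementation would add the adverb/verb restrictions discussed below and the non-relevant-adjective stopword filter.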
Here the RG, AQ and VAE tags correspond to adverbs, adjectives and the verb estar (Eng. to be), respectively. The remaining tags correspond to determiners, pronouns, punctuation signs and prepositions. The expression also includes a restriction to reduce noise, since elements wrongly tagged as adjectives are extracted without this constraint. These tags are part of the annotation system proposed for FreeLing (Carreras et al., 2004), which we have employed for tagging two corpora:

- A domain corpus composed of texts about human body diseases and related topics (surgeries, treatments, and so on) collected from MedlinePlus in Spanish. The size of this corpus is 1.2 million tokens.
- A reference corpus formed by news and articles from 2014, extracted from an online newspaper (La Jornada, www.lajornada.com.mx, a Mexican newspaper with information available online). The size of this corpus is about 5 million tokens.

Using this chunker and these patterns, we performed an experiment for identifying terms, comparing four measures proposed by the following works:

- The log-likelihood ratio implemented by Gelbukh et al. [2010], abbreviated LLR.
- The word rank difference employed by Kit and Liu [2008], abbreviated RD.
- The relative frequency ratio, considered by Manning and Schütze [1999], abbreviated RFR.
- Finally, a binomial approximation using the standard normal distribution, applied by Drouin [2003] in the TermoStat extraction system, abbreviated TS.

From a general point of view, an important step in our experiment is to eliminate noise from the terms by removing the non-relevant adjectives automatically obtained from the domain corpus, as well as those words whose relative frequency in the reference corpus is greater than in the domain corpus. Once all the non-relevant adjectives are detected, we generate a list used as a filter for removing them, and then we can extract those NPs with relational adjectives.

Finally, once this filter is applied, we obtained a precision of around 72.7% with the RFR measure and 70.5% with the RD measure, specifically in the first 1000 candidates detected. In the case of global recall, we obtained approximately 73%, also in the first 1000 candidates. In Tables 1 and 2 we show the results of our experiment, contrasting precision and recall.

Table 1, percentages of precision in the extraction of terms using the adjective filter taken from the reference corpus

       LLR    RD    RFR   TS
500    74.2   76.4  79.0  33.2
1000   66.4   70.5  72.7  28.9
1500   58.9   64.7  67.3  24.6
2000   53.9   64.5  60.7  18.7
2500   50.1   63.8  56.6  14.9
3000   48.4   60.1  53.8  12.4
3500   48.6   53.6  53.3
4000   49.4   48.6  49.5
4500   44.0   44.0  44.0
5000   39.6   39.6  39.6

Table 2, percentages of recall in the extraction of terms using the adjective filter taken from the reference corpus

       LLR    RD    RFR   TS
500    16.5   17.0  17.5  7.4
1000   29.5   31.3  32.3  12.8
1500   39.2   43.1  44.8  16.4
2000   47.8   57.3  53.9  16.6
2500   55.6   70.8  62.8  16.6
3000   64.4   80.1  71.6  16.6
3500   75.5   83.3  82.9
4000   87.6   86.3  87.8
4500   87.8   87.8  87.8
5000   87.8   87.8  87.8

4 DCs and PrPs

In the case of PrPs, according to the analysis reported by Sierra et al. [2008], as well as Aguilar, Acosta and Sierra [2010], these phrases configure the syntactic core of a DC. Syntactically, every PrP is structured around a relation X-is-a-subject-of/Y-is-a-predicate-of. This relation is regulated by a syntactic rule named the rule of predicate linking, proposed by Rothstein [1983]. This rule establishes a relation of saturation between the subject and the predicate, deriving two basic conditions:

I. X is the subject of the predicate of Y, if X is linked to Y.
II. If Y is the predicate of X, then Y cannot be predicated of anything else other than X.

Following Rothstein's explanation, Bowers [1993; 2001] develops a simple model to describe the syntactic configuration of these phrases. The PrP is mapped by a functional head, and its grammatical behaviour is similar to that of phrases such as the Inflectional Phrase (IP) or the Complement Phrase (CP).

Based on this description, we can infer two types of predicative phrases. A primary predication is a predicative phrase formed by a subject to the left of the verb and a predicate located to the right of the verb:

[Conjunctivitis [is [an inflammation of the conjunctiva of the eye NP] PrP] NP]
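A primary predication of this kind can be approximated with a shallow pattern over raw text. The sketch below is illustrative only: the verb list is a small assumed sample and the regular expression is not the authors' grammar.

```python
import re

# Illustrative shallow pattern for primary-predication DCs
# (Term + PrP + Definition); the definitional verbs listed here are
# an assumed sample, not the authors' full verb set.
DEF_VERBS = r"(?:es|son|significa|se define como|se entiende como)"
DC_PATTERN = re.compile(
    r"(?P<term>[A-ZÁÉÍÓÚÑ][\w\s-]*?)\s+" + DEF_VERBS +
    r"\s+(?P<definition>(?:una?|el|la)\s[^.]+)\."
)

text = "La conjuntivitis es una inflamación de la conjuntiva del ojo."
match = DC_PATTERN.search(text)
if match:
    print(match.group("term"))        # article + term: 'La conjuntivitis'
    print(match.group("definition"))  # 'una inflamación de la conjuntiva del ojo'
```

Note that the captured term still carries its article; a POS-based chunker, as described in Section 3, is what actually isolates the term in the system.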
In contrast, a secondary predication integrates a subject in pre-verbal position, plus an object and its predicate, both after the verb. In this case, the predicate affects the object of the sentence:

[Watson and Crick [define [the DNA [as a molecule [that carries the genetic instructions used in the development, functioning and reproduction of all known living organisms CP] PrP] NP] VP] IP]

A relevant difference observed between both examples is the explicit mention of the author(s) of the definition in the DC. According to Aguilar, Acosta and Sierra [2010], it is possible to determine two specific patterns:

(i) A pattern that follows the sequence Term + PrP + Definition, which is recognized as a primary predication.
(ii) Another pattern that follows the sequence Author + Term + PrP + Definition, which is recognized as a secondary predication.

Taking into account these kinds of PrPs, we can identify analytical definitions, assigning to their components, Genus Term and Differentia, a specific syntactic pattern. Thus, in the case of definitions associated with primary predications, the pattern is:

Table 3, construction pattern for primary predication linked to analytical definition

Definition                  Genus Term                 Differentia
Analytical (Primary PrP)    Noun Phrase = Noun +       CP = Relative Pronoun + IP
                            {AdjP/PP}*                 PP = Preposition + NP
                                                       AdjP = Adjective + NP

In contrast, in the case of analytical definitions related to secondary predications, the construction pattern is:

Table 4, construction pattern for secondary predication linked to analytical definition

Definition                  Adverb/Preposition  Genus Term         Differentia
Analytical (Secondary PrP)  Como / Por          NP = Noun +        CP = Relative Pronoun + IP
                                                {AdjP/PP}*         PP = Preposition + NP
                                                                   AdjP = Adjective + NP

The use of these patterns of PrPs for extracting terms and definitions has allowed good results to be reached. For example, Sierra et al. [2008], as well as Alarcón, Sierra and Bach [2008], explored specialized corpora about the human genome and medicine (among others), integrated into the BwanaNet system developed by the IULA-UPF (see http://bwananet.iula.upf.edu/index.htm), and they obtained a precision level of around 0.58 and a recall of 0.83 for analytical definitions linked to verbs used in primary predications such as ser (to be) and significar (to mean/to signify), and also to verbs used in secondary predications such as concebir (to conceive), definir (to define), entender (to understand) and identificar (to identify). Attending to the individual scores of these verbs, the most relevant are concebir (precision 0.71/recall 0.98) and definir (precision 0.84/recall 0.98), contrasting with others like entender (precision 0.36/recall 0.95) and identificar (precision 0.31/recall 0.90).

5 Hyponymy/hypernymy extraction

The results of the extraction of DCs using PrPs allow us to develop a method for recognizing analytical definitions, focusing on the detection of the Genus Term introduced by the verb that works as the head of the PrP. We face this detection task taking into account the prototype theory proposed by Rosch and Lloyd [1978], applied to the description of categorization processes. Based on this theory, we can recognize a distinction between basic and subordinate categories: in the first case, single-word terms represented by nouns such as enfermedad (disease), corazón (heart), sistema (system), etc., which represent basic categories, as opposed to the second case, where multi-word terms represent subordinate categories: enfermedad venérea (venereal disease), paro cardiaco (heart attack), sistema nervioso (nervous system), and others.

We used this distinction (single-word versus multi-word) not only for identifying terms, but also hyponyms and hypernyms, attending to the role of relational adjectives and the preposition de (of/from). We formulate a set of possible term patterns recognizable in medical documents:

Table 5, Term patterns

Pattern                                   Example
Noun + Adjective (Spanish) /              Enfermedad cardiovascular /
Adjective + Noun (English)                Cardiovascular disease
Noun + Prepositional Phrase (Spanish)     Enfermedad de Alzheimer /
                                          Alzheimer's disease
Noun + Noun                               Diabetes mellitus
Acronyms                                  VIH / HIV
Noun + Letter                             Vitamina A / Vitamin A
Letter + Noun                             H Pylori

In our experiments for finding hyponyms and hypernyms, we only consider relational adjectives [Acosta, Aguilar and Sierra, 2013; Acosta, Sierra and Aguilar, 2011; 2015], exploring a corpus of medical texts in Spanish with a size of 1.3 million words, collected from MedlinePlus.

In order to identify the patterns of NPs associated with hypernyms and hyponyms, we develop a heuristic based on the detection of relational adjectives. Thus, we
consider H as the set of all single-word hypernyms implicit in a corpus, and F as the set of the most frequent hypernyms in a set of candidate analytical definitions, established by a specific frequency threshold m:

F = {x | x ∈ H, freq(x) ≥ m}

On the other hand, NP is the set of noun phrases representing candidate categories:

NP = {np | head(np) ∈ F, modifier(np) is an adjective}

Subordinate categories C_b of a basic level b are those holding:

C_b = {np | head(np) ∈ F, modifier(np) is a relational adjective}

where modifier(np) represents an adjective modifying a noun phrase np with head b. Returning to Rosch and Lloyd [1978], these subcategories show relevant differences with respect to a basic level of categorization.

6 Design of a system for DC extraction

In the following section, we sketch our method for searching DCs, integrating in modules the tasks previously exposed.

6.1 Methodology

We focus our efforts on analytical definitions, assuming that such definitions are the best source for finding hyponymy-hypernymy relations. Our method starts by pre-processing a text corpus in order to tokenize it. Then we annotate this corpus with POS tags, using the TreeTagger [Schmid, 1994].

Once this is done, we employ syntactic and semantic filters for generating the first candidates for analytical definitions. The syntactic filter consists of a chunk grammar considering the verb characteristics of analytical definitions and their contextual patterns [Sierra et al., 2010], as well as the syntactic structure of the most common constituents, such as terms, synonyms and hypernyms.

On the other hand, the semantic phase filters candidates by means of a list of noun heads indicating part-whole and causal relations, as well as empty heads semantically not related to the defined term. An additional step extracts terms and hypernyms from the candidate set.

In the case of the extraction of subordinate categories, we consider NPs with relational adjectives as modifiers of a term. Figure 2 shows this process:

Figure 2, methodology for extracting subordinate categories

We obtain a set of NPs associated with relational adjectives together with their frequencies. Then, the NPs with hypernyms as heads are selected, and we calculate the pointwise mutual information (PMI) for each combination. Given its use in collocation extraction, we select a PMI measure, where PMI thresholds are established in order to filter non-relevant (NR) information. We considered the normalized PMI measure proposed by Bouma (2009):

NPMI(x, y) = PMI(x, y) / −log p(x, y), where PMI(x, y) = log(p(x, y) / (p(x) p(y)))

This normalized variant is chosen for two reasons: to use association measures whose values have a fixed interpretation, and to reduce sensitivity to low frequencies of data occurrence.

6.2 Corpus analysis and computational tools

As we have mentioned, our corpus is constituted by a set of medical documents, basically about human body diseases and related topics (surgeries, treatments, and so on), collected from MedlinePlus in Spanish. Additionally, we use the NLTK module [Bird, Klein and Loper, 2009], an open-source Python toolkit for analysing texts, in order to create a chunk parser for searching for candidate terms and hypernyms represented by NPs.

Integrating all the tasks exposed (the extraction of terms, the detection of PrPs associated with definitions, and the recognition of hyponyms/hypernyms), we conceive our methodology as the following sequence of steps:

i) Processing a corpus and inserting POS tags to start the extraction.
ii) Applying the syntactic and semantic filters for generating candidate DCs.
iii) Confirming the quality of these candidates if: (a) they are linked to a term linked to a PrP, and (b) they introduce a hyponymy/hypernymy relation between the term and the Genus Term of a definition.

In Figure 3 we sketch our method:
Figure 3, architecture of prototype system for extracting DCs
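The final module of this architecture, the NPMI-based filtering of subordinate categories described in Section 5, can be sketched as follows. This is a minimal reconstruction on toy data; the function names, threshold and input format are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def npmi(p_xy, p_x, p_y):
    """Normalized PMI (Bouma, 2009): PMI(x, y) / -log p(x, y), in [-1, 1]."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def subordinate_categories(pairs, hypernyms, threshold=0.0):
    """Rank (hypernym, relational-adjective) NPs by NPMI, keeping only
    those whose head belongs to the frequent-hypernym set F."""
    pair_freq, total = Counter(pairs), len(pairs)
    noun_freq = Counter(noun for noun, _ in pairs)
    adj_freq = Counter(adj for _, adj in pairs)
    ranked = []
    for (noun, adj), freq in pair_freq.items():
        if noun not in hypernyms or freq == total:  # skip degenerate p = 1
            continue
        score = npmi(freq / total, noun_freq[noun] / total,
                     adj_freq[adj] / total)
        if score >= threshold:
            ranked.append(((noun, adj), round(score, 3)))
    return sorted(ranked, key=lambda item: -item[1])

# Toy (noun, adjective) observations and hypernym set F
pairs = [("enfermedad", "cardiovascular")] * 3 + \
        [("enfermedad", "rara")] + [("sistema", "nervioso")] * 2
print(subordinate_categories(pairs, {"enfermedad"}))
```

As the toy run suggests, the frequent combination enfermedad cardiovascular scores higher than the context-sensitive enfermedad rara, which is the kind of separation the PMI threshold is meant to exploit.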
The architecture proposed here is an advance in the identification of DCs. According to the results reported by Acosta, Sierra and Aguilar [2015], the levels of precision and recall increase significantly when the detection of hyponyms and hypernyms is included, in comparison to the results shown by Alarcón, Sierra and Bach [2008]:

Table 6, Comparison of results

                                   Precision  Recall
Alarcón, Sierra and Bach [2008]    41%        46%
Acosta, Sierra and Aguilar [2015]  62%        58%

Hypernyms, as generic classes of a domain, are expected to be related to a great number of modifiers, such as relational adjectives reflecting categories more specific than the hypernyms themselves (e.g., cardiovascular disease), or simply descriptions sensitive to a specific context (e.g., rare disease). In Table 7, we show the hypernym enfermedad (Eng. disease) and the subset of the 50 most related adjectives, taking into account their PMI values. In this example, only 30 out of 50 (60%) are relevant relations. In total, disease is related to 132 adjectives, of which 76 (58%) can be considered relevant:

Table 7, First 50 adjectives linked to the noun enfermedad

7 Final considerations

In this paper we have delineated a method for extracting DCs from biomedical corpora in Spanish. Based on our preliminary results, we consider that we have achieved a considerable improvement by taking into account the role of hyponymy/hypernymy relations as an important element for validating authentic analytical definitions expressed in DCs.

This consideration allows us to observe a particular relation between syntactic structures and the lexical-semantic information formulated in such definitions: on the one hand, it is not enough to search for DCs based only on syntactic sequences, although such structures can be considered an interface for accessing that lexical-semantic information.

On the other hand, this task of recognizing hyponyms and hypernyms in DCs can be an important step for building ontologies based on textual information, in line with the model proposed by Buitelaar, Cimiano and Magnini [2005]. The hyponymy/hypernymy relation allows us to infer a conceptual hierarchy between terms (in our case, situated in a biomedical domain), according to the categorization formulated by experts of a specific area. Although it is necessary to explore other lexical-semantic relations (e.g. synonymy or meronymy), we can start with the advances achieved by our methodology, in order to implement our prototype system as well as possible.
References

[Acosta, Sierra and Aguilar, 2011] Olga Acosta, Gerardo Sierra and César Aguilar. Extraction of Definitional Contexts using Lexical Relations. International Journal of Computer Applications, 34(6):46-53, November 2011.

[Acosta, Aguilar and Infante, 2015] Olga Acosta, César Aguilar and Tomás Infante. Reconocimiento de términos en español mediante la aplicación de un enfoque de comparación entre corpus (Recognition of terms in Spanish through a corpus-comparison approach). Linguamática, 7(2):19-34, December 2015.

[Acosta, Aguilar and Sierra, 2015] Olga Acosta, César Aguilar and Gerardo Sierra. Extracting definitional contexts in Spanish through the identification of hyponymy-hyperonymy relations. In Jan Žižka and František Dařena (eds.), Modern Computational Models of Semantic Discovery in Natural Language, pages 48-70. IGI Global, Hershey, Pennsylvania, USA, 2015.

[Aguilar, Acosta and Sierra, 2010] César Aguilar, Olga Acosta and Gerardo Sierra. Recognition and extraction of definitional contexts in Spanish for sketching a lexical network. In Thamar Solorio and Ted Pedersen (eds.), Proceedings of the 1st Young Investigators Workshop on Computational Approaches to Languages of the Americas, pages 109-116. ACL Publications, Stroudsburg, USA, 2010.

[Alarcón, Sierra and Bach, 2008] Rodrigo Alarcón, Gerardo Sierra and Carme Bach. ECODE: A Pattern Based Approach for Definitional Knowledge Extraction. In Elisenda Bernal and Janet DeCesaris (eds.), Proceedings of the XIII EURALEX International Congress, pages 923-928. IULA-UPF, Barcelona, Spain, 2008.

[Bird, Klein and Loper, 2009] Steven Bird, Ewan Klein and Edward Loper. Natural Language Processing with Python. O'Reilly, Sebastopol, California, USA, 2009.

[Bouma, 2009] Gerlof Bouma. Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial GSCL Conference, pages 31-40, Potsdam, Germany, 2009.

[Bowers, 1993] John Bowers. The syntax of predication. Linguistic Inquiry, 24(4):591-636, 1993.

[Bowers, 2001] John Bowers. Predication. In Mark Baltin and Chris Collins (eds.), The Handbook of Contemporary Syntactic Theory, pages 299-333. Blackwell, Oxford, UK, 2001.

[Buitelaar, Cimiano and Magnini, 2005] Paul Buitelaar, Philipp Cimiano and Bernardo Magnini. Ontology Learning from Text. IOS Press, Amsterdam, The Netherlands, 2005.

[Carreras et al., 2004] Xavier Carreras, Isaac Chao, Lluís Padró and Muntsa Padró. FreeLing: An Open-Source Suite of Language Analyzers. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 2004.

[Drouin, 2003] Patrick Drouin. Term extraction using non-technical corpora as a point of leverage. Terminology, 9(1):99-115, 2003.

[Gelbukh et al., 2010] Alexander Gelbukh, Grigori Sidorov, Eduardo Lavin and Liliana Chanona. Automatic Term Extraction using Log-likelihood Based Comparison with General Reference Corpus. In Christina Hopfe, Yacine Rezgui, Elisabeth Métais, Alun Preece and Haijiang Li (eds.), Natural Language Processing and Information Systems, LNCS, pages 248-255. Springer, Berlin, 2010.

[Hearst, 1992] Marti Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, pages 539-545. ACL Publications, Nantes, France, 1992.

[Hearst et al., 2007] Marti Hearst, Anna Divoli, Harendra Guturu, Alex Ksikes, Preslav Nakov, Michael Wooldridge and Jerry Ye. BioText Search Engine: beyond abstract search. Bioinformatics, 23(16):2196-2197, August 2007.

[Kageura and Umino, 1996] Kyo Kageura and Bin Umino. Methods of automatic term recognition: A review. Terminology, 3(2):259-289, 1996.

[Kit and Liu, 2008] Chunyu Kit and Xiaoyue Liu. Measuring mono-word termhood by rank difference via corpus comparison. Terminology, 14(2):204-229, 2008.

[Malaisé, Zweigenbaum and Bachimont, 2005] Véronique Malaisé, Pierre Zweigenbaum and Bruno Bachimont. Mining defining contexts to help structuring differential ontologies. Terminology, 11(1):21-53, 2005.

[Manning and Schütze, 1999] Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, 1999.

[Navigli and Velardi, 2004] Roberto Navigli and Paola Velardi. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics, 30(2):151-179, 2004.

[Rosch and Lloyd, 1978] Eleanor Rosch and Barbara Lloyd. Cognition and Categorization. Erlbaum, Hillsdale, New Jersey, 1978.

[Rothstein, 1983] Susan Rothstein. The Syntactic Forms of Predication. Ph.D. thesis, MIT, Cambridge, Massachusetts, 1983.

[Schmid, 1994] Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994. Web site: www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

[Sierra et al., 2008] Gerardo Sierra, Rodrigo Alarcón, César Aguilar and Carme Bach. Definitional verbal patterns for semantic relation extraction. Terminology, 14(1):74-98, 2008.

[Smith et al., 2005] Barry Smith, Werner Ceusters, Bert Klagges, Jacob Köhler, Anand Kumar, Jane Lomax, Chris Mungall, Fabian Neuhaus, Alan L. Rector and Cornelius Rosse. Relations in biomedical ontologies. Genome Biology, 6(5):R46, 2005.

[Velardi, Faralli and Navigli, 2013] Paola Velardi, Stefano Faralli and Roberto Navigli. OntoLearn Reloaded: A Graph-based Algorithm for Taxonomy Induction. Computational Linguistics, 39(3):665-707, 2013.

[Vivaldi and Rodríguez, 2007] Jorge Vivaldi and Horacio Rodríguez. Evaluation of terms and term extraction systems: A practical approach. Terminology, 13(2):225-248, 2007.

[Wilks, Slator and Guthrie, 1996] Yorick Wilks, Brian M. Slator and Louise M. Guthrie. Electric Words: Dictionaries, Computers, and Meanings. MIT Press, Cambridge, Massachusetts, 1996.