=Paper=
{{Paper
|id=None
|storemode=property
|title=Improving the extraction of complex regulatory events from scientific text by using ontology-based inference
|pdfUrl=https://ceur-ws.org/Vol-714/Paper05_Kim.pdf
|volume=Vol-714
|dblpUrl=https://dblp.org/rec/conf/smbm/KimR10
}}
==Improving the extraction of complex regulatory events from scientific text by using ontology-based inference==
Jung-jae Kim∗1 , Dietrich Rebholz-Schuhmann∗2
1 School of Computer Engineering, Nanyang Technological University, Singapore
2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Email: Jung-jae Kim∗ - jungjae.kim@ntu.edu.sg; Dietrich Rebholz-Schuhmann - rebholz@ebi.ac.uk;
∗ Corresponding author
Abstract
Background: The extraction of complex events from biomedical text is a challenging task and requires in-depth
semantic analysis. Previous approaches associate lexical and syntactic resources with ontologies for the semantic
analysis, but fall short in testing the benefits from the use of domain knowledge.
Results: We developed a system that deduces implicit events from explicitly expressed events by using inference
rules that encode domain knowledge. We evaluated the system with the inference module on three tasks: First,
when tested against a corpus with manually annotated events, the inference module of our system contributes
53.2% of correct extractions, but does not cause any incorrect results. Second, the system overall reproduces
33.1% of the transcription regulatory events contained in RegulonDB (up to 85.0% precision) and the inference
module is required for 93.8% of the reproduced events. Third, we applied the system with minimum adaptations
to the identification of cell activity regulation events, confirming that the inference improves the performance of
the system also on this task.
Conclusions: Our research shows that the inference based on domain knowledge plays a significant role in extracting
complex events from text. This approach has great potential in recognizing the complex concepts of such
biomedical ontologies as Gene Ontology in the literature.
Background

The task of extracting events from text, called event extraction, is a complex process that requires various semantic resources to decipher the semantic features in the event descriptions. Previous approaches identify and represent the textual semantics of events (e.g. gene regulation, gene-disease relation) by associating lexical and syntactic resources with ontologies [1–5]. We further explore the usage of an ontology for incorporating domain knowledge into an event extraction system.

Events from text that have been hand-curated into relational databases by biologists are actually the products of scientific reasoning supported by the domain knowledge of the biologists. This process of reasoning is based on linguistic evidence of such language patterns as “A regulates B” and “expression of Gene C” which refer to the basic events of regulation and gene expression. These basic events can be combined into an event with the compositional structure “A regulates (the expression of Gene C)”, where the parentheses enclose the embedded event. In this paper, we call such an event consisting of multiple basic events a complex event and say that it has a compositional structure. We will show that the use of inference based on domain knowledge supports the extraction of complex events from text.

The previous approaches to extracting complex events combine the basic events into compositional structures according to the syntactic structures of source sentences. However, there are two open issues in curating the compositional structures into relational databases. First, the event descriptions in scientific papers are so complicated that it is often required to transform the compositional structures into the structures compatible with the semantic templates of the target databases. Second, an event can be represented across sentence boundaries, even in multiple sentences which are not linked via anaphoric expressions (e.g. ‘it’, ‘the gene’).

Biologists with sufficient domain knowledge have little problem in carrying out the two required tasks of structural transformation and evidence combination. Structural transformation is to find an event that has the same meaning as the original event but with a different structure, while evidence combination is to identify a new event that can be deduced from multiple events. We should encode the domain knowledge into a logical form so that our text mining systems can process the compositional structures of events, which are explicitly expressed in text and can be extracted by language patterns, to deduce the events with alternative structures and those implied by a combination of multiple events. We call the explicitly expressed events explicit events and the deduced events implicit events.

Several text mining systems have employed inference based on domain knowledge to fill in event templates [6–8]. They can also go beyond sentence boundaries and combine into an event frame the event attributes collected from different sentences. However, they do not use an ontology for representing the inference rules. Moreover, they primarily deal with flat-structured event frames whose participants are physical entities (e.g. protein, residue). To address these issues, we present a novel approach that represents events and domain knowledge with an ontology and combines basic events into a compositional structure where an event participant can be another simpler event.

We utilize Gene Regulation Ontology (GRO), a conceptual model for the domain of gene regulation [9]. The ontology has been designed for representing the compositional semantics of both biomedical text and the referential databases. GRO provides basic concepts and properties of the domain, which are from, and cross-linked to, such biomedical ontologies as Gene Ontology and Sequence Ontology. We use the concepts and properties of GRO to represent the domain knowledge in form of P→Q implications, which we call inference rules. We also represent explicit events from text with GRO and apply modus ponens to the inference rules and the explicit events to deduce implicit events.

We implemented a system of event extraction with the proposed inference module and evaluated it on three tasks, reporting that the inference significantly improves the system performance.

Results

We performed three evaluations to test our system. Each evaluation takes two steps to answer the following two questions, respectively: 1) How well does the system with the inference module extract events from text and 2) how much does the inference module contribute to the event extraction? First, we ran the system on a manually annotated corpus to estimate the performance of the system. Second, we used the system for a real-world task of populating RegulonDB, the referential database of E. coli transcription regulatory network, to prove the robustness of the system. The first two evaluations are based on the corpora used for our previously reported experiments [10]. Finally, we applied the system to a related task of extracting regulatory events on cell activities and compared the results with the GOA database [11]. While the first two evaluation tasks focus on E. coli, a prokaryotic model organism, the last task deals with human genes and cells.

Table 1 shows the event templates for the evaluations. The first two evaluations are to extract instances of the first three event templates in the table, while the last evaluation is to extract instances of the two last event templates. Our system deals with four properties of events: 1) agents which bind to gene regulatory regions or control gene expression and cell activities; 2) patients which are regulated by the agents; 3) polarity, which tells whether the agent regulates the patient positively or negatively; and 4) physical contact, which indicates whether the agent
regulates the patient directly by binding or indirectly through other agents. Since the three evaluations only consider the agents and patients, the event templates in Table 1 include only the two properties.

Table 1. Semantic templates for target events (columns: Semantic template; Gene Ontology concept) [...]

Evaluation against event annotation

We evaluated our system first against a manually annotated corpus. The corpus consists of 209 MEDLINE abstracts that contain at least one E. coli transcription factor (TF) name. Two curators have annotated E. coli gene regulatory events on the corpus and have agreed on the final release of the annotated corpus which is available online¹ (see [10] for details, including inter-annotator agreement).

We randomly divided the corpus into two sets: One for system development (i.e. training corpus) and the other for system evaluation (i.e. test corpus). The training corpus, consisting of 109 abstracts, has 250 events annotated, while the test corpus, consisting of 100 abstracts, has 375 events annotated. We manually constructed language patterns and inference rules, based on the training corpus and a review paper (see the Methods section for details).

The system successfully extracted 79 events from the test corpus (21.1% recall) and incorrectly produced 15 events (84.0% precision). We consider an extracted event as correct if its two participants and their roles (i.e. agent, patient) are correctly identified, following the evaluation criteria of the previous approaches [3, 12]. Among the 79 events, the system has correctly identified polarity of 46 events (58.2% precision) and physical contact of 51 events (64.6% precision), while these two features are not considered for estimating the system [...]

To understand the contribution of the inference on [...] contributes on 53.2% of the correct results. In addition, [...]

We have further focused on the events whose agents [...] among them (21.6% recall) and incorrectly produced 6 [...]

We analyzed the errors of the system as follows: The false positives, in total 15 errors, are mainly due to the inappropriate application of the loose pattern matching method (7 errors) (see the Methods section for details). The other causes include parse errors (2), the neglect of negation (1), and an error in conversion from predicate argument structure to dependency structure (1). These results of error analysis indicate that the three incorrect events, which were extracted by the system with the inference module, are actually due to the incorrect outputs of the prior modules (e.g. pattern matching) passed to the inference module. In short, the inference module caused no incorrect results.

We also analyzed the false negatives. We found that 29.7% of the missing events (88/296) are due to the deficiency of the gene name dictionary and that 30.0% (68/296) are due to the lack of anaphora resolution. The rest of the missing events (40.3%) are thus dependent upon pattern matching and inference. It is hard to distinguish errors by pattern matching from those by the inference, because the inference module takes into consideration all semantics from an entire document (i.e. MEDLINE abstract) for the evidence combination. Therefore, the inference together with the pattern matching affects at most 40% of the false negatives.

¹ http://www.ebi.ac.uk/~kim/eventannotation/
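The recall and precision reported for the annotated test corpus follow directly from the stated counts (79 correct events, 15 incorrect events, 375 annotated events). The following is a minimal Python sketch of that arithmetic. The function name is hypothetical, and the F-score is an illustrative addition, since the text does not report one for this evaluation.

```python
def precision_recall_f1(correct, incorrect, gold_total):
    """Return (precision, recall, F1) from raw event counts.

    correct    -- extracted events that match an annotated event
    incorrect  -- extracted events that match no annotated event
    gold_total -- number of annotated events in the test corpus
    """
    extracted = correct + incorrect
    precision = correct / extracted if extracted else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the evaluation against the event annotation.
p, r, f = precision_recall_f1(correct=79, incorrect=15, gold_total=375)
print(f"precision={p:.1%} recall={r:.1%} F1={f:.2f}")
# prints: precision=84.0% recall=21.1% F1=0.34
```

The same function reproduces the relation between the recall, precision, and F-score values of the other evaluations when given the corresponding counts.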
Evaluation against RegulonDB

We tested the system against the real-world task of populating RegulonDB with E. coli transcriptional regulatory events from the literature. We used four corpora that are relevant to E. coli transcription regulation [10]: 1) the regulon.abstract corpus with 2,704 MEDLINE abstracts which are references of RegulonDB, 2) the regulon.fulltext corpus with the fulltexts of 436 references in RegulonDB, 3) the ecoli-tf.abstract corpus with 4,347 MEDLINE abstracts that contain at least one E. coli TF name, and 4) the ecoli-tf.fulltext with the fulltexts of 1,812 papers among those in the ecoli-tf.abstract.

We have measured the performance of the system for this evaluation task as follows: The precision is measured as the percentage of events found in RegulonDB among the unique events extracted by the system, while the recall is the percentage of the successfully extracted events among those curated in RegulonDB. The version of RegulonDB used for the evaluation is 6.2, containing 4,579 E. coli genes, 169 TFs, and 3,590 unique gene regulation events. This evaluation only considers events with TFs as agents because of the purpose of populating RegulonDB. The overall performance is as follows: F-score 0.44, precision 66.6%, and recall 33.1%. Table 2 shows the evaluation results over each test corpus, where the performance of the system without the inference is displayed within pairs of parentheses.

Corpus | Recall | Precision | F-score
ecoli-tf.abstract | 22.4% (0.3%) | 77.2% (50.0%) | 0.35 (0.01)
ecoli-tf.fulltext | 24.0% (1.5%) | 67.1% (76.1%) | 0.35 (0.03)
regulon.abstract | 17.1% (0.1%) | 85.0% (80.0%) | 0.28 (0.00)
regulon.fulltext | 14.1% (1.2%) | 74.0% (91.7%) | 0.24 (0.02)
Total | 33.1% (2.1%) | 66.6% (79.6%) | 0.44 (0.04)
Table 2. Evaluation against RegulonDB

Additionally, we analyzed the effect of event types. The precision for the events of the type “regulation of transcription” is 85%, higher than that of [12] (77% precision), while the overall precision (67%) is predictably lower than that since the system of [12] is developed specifically for extracting regulatory events on gene transcription. We included the events of the other two types, which are hypernyms of “regulation of transcription”, into the result set for the evaluation, because of the low recall for the events of “regulation of transcription” (5%). The overall recall (33%) is still lower than that of [12] (45% recall) because of the small size of the regulon.fulltext corpus (436 fulltexts). Note that [12] extracted 42% of RegulonDB events from 2,475 fulltexts of RegulonDB references. We plan to analyze a larger number of fulltexts in the future.

It is remarkable that the inference is indispensable for extracting 93.8% of the RegulonDB events that are extracted by our system from the corpora. In contrast, the inference module is involved in the extraction of only 3.2% of the false negative events. The percentage 93.8% is much higher than 53.2% of the first evaluation. The difference may be due to the fact that this second evaluation only counts unique events, while the first evaluation against the event annotations counts all extracted event instances. If so, these results may indicate that only a small amount of well-known events are frequently mentioned in papers in concise language forms, thus extracted by language patterns even without the help of inference, and that the rest of the events are expressed in papers with the detailed procedures of experiments which led to the discovery of the events.

Adaptation for regulation of cell activities

Rule-based systems are criticized for being too specific to the domains for which they have been developed, so much so that they cannot be straightforwardly adapted for other domains. To prove the adaptability of our system, we have applied it to a related topic: Regulation of cell activities.

The goal of this new task is to populate the GOA [11], concerning two Gene Ontology (GO) concepts: Regulation of cell growth (GO:0001558) (shortly, RCG) and regulation of cell death (GO:0031341) (shortly, RCD). GOA is a database which provides GO annotations to proteins. In short, the task is to identify the proteins that can be annotated with the two GO concepts. The semantic templates of the two event types are defined in Table 1.

The adaptation included only the following work: We manually collected keywords of the concepts ‘growth’ and ‘death’ from WordNet and constructed 40 patterns for the keywords by using MedEvi [13]. As candidate agents, we collected human gene/protein names from UniProt. We also collected cell type names from MeSH. These are newly built resources that were not required for the first two evaluation tasks. Existing language patterns and inference rules, for example for the concept ‘regulation’, were reused. We have not used any training corpus to further adjust the system to the new task.

We constructed a test corpus consisting of 13,136 abstracts by querying PubMed with two MeSH terms “Cell Death” and “Cell Enlargement”. The system with the inference module extracted 244 unique UniProt proteins associated with RCG events and 266 unique proteins associated with RCD events from the corpus. This evaluation also uses the two measures: Precision, the percentage of unique proteins found in GOA among the extracted proteins, and recall, the percentage of extracted proteins among the protein records in GOA. GOA contains 16 proteins among the 244 proteins of RCG events (6.6% precision) and 100 proteins among the 266 proteins of RCD events (37.6% precision). Currently (2010 July), the GOA has 155 proteins associated with RCG (10.3% recall) and 908 proteins associated with RCD (11.0% recall). These results show that our system can be applied to a related task with minimal adaptations.

We also tested the system without the inference module against the cell corpus. It identifies 193 proteins associated with RCG events and 198 proteins associated with RCD events. GOA contains 13 proteins among the 193 proteins of RCG events (6.7% precision) and 78 proteins among the 198 proteins of RCD events (39.4% precision). The precision almost does not change even after running without the inference module, while the recall drops about 20% without the inference module. This finding is similar to what we found from the results of the second evaluation such that the precision is independent from the inference, while the recall drops significantly without the inference module. But the relatively smaller drop of recall for the new task may indicate that the inference rules developed for the first two evaluations have less effects on the third evaluation than the other two evaluations.

We have manually inspected 20 out of the proteins that are extracted by our system but not found in GOA, for each event type. Among the 20 ‘false positive’ proteins of the RCD concepts, we found evidence that can support the association of 15 proteins with RCD concepts (75%). This means that the real precision can go up to 80% and more importantly that we can identify new protein instances of GO concepts by using our system. Among the 20 ‘false positive’ proteins of the RCG concepts, we located evidence only for 8 proteins (40%). After careful inspection, we realized that the precision of the RCG-related proteins is much lower than that of the RCD-related proteins because the language patterns for RCG events, which we collected from WordNet, are not specific to cell size growth, but may also refer to cell proliferation and development which should be linked to the other GO concepts “cell proliferation” (GO:0008283) and “cell development” (GO:0048468). The lack of training corpus led to this problem, and so we plan to extend the experiment to other GO concepts, establishing training corpora for the concept identification in text.

Discussion

As explained in the Background section, the inference rules we introduce in this paper are to deduce implicit events from explicit events. Note that unless the explicit events contain enough evidence to an implicit event, we cannot deduce the implicit event from the explicit events. In other words, the implicit events are alternative representations of the extracted information, where the implicit events do not convey new information compared to the explicit events. The performance comparison between the system with the inference and that without the inference is, in a sense, to see which representations better fit for the target templates, where the inference rules are designed to produce results that better match the target templates.

The previous event extraction systems often utilize rules or models whose semantics directly reflect the target event templates, thus embedding linguistic and domain knowledge together. In contrast, our approach of separating the inference rules from the linguistic resources has the following characteristics: 1) We can represent the semantics of sentences, which are relevant to event extraction, according to the syntactic structures of the sentences, independently from target semantic templates [5]; 2) we can construct language patterns for event extraction without respect to target semantics, considering the compositional aspect of events, which has led to the development of phrase-level patterns rather than sentence-level or clause-level patterns [14]; and 3) we can add or remove language patterns according to their semantic categories, not worrying about the side-effect of domain-specific patterns, which makes the patterns highly reusable, as shown in the third test case.

Conclusions

We proposed a novel approach to event extraction, using an ontology to represent the semantics of lexical, syntactic, and pragmatic resources. We focused on extracting regulatory events on gene expression and cell activities, which are very important to molecular biology and disease studies. Our system shows the full complexity in the identification of such complex events from the literature and may guide the ontology development to innovative ways of integrating various knowledge resources.

Methods

Our system first recognizes mentions of individual GRO instances in text, which can be the event components. It then combines them into compositional structures of explicit events by using language patterns. The system performs inference based on domain knowledge to deduce implicit events from the explicit events. It finally extracts the events that match pre-defined event templates. Both explicit and implicit events may fit for the database event templates.

Figures 1a and 1b show the examples of the extracted events. Figure 1a depicts the three types of structures from the input text: Dependency structure, explicit event, and implicit event. An arrow between the syntactic and semantic structures indicates a correspondence link between two structures for a phrase. The explicit event is composed from phrasal structures to sentential structures by using the patterns in Table 3. The implicit event is deduced from the explicit events by [...] for TranscriptionFactorBindingSiteOfDNA. Figure 1b shows that the explicit events of the two sentences are [...] 4 is used for the deduction. The overall workflow of the [...]

Figure 1: Examples of event extraction. Each panel shows an input text with its dependency structure, explicit event, and implicit event (fit for database template).
(a) Example 1. Input text: In addition, both himA and himD lesions caused a sevenfold reduction in expression of a phi(fimA-lacZ) operon fusion in strains in which fimA was locked in the on phase.
(b) Example 2. Input text: The function of OmpR appears to be the enhancement of a basal level of ompC expression. From the results of our experiments, the site of action of OmpR was [...]

Figure 2: System workflow.
Input text
→ Named entity recognition (Lexicon) → Named entity annotated text
→ Parsing (Parser) → Dependency structure
→ Pattern matching (Syntactic-Semantic Paired Patterns (Table 3); GRO) → (Explicit) Textual semantics represented with GRO
→ Inference (Inference Rules (Table 4)) → (Explicit+Implicit) Textual semantics
→ Extraction (Database Template Semantics (Table 1)) → Events of pre-defined types

No. | Syntactic pattern / Semantic pattern
1 | (expression Noun (of Prep Object:Gene)) / [...]
2 | (reduction Noun (in Prep Object:Patient)) / [...]
3 | (lesion Noun Object:Patient) / [...]
4 | (cause Verb Subject:Agent Object:Patient) / [...]
Table 3. Example patterns
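The inference step of the workflow (deducing implicit events from explicit events until nothing new is produced) can be illustrated with a small Python sketch. This is not the authors' Prolog implementation. The two rule functions are loose paraphrases of the prose descriptions of Rules 3 and 4, the concept name `BindingToRegulatoryRegion` and the flat gene-name patient of the concluded event are assumptions made for illustration, and the input events are a toy example inspired by the OmpR text of Figure 1b.

```python
from dataclasses import dataclass
from typing import Optional, Set, Union


@dataclass(frozen=True)
class Event:
    """An event instance; field names loosely follow GRO properties."""
    concept: str
    patient: Union[str, "Event"]
    agent: Optional[str] = None
    polarity: Optional[str] = None
    physical_contact: Optional[str] = None


def specialise_regulation(events: Set[Event]) -> Set[Event]:
    """Paraphrase of Rule 3: a RegulatoryProcess whose patient is a
    GeneExpression event is a RegulationOfGeneExpression."""
    return {
        Event("RegulationOfGeneExpression", e.patient, e.agent, e.polarity)
        for e in events
        if e.concept == "RegulatoryProcess"
        and isinstance(e.patient, Event)
        and e.patient.concept == "GeneExpression"
    }


def combine_binding_and_regulation(events: Set[Event]) -> Set[Event]:
    """Paraphrase of Rule 4: an agent that binds the regulatory region of a
    gene and regulates the expression of that gene regulates it by contact."""
    derived = set()
    for b in events:
        if b.concept != "BindingToRegulatoryRegion":  # hypothetical name
            continue
        for r in events:
            if (r.concept == "RegulationOfGeneExpression"
                    and r.agent == b.agent
                    and isinstance(r.patient, Event)
                    and r.patient.patient == b.patient):
                derived.add(Event("RegulationOfTranscription", b.patient,
                                  r.agent, r.polarity, "yes"))
    return derived


def infer(explicit: Set[Event], rules) -> Set[Event]:
    """Apply every rule repeatedly until no additional event is generated."""
    events = set(explicit)
    while True:
        new = set().union(*(rule(events) for rule in rules)) - events
        if not new:
            return events
        events |= new


# Toy explicit events, possibly coming from two different sentences.
explicit = {
    Event("RegulatoryProcess", Event("GeneExpression", "ompC"),
          agent="OmpR", polarity="positive"),
    Event("BindingToRegulatoryRegion", "ompC", agent="OmpR"),
}
rules = [specialise_regulation, combine_binding_and_regulation]
implicit = infer(explicit, rules) - explicit
```

In this sketch the second rule can only fire after the first one has produced its conclusion, which is why the rules are applied in a loop rather than in a single pass. Because the two conditions of the second rule are matched against the whole event set of a text, they may be satisfied by events from different sentences.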
No. | Condition(s) ⇒ Conclusion
[...] (the conclusion of Rule 4 ends with hasPhysicalContact=“yes”)
Table 4. Example inference rules

Named entity recognition

We have adopted a dictionary-based approach for named entity recognition. The dictionary contains 15,881 gene/protein and operon names of E. coli, including the names of 169 E. coli TFs, collected from RegulonDB and SwissProt. The recognized names are grounded with UniProt identifiers and labeled with relevant GRO concepts among the following: Gene, Protein, Operon, and TranscriptionFactor.

Parsing

We have utilized Enju, the HPSG parser [15], for syntactic analysis of sentences. While the Enju parser produces predicate-argument structures, we have developed a module to convert them into dependency structures and selectively merged the predicate-argument structure into the dependency structure. We have identified the dependency structure for the loose matching of language patterns explained below.

Pattern Matching

To identify the explicit events from sentences, the system utilizes syntactic-semantic paired patterns, matching the syntactic patterns to the dependency structures [...] Each pattern is a pair of a syntactic pattern and a [...] surrounded by another pair of parentheses, which forms [...] indicate their roles with respect to their head items (e.g. [...] indicate the semantics of the dependent items. If the se[...] (e.g. Gene), the variable should match a semantic cate[...] corresponding syntactic pattern. The semantic pattern is represented with GRO concepts (e.g. RegulatoryProcess, GeneExpression) and properties (e.g. hasAgent, hasPatient).

The system tries to match the syntactic patterns to the dependency structures of sentences in a bottom-up way. For example, it matches from Pattern 1 to Pattern 4 in Table 3 to the dependency structure of the example (1) depicted in Figure 1a. In the process, it considers the syntactic and semantic constraints of the syntactic patterns. For instance, the item ‘cause’ of the fourth pattern in Table 3 should match the verb ‘cause’ that has both a subject and an object.

Once a syntactic pattern is successfully matched to a node of dependency structure, its corresponding semantic pattern is assigned to the node as one of its semantics. If the syntactic pattern has dependent items with semantic variables (e.g. Subject:Agent, Object:Patient), the variables (e.g. Agent, Patient) are replaced with the semantics of the children of the node that have been matched to the dependent items. In this way, the semantics of multiple phrases is combined into sentential semantics. In Figure 1a, the small boxes with dashed lines show the semantics assigned to the internal nodes of the example (1), which are later combined into the textual sentential semantics.

Note that the node ‘lesions’ is assigned two pieces of semantics for the two gene names that are the children of the node (i.e. himA, himD). The explicit textual semantics of Figure 1a is one of the two, while the other is
a duplicate of Sem1 except that the gene name ‘himA’ plicit events in Figure 1a has a cascaded structure with
is replaced with ‘himD’. four basic event instances (i.e. three RegulatoryProcess,
One important feature of the pattern matching is one GeneExpression) and is transformed by Rules 1 and
that we loosely match the syntactic patterns to the de- 2 to fit for the database template that has only two event
pendency structures. For instance, the gene name ‘fimA’ instances (i.e. RegulationOfGeneExpression, GeneEx-
is not a direct child of the preposition ‘of’, but is matched pression). Rule 3 deduces the specific event type Regula-
to the item Object:Gene of the first pattern in Table 4. tionOfGeneExpression from a general type of event (i.e.
We have decided to match a dependent item not only RegulatoryProcess). Rule 4 reflects the domain knowl-
to a direct child of the node matched to the head item, edge that if a transcription factor both binds to the reg-
but also to any descendant of the node. The feature is ulatory region of a gene and regulates the gene’s expres-
based on two reasons: First, it is practically impossible sion level, it is the transcriptional regulator of the gene.
to construct all potential patterns for the event extrac- Note that the two conditions of Rule 4 can be matched
tion, though a reasonably large number of patterns for to events from any sentences; in other words, Rule 4
gene regulation have been accumulated; and second, the can merge multiple evidence from different sentences into
lexical entries not matched to any of the patterns for gene a fact. The function polarity sum works exactly like
regulation (e.g. ‘sevenfold’, ‘operon’, ‘fusion’) might not NXOR (Not Exclusive OR) operation in Boolean logic.
affect the extraction of the events. The rules are repeatedly applied over the explicit events
This loose matching still works under the following from a given text until no additional event is generated.
strict conditions: 1) An item with a syntactic role (e.g. We have implemented a program that converts the
Subject) can be matched to one of descendants under inference rules into Prolog programming codes and a
the sub-tree with the syntactic role; 2) once an item Prolog application that executes the rules over input
is matched to a node, it is not further matched to the events. We could not use the OWL-DL reasoners (e.g.
node’s descendants; and 3) it does not jump over clausal Pellet) because of the DL-safe restriction of the rea-
boundaries (e.g. ‘which’) and several exceptional words soners. DL-safe restriction assumes that all instances
(e.g. ‘except’). of rules, both in conditions and in conclusions, should
be available at the knowledge base [17]. Unfortunately, however, the rules for the event
extraction generate new instances of events and event attributes in the conclusions.
Nonetheless, we can still utilize the reasoners to validate the ontology populated with
the extracted events.

Inference
The inference step transduces explicit textual semantics (or events) into implicit
semantics (or events). It deduces a new specific event instance, if possible, by combining
any two or more general events. The inference module takes as input the explicit events
from a text (i.e. a MEDLINE abstract or a full text) identified by the previous module of
pattern matching. It applies to the explicit events the inference rules that reflect
common sense knowledge and domain knowledge, as exemplified in Table 4.
An inference rule has the propositional logic form of P→Q, where P is a set of conditions
and Q is the conclusion. It works with the modus ponens rule (i.e. P, P→Q ⊢ Q). That is,
if all the conditions P of a rule match some of the identified events from a text, the
conclusion Q is instantiated and then added as an additional event of the text. As the
input events are represented with GRO, the inference rules and their resultant events are
also represented with GRO.
We have constructed 28 inference rules for dealing with the compositional structures of
gene regulation events (e.g. Rules 1, 2) and for deducing biological events from the
combination of linguistic events (e.g. Rules 3, 4) by consulting the training corpus and
the review paper [16] (see Table 4).
For example, Rules 1 and 2 flatten, if possible, the compositional structure of event
descriptions. The ex-

Extraction
The system finally selects the events that match given semantic templates among those
resulting from either pattern matching or inference. Table 1 shows the event templates.
The variables are marked with ‘?’ and are matched to the instances of the concepts
referred to by the variables. For example, the variable “?Protein” can be matched to a
protein name. Non-variable concepts and properties are used as semantic restrictions on
the events to be extracted. For example, the last template in Table 1 can be matched to an
instance of NegativeRegulation, which is a child of RegulatoryProcess. In addition, the
patient of the instance should be an instance of CellDeath and the agent can be a gene,
where Gene is a descendant of MolecularEntity.

Authors' contributions
JJK conceived the study, designed and implemented the system, carried out the evaluations
and drafted the manuscript. DRS motivated and coordinated the study
and revised the manuscript.

Acknowledgements
We would like to thank Vivian Lee and Ruth Lovering for their contribution to the event
annotation, and Nick Luscombe and Aswin Seshasayee for helping us to learn the domain
knowledge of gene transcription regulation. We would also like to thank the anonymous
reviewers for their valuable comments.

References
1. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human
protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004,
20(5):604–611.
2. Cimiano P, Reyle U, Saric J: Ontology-based discourse analysis for information
extraction. Data & Knowledge Engineering 2005, 55:59–83.
3. Saric J, Jensen LJ, Rojas I: Large-scale extraction of gene regulation for model
organisms in an ontological context. In Silico Biology 2005, 5:21–32.
4. Hunter L, Lu Z, Firby J, Baumgartner WA, Johnson HL, Ogren PV, Cohen KB: OpenDMAP: an
open source, ontology-driven concept analysis engine, with applications to capturing
knowledge regarding protein transport, protein interactions and cell-type-specific gene
expression. BMC Bioinformatics 2008, 9:78.
5. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J: Overview of BioNLP’09 shared task on event
extraction. In Proceedings of the Workshop on BioNLP: Shared Task 2009:1–9.
6. Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P: Protein structures and information
extraction from biological texts: The PASTA system. Bioinformatics 2003, 19:135–143.
7. Narayanaswamy M, Ravikumar KE, Vijay-Shanker K: Beyond the clause: extraction of
phosphorylation information from medline abstracts. Bioinformatics 2005,
21(Suppl. 1):i319–i327.
8. Culotta A, McCallum A, Betz J: Integrating probabilistic extraction models and data
mining to discover relations and patterns in text. In Proceedings of Human Language
Technology Conference of the North American Chapter of the Association of Computational
Linguistics 2006:296–303.
9. Beisswanger E, Lee V, Kim JJ, Rebholz-Schuhmann D, Splendiani A, Dameron O, Schulz S,
Hahn U: Gene Regulation Ontology (GRO): Design Principles and Use Cases. Studies in Health
Technology and Informatics 2008, 136:9–14.
10. Hahn U, Tomanek K, Buyko E, Kim JJ, Rebholz-Schuhmann D: How Feasible and Robust is
the Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab
and Real-Life Conditions. In Proceedings of BioNLP 2009 2009:37–45.
11. Barrell D, Dimmer E, Huntley RP, Binns D, O’Donovan C, Apweiler R: The GOA database in
2009 – an integrated Gene Ontology Annotation resource. Nucleic Acids Research 2009,
37:D396–D403.
12. Rodríguez-Penagos C, Salgado H, Martínez-Flores I, Collado-Vides J: Automatic
reconstruction of a bacterial regulatory network using Natural Language Processing. BMC
Bioinformatics 2007, 8:293.
13. Kim JJ, Pezik P, Rebholz-Schuhmann D: MedEvi: Retrieving textual evidence of relations
between biomedical concepts from Medline. Bioinformatics 2008, 24(11):1410–1412.
14. Kim JJ, Chae YS, Choi KS: Phrase-Pattern-based Korean to English Machine Translation
using Two Level Translation Pattern Selection. In Proceedings of 38th Association for
Computational Linguistics (ACL) 2000:31–36.
15. Sagae K, Miyao Y, Tsujii J: HPSG parsing with shallow dependency constraints. In
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics,
Prague, Czech Republic 2007:624–631.
16. Browning DF, Busby SJW: The regulation of bacterial transcription initiation. Nature
Reviews Microbiology 2004, 2:57–65.
17. Motik B, Sattler U, Studer R: Query answering for OWL-DL with rules. Web Semantics:
Science, Services and Agents on the World Wide Web 2005, 3:41–60.
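
Supplementary sketch. As an illustration of the Inference and Extraction steps described in the Methods, the following Python fragment shows one way that rules of the form P→Q could be applied by modus ponens and the resulting events filtered by a semantic template. It is a minimal sketch under stated assumptions, not the system's implementation: the concept hierarchy is a hypothetical fragment of a GRO-like ontology, `flatten_rule` is an invented rule in the spirit of the flattening rules (it is not one of the 28 rules of Table 4), the template encoding is our reading of the last template of Table 1, and applying rules repeatedly until no new event appears is an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical fragment of a GRO-like concept hierarchy (child -> parent).
PARENT = {
    "NegativeRegulation": "RegulatoryProcess",
    "Gene": "MolecularEntity",
}


def is_a(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or descends from it."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


@dataclass(frozen=True)
class Node:
    """An entity or event instance; events may nest as agent or patient."""
    concept: str
    name: str = ""
    agent: Optional["Node"] = None
    patient: Optional["Node"] = None


Rule = Callable[[frozenset], Iterable[Node]]


def flatten_rule(events: frozenset) -> Iterable[Node]:
    """Hypothetical rule: if a regulatory event has a GeneExpression event
    as its agent, conclude the same regulation with the expressed gene as agent."""
    for e in events:
        if (is_a(e.concept, "RegulatoryProcess")
                and e.agent is not None
                and e.agent.concept == "GeneExpression"
                and e.agent.patient is not None):
            yield Node(e.concept, agent=e.agent.patient, patient=e.patient)


def infer(explicit: Iterable[Node], rules: Iterable[Rule]) -> set:
    """Apply P -> Q rules; each instantiated conclusion Q is added as a new
    event. Repeating until nothing changes is an assumption of this sketch."""
    events = set(explicit)
    rules = list(rules)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for conclusion in list(rule(frozenset(events))):
                if conclusion not in events:
                    events.add(conclusion)
                    changed = True
    return events


def matches_template(e: Node) -> bool:
    """Assumed template: RegulatoryProcess(agent=?MolecularEntity, patient=CellDeath)."""
    return (is_a(e.concept, "RegulatoryProcess")
            and e.agent is not None
            and is_a(e.agent.concept, "MolecularEntity")
            and e.patient is not None
            and is_a(e.patient.concept, "CellDeath"))


# Explicit event, as pattern matching might produce it for a phrase such as
# "expression of geneX inhibits cell death" (invented example).
gene = Node("Gene", name="geneX")
expression = Node("GeneExpression", patient=gene)
explicit = [Node("NegativeRegulation", agent=expression, patient=Node("CellDeath"))]

all_events = infer(explicit, [flatten_rule])
selected = [e for e in all_events if matches_template(e)]
```

In this toy run the explicit event alone does not satisfy the template, because its agent is an event rather than a molecular entity. Only the event deduced by the rule is selected, which mirrors the role the paper attributes to the inference module.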