Improving the extraction of complex regulatory events from
scientific text by using ontology-based inference
Jung-jae Kim∗1, Dietrich Rebholz-Schuhmann∗2

1 School of Computer Engineering, Nanyang Technological University, Singapore
2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK


Email: Jung-jae Kim∗ - jungjae.kim@ntu.edu.sg; Dietrich Rebholz-Schuhmann - rebholz@ebi.ac.uk;

∗ Corresponding author


Abstract
Background: The extraction of complex events from biomedical text is a challenging task and requires in-depth
semantic analysis. Previous approaches associate lexical and syntactic resources with ontologies for the semantic
analysis, but fall short in testing the benefits from the use of domain knowledge.
Results: We developed a system that deduces implicit events from explicitly expressed events by using inference
rules that encode domain knowledge. We evaluated the system with the inference module on three tasks: First,
when tested against a corpus with manually annotated events, the inference module of our system contributes
53.2% of correct extractions, but does not cause any incorrect results. Second, the system overall reproduces
33.1% of the transcription regulatory events contained in RegulonDB (up to 85.0% precision) and the inference
module is required for 93.8% of the reproduced events. Third, we applied the system with minimum adaptations
to the identification of cell activity regulation events, confirming that the inference improves the performance of
the system also on this task.
Conclusions: Our research shows that the inference based on domain knowledge plays a significant role in extracting
complex events from text. This approach has great potential in recognizing the complex concepts of such
biomedical ontologies as Gene Ontology in the literature.


Background
The task of extracting events from text, called event extraction, is a complex process that requires
various semantic resources to decipher the semantic features in the event descriptions. Previous
approaches identify and represent the textual semantics of events (e.g. gene regulation, gene-disease
relation) by associating lexical and syntactic resources with ontologies [1–5]. We further explore the
usage of an ontology for incorporating domain knowledge into an event extraction system.

Events from text that have been hand-curated into relational databases by biologists are actually the
products of scientific reasoning supported by the domain knowledge of the biologists. This process of
reasoning is based on linguistic evidence of such language patterns as “A regulates B” and “expression
of Gene C”, which refer to the basic events of regulation and gene expression. These basic events can
be combined into an event with the compositional structure “A regulates (the expression of Gene C)”,
where the parentheses enclose the embedded event. In this paper, we call such an event consisting of
multiple basic events a complex event and say that it has a compositional structure. We will show that
the use of inference based on domain knowledge supports the extraction of complex events from text.

The previous approaches to extracting complex events combine the basic events into compositional
structures according to the syntactic structures of source sentences. However, there are two open
issues in curating the compositional structures into relational databases. First, the event
descriptions in scientific papers are so complicated that it is often required to transform the
compositional structures into structures compatible with the semantic templates of the target
databases. Second, an event can be represented across sentence boundaries, even in multiple sentences
which are not linked via anaphoric expressions (e.g. ‘it’, ‘the gene’).

Biologists with sufficient domain knowledge have little problem in carrying out the two required tasks
of structural transformation and evidence combination. Structural transformation is to find an event
that has the same meaning as the original event but with a different structure, while evidence
combination is to identify a new event that can be deduced from multiple events. We should encode the
domain knowledge into a logical form so that our text mining systems can process the compositional
structures of events, which are explicitly expressed in text and can be extracted by language patterns,
to deduce the events with alternative structures and those implied by a combination of multiple events.
We call the explicitly expressed events explicit events and the deduced events implicit events.

Several text mining systems have employed inference based on domain knowledge to fill in event
templates [6–8]. They can also go beyond sentence boundaries and combine into an event frame the event
attributes collected from different sentences. However, they do not use an ontology for representing
the inference rules. Moreover, they primarily deal with flat-structured event frames whose participants
are physical entities (e.g. protein, residue). To address these issues, we present a novel approach
that represents events and domain knowledge with an ontology and combines basic events into a
compositional structure where an event participant can be another simpler event.

We utilize Gene Regulation Ontology (GRO), a conceptual model for the domain of gene regulation [9].
The ontology has been designed for representing the compositional semantics of both biomedical text and
the referential databases. GRO provides basic concepts and properties of the domain, which are from,
and cross-linked to, such biomedical ontologies as Gene Ontology and Sequence Ontology. We use the
concepts and properties of GRO to represent the domain knowledge in the form of P→Q implications, which
we call inference rules. We also represent explicit events from text with GRO and apply modus ponens to
the inference rules and the explicit events to deduce implicit events.

We implemented a system of event extraction with the proposed inference module and evaluated it on
three tasks, reporting that the inference significantly improves the system performance.

Results
We performed three evaluations to test our system. Each evaluation takes two steps to answer the
following two questions, respectively: 1) How well does the system with the inference module extract
events from text and 2) how much does the inference module contribute to the event extraction? First,
we ran the system on a manually annotated corpus to estimate the performance of the system. Second, we
used the system for a real-world task of populating RegulonDB, the referential database of the E. coli
transcription regulatory network, to demonstrate the robustness of the system. The first two
evaluations are based on the corpora used for our previously reported experiments [10]. Finally, we
applied the system to a related task of extracting regulatory events on cell activities and compared
the results with the GOA database [11]. While the first two evaluation tasks focus on E. coli, a
prokaryotic model organism, the last task deals with human genes and cells.

Table 1 shows the event templates for the evaluations. The first two evaluations are to extract
instances of the first three event templates in the table, while the last evaluation is to extract
instances of the last two event templates. Our system deals with four properties of events: 1) agents
which bind to gene regulatory regions or control gene expression and cell activities; 2) patients which
are regulated by the agents; 3) polarity, which tells whether the agent regulates the patient
positively or negatively; and 4) physical contact, which indicates whether the agent regulates the
patient directly by binding or indirectly through other agents. Since the three evaluations only
consider the agents and patients, the event templates in Table 1 include only the two properties.

 Semantic template                          Gene Ontology concept

 <RegulationOfGeneExpression                Regulation of gene expression (GO:0010468)
    hasAgent=?Protein
    hasPatient=<GeneExpression
       hasPatient=?Gene>>

 <RegulationOfTranscription                 Regulation of transcription (GO:0045449)
    hasAgent=?Protein
    hasPatient=<Transcription
       hasPatient=?Gene>>

 <BindingOfTFToTFBindingSiteOfDNA           Transcription factor binding (GO:0008134)
    hasAgent=?TranscriptionFactor
    hasPatient=
       <RegulatoryDNARegion
           hasPatient=?Gene>>

 <RegulatoryProcess                         Regulation of cell growth (GO:0001558)
    hasAgent=?MolecularEntity
    hasPatient=<CellGrowth
       hasAgent=?Cell>>

 <RegulatoryProcess                         Regulation of cell killing (GO:0031341)
    hasAgent=?MolecularEntity
    hasPatient=<CellDeath
       hasAgent=?Cell>>

   Table 1. Semantic templates for target events

Evaluation against event annotation
We evaluated our system first against a manually annotated corpus. The corpus consists of 209 MEDLINE
abstracts that contain at least one E. coli transcription factor (TF) name. Two curators have annotated
E. coli gene regulatory events on the corpus and have agreed on the final release of the annotated
corpus, which is available online (footnote 1; see [10] for details, including inter-annotator
agreement).

We randomly divided the corpus into two sets: One for system development (i.e. training corpus) and the
other for system evaluation (i.e. test corpus). The training corpus, consisting of 109 abstracts, has
250 events annotated, while the test corpus, consisting of 100 abstracts, has 375 events annotated. We
manually constructed language patterns and inference rules, based on the training corpus and a review
paper (see the Methods section for details).

The system successfully extracted 79 events from the test corpus (21.1% recall) and incorrectly
produced 15 events (84.0% precision). We consider an extracted event as correct if its two participants
and their roles (i.e. agent, patient) are correctly identified, following the evaluation criteria of
the previous approaches [3, 12]. Among the 79 events, the system has correctly identified the polarity
of 46 events (58.2% precision) and the physical contact of 51 events (64.6% precision); in line with
the same criteria, these two features are not considered for estimating the system performance.

To understand the contribution of the inference to the system, we have run the system without the
inference module. It then extracts only 37 out of the successfully extracted 79 events, which indicates
that the inference contributes to 53.2% of the correct results. In addition, the inference was involved
in the extraction of only three out of the 15 incorrectly extracted events. This result supports our
claim that logical inference can effectively deduce implicit textual semantics from explicit textual
semantics.

We have further focused on the events whose agents are TFs for the purpose of comparing our system with
[3, 12]. The test corpus has 305 events with TFs as agents. The system has successfully extracted 66
events among them (21.6% recall) and incorrectly produced 6 events (91.7% precision). This performance
is slightly better than that of [3] (90% precision, about 20% recall) and of [12] (84% precision).

We analyzed the errors of the system as follows: The false positives, in total 15 errors, are mainly
due to the inappropriate application of the loose pattern matching method (7 errors) (see the Methods
section for details). The other causes include parse errors (2), the neglect of negation (1), and an
error in conversion from predicate argument structure to dependency structure (1). These results of
error analysis indicate that the three incorrect events, which were extracted by the system with the
inference module, are actually due to the incorrect outputs of the prior modules (e.g. pattern
matching) passed to the inference module. In short, the inference module caused no incorrect results.

We also analyzed the false negatives. We found that 29.7% of the missing events (88/296) are due to the
deficiency of the gene name dictionary and that 30.0% (68/296) are due to the lack of anaphora
resolution. The rest of the missing events (40.3%) are thus dependent upon pattern matching and
inference. It is hard to distinguish errors by pattern matching from those by the inference, because
the inference module takes into consideration all semantics from an entire document (i.e. MEDLINE
abstract) for the evidence combination. Therefore, the inference together with the pattern matching
affects at most 40% of the false negatives.
  1 http://www.ebi.ac.uk/∼kim/eventannotation/
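
The percentages reported for the evaluation against event annotation can be re-derived from the raw
counts given in the text. The following Python sketch is only an illustration of that arithmetic; the
function names are hypothetical and are not part of the described system.

```python
def precision(correct: int, incorrect: int) -> float:
    """Share of extracted events that are correct."""
    return correct / (correct + incorrect)


def recall(correct: int, annotated: int) -> float:
    """Share of annotated events that the system extracted correctly."""
    return correct / annotated


def inference_contribution(correct_with: int, correct_without: int) -> float:
    """Share of correct events that are lost when inference is switched off."""
    return (correct_with - correct_without) / correct_with


def pct(x: float) -> float:
    """Express a ratio as a percentage rounded to one decimal place."""
    return round(100 * x, 1)


# Counts taken from the text: all events in the test corpus.
all_precision = pct(precision(correct=79, incorrect=15))
all_recall = pct(recall(correct=79, annotated=375))
contribution = pct(inference_contribution(correct_with=79, correct_without=37))

# Counts taken from the text: events whose agents are TFs.
tf_precision = pct(precision(correct=66, incorrect=6))
tf_recall = pct(recall(correct=66, annotated=305))

print(all_precision, all_recall, contribution, tf_precision, tf_recall)
```

With the counts above, the sketch reproduces the reported 84.0% precision, 21.1% recall, 53.2%
inference contribution, and the TF-only figures of 91.7% precision and 21.6% recall.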


Evaluation against RegulonDB
We tested the system against the real-world task of populating RegulonDB with E. coli transcriptional
regulatory events from the literature. We used four corpora that are relevant to E. coli transcription
regulation [10]: 1) the regulon.abstract corpus with 2,704 MEDLINE abstracts which are references of
RegulonDB, 2) the regulon.fulltext corpus with the fulltexts of 436 references in RegulonDB, 3) the
ecoli-tf.abstract corpus with 4,347 MEDLINE abstracts that contain at least one E. coli TF name, and 4)
the ecoli-tf.fulltext corpus with the fulltexts of 1,812 papers among those in the ecoli-tf.abstract
corpus.

We have measured the performance of the system for this evaluation task as follows: The precision is
measured as the percentage of events found in RegulonDB among the unique events extracted by the
system, while the recall is the percentage of the successfully extracted events among those curated in
RegulonDB. The version of RegulonDB used for the evaluation is 6.2, containing 4,579 E. coli genes, 169
TFs, and 3,590 unique gene regulation events. This evaluation only considers events with TFs as agents
because of the purpose of populating RegulonDB. The overall performance is as follows: F-score 0.44,
precision 66.6%, and recall 33.1%. Table 2 shows the evaluation results over each test corpus, where
the performance of the system without the inference is displayed within pairs of parentheses.

 Corpus               Recall     Precision   F-score
 ecoli-tf.abstract    22.4%      77.2%       0.35
                      (0.3%)     (50.0%)     (0.01)
 ecoli-tf.fulltext    24.0%      67.1%       0.35
                      (1.5%)     (76.1%)     (0.03)
 regulon.abstract     17.1%      85.0%       0.28
                      (0.1%)     (80.0%)     (0.00)
 regulon.fulltext     14.1%      74.0%       0.24
                      (1.2%)     (91.7%)     (0.02)
 Total                33.1%      66.6%       0.44
                      (2.1%)     (79.6%)     (0.04)

   Table 2. Evaluation against RegulonDB (in parentheses: the system without the inference)

Additionally, we analyzed the effect of event types. The precision for the events of the type
“regulation of transcription” is 85%, higher than that of [12] (77% precision), while the overall
precision (67%) is predictably lower than that, since the system of [12] is developed specifically for
extracting regulatory events on gene transcription. We included the events of the other two types,
which are hypernyms of “regulation of transcription”, into the result set for the evaluation, because
of the low recall for the events of “regulation of transcription” (5%). The overall recall (33%) is
still lower than that of [12] (45% recall) because of the small size of the regulon.fulltext corpus
(436 fulltexts). Note that [12] extracted 42% of RegulonDB events from 2,475 fulltexts of RegulonDB
references. We plan to analyze a larger number of fulltexts in the future.

It is remarkable that the inference is indispensable for extracting 93.8% of the RegulonDB events that
are extracted by our system from the corpora. In contrast, the inference module is involved in the
extraction of only 3.2% of the false negative events. The percentage 93.8% is much higher than the
53.2% of the first evaluation. The difference may be due to the fact that this second evaluation only
counts unique events, while the first evaluation against the event annotations counts all extracted
event instances. If so, these results may indicate that only a small number of well-known events are
frequently mentioned in papers in concise language forms, and are thus extracted by language patterns
even without the help of inference, and that the rest of the events are expressed in papers with the
detailed procedures of experiments which led to the discovery of the events.

Adaptation for regulation of cell activities
Rule-based systems are criticized for being too specific to the domains for which they have been
developed, so much so that they cannot be straightforwardly adapted for other domains. To demonstrate
the adaptability of our system, we have applied it to a related topic: Regulation of cell activities.

The goal of this new task is to populate the GOA [11], concerning two Gene Ontology (GO) concepts:
Regulation of cell growth (GO:0001558) (RCG for short) and regulation of cell death (GO:0031341) (RCD
for short). GOA is a database which provides GO annotations to proteins. In short, the task is to
identify the proteins that can be annotated with the two GO concepts. The semantic templates of the two
event types are defined in Table 1.

The adaptation included only the following work: We manually collected keywords of the concepts
‘growth’ and ‘death’ from WordNet and constructed 40 patterns for the keywords by using MedEvi [13]. As
candidate agents, we collected human gene/protein names from UniProt. We also collected cell type names
from MeSH. These are newly built resources that were not required for the first two evaluation tasks.
Existing language patterns and inference rules, for example for the concept ‘regulation’, were reused.
We have not used any training corpus to further adjust the system to the new task.

We constructed a test corpus consisting of 13,136 abstracts by querying PubMed with two MeSH terms
“Cell Death” and “Cell Enlargement”. The system with the inference module extracted 244 unique UniProt
proteins associated with RCG events and 266 unique proteins associated with RCD events from the corpus.
This evaluation also uses the two measures: Precision, the percentage of unique proteins found in GOA
among the extracted proteins, and recall, the percentage of extracted proteins among the protein
records in GOA. GOA contains 16 proteins among the 244 proteins of RCG events (6.6% precision) and 100
proteins among the 266 proteins of RCD events (37.6% precision). Currently (July 2010), the GOA has 155
proteins associated with RCG (10.3% recall) and 908 proteins associated with RCD (11.0% recall). These
results show that our system can be applied to a related task with minimal adaptations.

We also tested the system without the inference module against the cell corpus. It identifies 193
proteins associated with RCG events and 198 proteins associated with RCD events. GOA contains 13
proteins among the 193 proteins of RCG events (6.7% precision) and 78 proteins among the 198 proteins
of RCD events (39.4% precision). The precision hardly changes when running without the inference
module, while the recall drops by about 20% in relative terms. This finding is similar to what we found
from the results of the second evaluation, in that the precision is independent of the inference, while
the recall drops significantly without the inference module. But the relatively smaller drop of recall
for the new task may indicate that the inference rules developed for the first two evaluations have
less effect on the third evaluation than on the other two evaluations.

We have manually inspected 20 of the proteins that are extracted by our system but not found in GOA,
for each event type. Among the 20 ‘false positive’ proteins of the RCD concepts, we found evidence that
can support the association of 15 proteins with RCD concepts (75%). This means that the real precision
can go up to 80% and, more importantly, that we can identify new protein instances of GO concepts by
using our system. Among the 20 ‘false positive’ proteins of the RCG concepts, we located evidence only
for 8 proteins (40%). After careful inspection, we realized that the precision of the RCG-related
proteins is much lower than that of the RCD-related proteins because the language patterns for RCG
events, which we collected from WordNet, are not specific to cell size growth, but may also refer to
cell proliferation and development, which should be linked to the other GO concepts “cell
proliferation” (GO:0008283) and “cell development” (GO:0048468). The lack of a training corpus led to
this problem, and so we plan to extend the experiment to other GO concepts, establishing training
corpora for the concept identification in text.

Discussion
As explained in the Background section, the inference rules we introduce in this paper are to deduce
implicit events from explicit events. Note that unless the explicit events contain enough evidence for
an implicit event, we cannot deduce the implicit event from the explicit events. In other words, the
implicit events are alternative representations of the extracted information, where the implicit events
do not convey new information compared to the explicit events. The performance comparison between the
system with the inference and that without the inference is, in a sense, to see which representations
better fit the target templates, where the inference rules are designed to produce results that better
match the target templates.

The previous event extraction systems often utilize rules or models whose semantics directly reflect
the target event templates, thus embedding linguistic and domain knowledge together. In contrast, our
approach of separating the inference rules from the linguistic resources has the following
characteristics: 1) We can represent the semantics of sentences, which are relevant to event
extraction, according to the syntactic structures of the sentences, independently of target semantic
templates [5]; 2) we can construct language patterns for event extraction without respect to target
semantics, considering the compositional aspect of events, which has led to the development of
phrase-level patterns rather than sentence-level or clause-level patterns [14]; and 3) we can add or
remove language patterns according to their semantic categories, not worrying about the side-effect of
domain-specific patterns, which makes the patterns highly reusable, as shown in the third test case.

Conclusions
We proposed a novel approach to event extraction, using an ontology to represent the semantics of
lexical, syntactic, and pragmatic resources. We focused on extracting regulatory events on gene
expression and cell activities, which are very important to molecular biology and disease studies. Our
system shows the full complexity in the identification of such complex events from the literature and
may guide the ontology development to innovative ways of integrating various knowledge resources.

Methods
Our system first recognizes mentions of individual GRO instances in text, which can be the event
components. It then combines them into compositional structures of explicit events by using language
patterns. The system performs inference based on domain knowledge to deduce implicit events from the
explicit events. It finally extracts the events that match pre-defined event templates. Both explicit
and implicit events may fit the database event templates.

Figures 1a and 1b show examples of the extracted events. Figure 1a depicts the three types of
structures from the input text: Dependency structure, explicit event, and implicit event. An arrow
between the syntactic and semantic structures indicates a cor-


(a) Example 1

   Input Text: In addition, both himA and himD lesions caused a sevenfold reduction in expression of
   a phi(fimA-lacZ) operon fusion in strains in which fimA was locked in the on phase.

   Dependency Structure:
   (caused/VB,
    (-Subject- lesions/NN,
      (-Object- and/CC,
        -- both/CC,
        -- himA/NN:Gene,
        -- himD/NN:Gene)),
    (-Object- reduction/NN,
      -- a/DT,
      -- sevenfold/JJ,
      (-- in/IN,
        (-Object- expression/NN,
          (-- of/IN,
            (-Object- fusion/NN,
              (-Object- operon/NN,
                (-- (/LRB,
                  -- )/RRB,
                  -- fimA/NN:Gene,
                  -- lacZ/NN:Gene))))))))

   Explicit Event:
   <RegulatoryProcess
    hasAgent =
      <RegulatoryProcess
        hasPatient =
          <Protein name="himA">
        hasPolarity="negative">
    hasPatient =
      <RegulatoryProcess
        hasPatient =
          <GeneExpression
            hasPatient=
              <Gene name="fimA">>
        hasPolarity="negative">>

   Implicit Event (fit for Database Template):
   <RegulationOfGeneExpression
     hasAgent = <Protein name="himA">
     hasPatient = <GeneExpression hasPatient=<Gene name="fimA">>
     hasPolarity="positive">

(b) Example 2

   Input Text: The function of OmpR appears to be the enhancement of a basal level of ompC expression.
   From the results of our experiments, the site of action of OmpR was deduced to be in the vicinity
   of the upstream promoters of ompC.

   Explicit Event (1):
   <RegulationOfGeneExpression
    hasAgent =
      <TranscriptionFactor name="OmpR">
    hasPatient =
      <GeneExpression
        hasPatient =
          <Gene name="ompC">>
    hasPolarity="positive">

   Explicit Event (2):
   <RegulatoryDNARegion
    hasAgent = <Gene name="ompC">
    hasPart =
      <TranscriptionFactorBindingSiteOfDNA
        hasAgent =
          <TranscriptionFactor
            name="OmpR">>>

   Implicit Event (fit for Database Template):
   <RegulationOfTranscription
     hasAgent = <TranscriptionFactor name="OmpR">
     hasPatient = <Transcription hasPatient = <Gene name="ompC">>
     hasPolarity="positive" hasPhysicalContact="yes">

Figure 1: Examples of event extraction

    Input text
      | Named entity recognition    [resource: Lexicon]
    Named entity annotated text
      | Parsing                     [resource: Parser]
    Dependency structure
      | Pattern matching            [resources: Syntactic-Semantic Paired Patterns (Table 3), GRO]
    (Explicit) Textual semantics represented with GRO
      | Inference                   [resource: Inference Rules (Table 4)]
    (Explicit+Implicit) Textual semantics
      | Extraction                  [resource: Database Template Semantics (Table 1)]
    Events of pre-defined types

                                   Figure 2: System workflow


respondence link between two structures for a phrase. The explicit event is composed from
phrasal structures to sentential structures by using the patterns in Table 3. The implicit
event is deduced from the explicit events by using the inference rules 1 to 3 in Table 4.
TFBS stands for TranscriptionFactorBindingSiteOfDNA. Figure 1b shows that the explicit events
of the two sentences are combined to deduce the implicit event. Rule 4 in Table 4 is used
for the deduction. The overall workflow of the system is depicted in Figure 2.

 No.   Syntactic pattern / Semantic pattern
  1    (expression Noun (of Prep Object:Gene)) /
       <GeneExpression hasPatient=Gene>
  2    (reduction Noun (in Prep Object:Patient)) /
       <RegulatoryProcess hasPatient=Patient
          hasPolarity=“negative”>
  3    (lesion Noun Object:Patient) /
       <RegulatoryProcess hasPatient=Patient
          hasPolarity=“negative”>
  4    (cause Verb Subject:Agent Object:Patient) /
       <RegulatoryProcess hasAgent=Agent
          hasPatient=Patient>
       Table 3. Example patterns
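
As an illustration of how Rules 1 to 3 in Table 4 turn the nested explicit event of Figure 1a into the flat implicit event, the following Python sketch may help. It is not the Prolog implementation used by the system. The class names `Entity` and `Event`, the helper `infer`, and the rule functions are hypothetical. Two assumptions go beyond the text: an event without an explicit polarity (such as the one produced by Pattern 4) is treated as positive, and attributes not mentioned in a rule's conclusion are carried over unchanged. The typing of himA as a Protein in the implicit event of Figure 1a is not modelled.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Entity:
    concept: str  # GRO concept, e.g. "Gene"
    name: str


@dataclass(frozen=True)
class Event:
    concept: str                      # e.g. "RegulatoryProcess"
    agent: Optional["Node"] = None    # hasAgent
    patient: Optional["Node"] = None  # hasPatient
    polarity: Optional[str] = None    # hasPolarity


Node = Union[Entity, Event]


def polarity_sum(p1, p2):
    """NXOR on polarities; a missing polarity is ASSUMED to be positive."""
    p1, p2 = p1 or "positive", p2 or "positive"
    return "positive" if p1 == p2 else "negative"


def is_rp(node):
    return isinstance(node, Event) and node.concept == "RegulatoryProcess"


def rule1(ev):
    # Flatten a RegulatoryProcess nested in the agent slot (Rule 1).
    if is_rp(ev) and is_rp(ev.agent):
        return replace(ev, agent=ev.agent.patient,
                       polarity=polarity_sum(ev.agent.polarity, ev.polarity))
    return None


def rule2(ev):
    # Flatten a RegulatoryProcess nested in the patient slot (Rule 2).
    if is_rp(ev) and is_rp(ev.patient):
        return replace(ev, patient=ev.patient.patient,
                       polarity=polarity_sum(ev.patient.polarity, ev.polarity))
    return None


def rule3(ev):
    # Specialise a RegulatoryProcess whose patient is a GeneExpression (Rule 3).
    if (is_rp(ev) and isinstance(ev.patient, Event)
            and ev.patient.concept == "GeneExpression"):
        return replace(ev, concept="RegulationOfGeneExpression")
    return None


def infer(explicit_events, rules=(rule1, rule2, rule3)):
    """Apply the rules repeatedly until no additional event is generated."""
    known = set(explicit_events)
    agenda = list(explicit_events)
    while agenda:
        ev = agenda.pop()
        for rule in rules:
            new = rule(ev)
            if new is not None and new not in known:
                known.add(new)
                agenda.append(new)
    return known


# Explicit event of Figure 1a, built from Patterns 1 to 4 in Table 3.
expression = Event("GeneExpression", patient=Entity("Gene", "fimA"))
reduction = Event("RegulatoryProcess", patient=expression, polarity="negative")
lesion = Event("RegulatoryProcess", patient=Entity("Gene", "himA"),
               polarity="negative")
cause = Event("RegulatoryProcess", agent=lesion, patient=reduction)

events = infer([cause])
# Keep only events shaped like the database template: a named agent
# regulating a gene expression.
implicit = [e for e in events
            if e.concept == "RegulationOfGeneExpression"
            and isinstance(e.agent, Entity)]
print(implicit[0])
```

In this sketch the two negative polarities (lesion, reduction) cancel out, so the surviving template-shaped event states that himA positively regulates the expression of fimA, in line with the implicit event shown in Figure 1a.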


 No.   Condition(s) ⇒ Conclusion
  1    <RegulatoryProcess hasPolarity=Polarity2
          hasAgent=<RegulatoryProcess
             hasPatient=Patient
             hasPolarity=Polarity1>>
       ⇒ <RegulatoryProcess hasAgent=Patient
          hasPolarity=polarity_sum(Polarity1,Polarity2)>
  2    <RegulatoryProcess hasPolarity=Polarity2
          hasPatient=<RegulatoryProcess
             hasPatient=Patient
             hasPolarity=Polarity1>>
       ⇒ <RegulatoryProcess hasPatient=Patient
          hasPolarity=polarity_sum(Polarity1,Polarity2)>
  3    <RegulatoryProcess
          hasPatient=GeneExpression>
       ⇒ <RegulationOfGeneExpression
          hasPatient=GeneExpression>
  4    <RegulationOfGeneExpression
          hasAgent=TranscriptionFactor
          hasPatient=<GeneExpression
             hasPatient=Gene>>
       <RegulatoryDNARegion hasAgent=Gene
          hasPart=<TFBS
             hasAgent=TranscriptionFactor>>
       ⇒ <RegulationOfTranscription
          hasAgent=TranscriptionFactor
          hasPatient=<Transcription
             hasPatient=Gene>
          hasPhysicalContact=“yes”>
       Table 4. Example inference rules


Named entity recognition
We have adopted a dictionary-based approach for named entity recognition. The dictionary
contains 15,881 gene/protein and operon names of E. coli, including the names of 169 E. coli
TFs, collected from RegulonDB and SwissProt. The recognized names are grounded with UniProt
identifiers and labeled with the relevant GRO concepts among the following: Gene, Protein,
Operon, and TranscriptionFactor.


Parsing
We have utilized Enju, the HPSG parser [15], for syntactic analysis of sentences. While the
Enju parser produces predicate-argument structures, we have developed a module to convert
them into dependency structures and selectively merged the predicate-argument structure into
the dependency structure. We adopted the dependency structure to support the loose matching
of language patterns explained below.


Pattern Matching
To identify the explicit events from sentences, the system utilizes syntactic-semantic paired
patterns, matching the syntactic patterns to the dependency structures and combining the
semantic patterns into a semantic structure.

Each pattern is a pair of a syntactic pattern and a semantic pattern. Syntactic patterns
comply with dependency structures. The leftmost item within a pair of parentheses (e.g. cause
Verb, lesion Noun) is the head of the other items within the parentheses (e.g. Subject:Agent,
Object:Patient). A dependent item may be surrounded by another pair of parentheses, which
forms an embedded structure (e.g. Pattern 1, Pattern 2). The lexical items in the syntactic
patterns are labeled with part-of-speech (POS) tags (e.g. Verb, Noun, Prep), and should be
matched to words with the same POS tags. The dependent items have syntactic constraints that
indicate their roles with respect to their head items (e.g. Subject, Object), and should be
matched to words with those syntactic roles. The dependent items may have semantic variables
(e.g. Agent, Patient, Gene), which indicate the semantics of the dependent items. If the
semantic variable of a dependent item is a concept of GRO (e.g. Gene), the variable should
match a semantic category that is identical to, or a subtype of, the specified concept.

The semantic pattern expresses the semantics of its corresponding syntactic pattern. The
semantic pattern is represented with GRO concepts (e.g. RegulatoryProcess, GeneExpression)
and properties (e.g. hasAgent, hasPatient).

The system tries to match the syntactic patterns to the dependency structures of sentences
in a bottom-up way. For example, it matches Patterns 1 to 4 in Table 3, in that order, to the
dependency structure of Example 1 depicted in Figure 1a. In the process, it considers the
syntactic and semantic constraints of the syntactic patterns. For instance, the item ‘cause’
of the fourth pattern in Table 3 should match the verb ‘cause’ that has both a subject and an
object.

Once a syntactic pattern is successfully matched to a node of a dependency structure, its
corresponding semantic pattern is assigned to the node as one of its semantics. If the
syntactic pattern has dependent items with semantic variables (e.g. Subject:Agent,
Object:Patient), the variables (e.g. Agent, Patient) are replaced with the semantics of the
children of the node that have been matched to the dependent items. In this way, the
semantics of multiple phrases is combined into sentential semantics. In Figure 1a, the small
boxes with dashed lines show the semantics assigned to the internal nodes of Example 1, which
are later combined into the sentential semantics.

Note that the node ‘lesions’ is assigned two pieces of semantics for the two gene names that
are the children of the node (i.e. himA, himD). The explicit textual semantics of Figure 1a
is one of the two, while the other is

a duplicate of Sem1 except that the gene name ‘himA’ is replaced with ‘himD’.

One important feature of the pattern matching is that we loosely match the syntactic patterns
to the dependency structures. For instance, the gene name ‘fimA’ is not a direct child of the
preposition ‘of’, but is matched to the item Object:Gene of the first pattern in Table 3. We
have decided to match a dependent item not only to a direct child of the node matched to the
head item, but also to any descendant of the node. The feature is based on two reasons:
first, it is practically impossible to construct all potential patterns for the event
extraction, though a reasonably large number of patterns for gene regulation have been
accumulated; and second, the lexical entries not matched to any of the patterns for gene
regulation (e.g. ‘sevenfold’, ‘operon’, ‘fusion’) might not affect the extraction of the
events.

This loose matching still works under the following strict conditions: 1) an item with a
syntactic role (e.g. Subject) can be matched only to one of the descendants under the
sub-tree with that syntactic role; 2) once an item is matched to a node, it is not further
matched to the node’s descendants; and 3) the matching does not jump over clausal boundaries
(e.g. ‘which’) and several exceptional words (e.g. ‘except’).


Inference
The inference step transduces explicit textual semantics (or events) into implicit semantics
(or events). It deduces a new specific event instance, if possible, by combining any two or
more general events. The inference module takes as input the explicit events from a text
(i.e. a MEDLINE abstract or a full-text article) identified by the preceding pattern matching
module. It applies to the explicit events the inference rules that reflect common sense
knowledge and domain knowledge, as exemplified in Table 4.

An inference rule has the propositional logic form of P→Q, where P is a set of conditions
and Q is the conclusion. It works with the modus ponens rule (i.e. P, P→Q ⊢ Q). That is, if
all the conditions P of a rule match some of the identified events from a text, the
conclusion Q is instantiated and then added as an additional event of the text. As the input
events are represented with GRO, the inference rules and their resultant events are also
represented with GRO.

We have constructed 28 inference rules for dealing with the compositional structures of gene
regulation events (e.g. Rules 1, 2) and for deducing biological events from the combination
of linguistic events (e.g. Rules 3, 4) by consulting the training corpus and the review paper
[16] (see Table 4).

For example, Rules 1 and 2 flatten, if possible, the compositional structure of event
descriptions. The explicit event in Figure 1a has a cascaded structure with four basic event
instances (i.e. three RegulatoryProcess, one GeneExpression) and is transformed by Rules 1
and 2 to fit the database template that has only two event instances (i.e.
RegulationOfGeneExpression, GeneExpression). Rule 3 deduces the specific event type
RegulationOfGeneExpression from a general type of event (i.e. RegulatoryProcess). Rule 4
reflects the domain knowledge that if a transcription factor both binds to the regulatory
region of a gene and regulates the gene’s expression level, it is a transcriptional regulator
of the gene. Note that the two conditions of Rule 4 can be matched to events from any
sentences; in other words, Rule 4 can merge multiple pieces of evidence from different
sentences into a fact. The function polarity_sum works exactly like the NXOR (Not Exclusive
OR) operation in Boolean logic: two identical polarities yield “positive” and two different
polarities yield “negative”. The rules are repeatedly applied over the explicit events from a
given text until no additional event is generated.

We have implemented a program that converts the inference rules into Prolog code and a
Prolog application that executes the rules over input events. We could not use OWL-DL
reasoners (e.g. Pellet) because of their DL-safe restriction. The DL-safe restriction
requires that all instances referred to by a rule, both in its conditions and in its
conclusion, already exist in the knowledge base [17]. However, the rules for the event
extraction generate new instances of events and event attributes in the conclusions.
Nonetheless, we can still utilize the reasoners to validate the ontology populated with the
extracted events.


Extraction
The system finally selects the events that match given semantic templates among those
resulting from either pattern matching or inference. Table 1 shows the event templates. The
variables are marked with ‘?’ and are matched to the instances of the concepts referred to by
the variables. For example, the variable “?Protein” can be matched to a protein name.
Non-variable concepts and properties are used as semantic restrictions on the events to be
extracted. For example, the last template in Table 1 can be matched to an instance of
NegativeRegulation, which is a child of RegulatoryProcess. In addition, the patient of the
instance should be an instance of CellDeath and the agent can be a gene, where Gene is a
descendant of MolecularEntity.


Authors' contributions
JJK conceived the study, designed and implemented the system, carried out the evaluations and
drafted the manuscript. DRS motivated and coordinated the study


and revised the manuscript.


Acknowledgements
We would like to thank Vivian Lee and Ruth Lovering for their contribution to the event
annotation, and Nick Luscombe and Aswin Seshasayee for helping us learn the domain knowledge
of gene transcription regulation. We would also like to thank the anonymous reviewers for
their valuable comments.


References
 1. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human
    protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004,
    20(5):604–611.
 2. Cimiano P, Reyle U, Saric J: Ontology-based discourse analysis for information
    extraction. Data & Knowledge Engineering 2005, 55:59–83.
 3. Saric J, Jensen LJ, Rojas I: Large-scale extraction of gene regulation for model
    organisms in an ontological context. In Silico Biology 2005, 5:21–32.
 4. Hunter L, Lu Z, Firby J, Baumgartner WA, Johnson HL, Ogren PV, Cohen KB: OpenDMAP: an
    open source, ontology-driven concept analysis engine, with applications to capturing
    knowledge regarding protein transport, protein interactions and cell-type-specific gene
    expression. BMC Bioinformatics 2008, 9:78.
 5. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J: Overview of BioNLP’09 shared task on event
    extraction. In Proceedings of the Workshop on BioNLP: Shared Task 2009:1–9.
 6. Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P: Protein structures and information
    extraction from biological texts: The PASTA system. Bioinformatics 2003, 19:135–143.
 7. Narayanaswamy M, Ravikumar KE, Vijay-Shanker K: Beyond the clause: extraction of
    phosphorylation information from medline abstracts. Bioinformatics 2005,
    21(Suppl. 1):i319–i327.
 8. Culotta A, McCallum A, Betz J: Integrating probabilistic extraction models and data
    mining to discover relations and patterns in text. In Proceedings of Human Language
    Technology Conference of the North American Chapter of the Association of Computational
    Linguistics 2006:296–303.
 9. Beisswanger E, Lee V, Kim JJ, Rebholz-Schuhmann D, Splendiani A, Dameron O, Schulz S,
    Hahn U: Gene Regulation Ontology (GRO): Design Principles and Use Cases. Studies in
    Health Technology and Informatics 2008, 136:9–14.
10. Hahn U, Tomanek K, Buyko E, Kim JJ, Rebholz-Schuhmann D: How Feasible and Robust is the
    Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab and
    Real-Life Conditions. In Proceedings of BioNLP 2009 2009:37–45.
11. Barrell D, Dimmer E, Huntley RP, Binns D, O’Donovan C, Apweiler R: The GOA database in
    2009 – an integrated Gene Ontology Annotation resource. Nucleic Acids Research 2009,
    37:D396–D403.
12. Rodríguez-Penagos C, Salgado H, Martínez-Flores I, Collado-Vides J: Automatic
    reconstruction of a bacterial regulatory network using Natural Language Processing. BMC
    Bioinformatics 2007, 8:293.
13. Kim JJ, Pezik P, Rebholz-Schuhmann D: MedEvi: Retrieving textual evidence of relations
    between biomedical concepts from Medline. Bioinformatics 2008, 24(11):1410–1412.
14. Kim JJ, Chae YS, Choi KS: Phrase-Pattern-based Korean to English Machine Translation
    using Two Level Translation Pattern Selection. In Proceedings of 38th Association for
    Computational Linguistics (ACL) 2000:31–36.
15. Sagae K, Miyao Y, Tsujii J: HPSG parsing with shallow dependency constraints. In
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics,
    Prague, Czech Republic 2007:624–631.
16. Browning DF, Busby SJW: The regulation of bacterial transcription initiation. Nature
    Reviews Microbiology 2004, 2:57–65.
17. Motik B, Sattler U, Studer R: Query answering for OWL-DL with rules. Web Semantics:
    Science, Services and Agents on the World Wide Web 2005, 3:41–60.

