Improving the extraction of complex regulatory events from scientific text by using ontology-based inference

Jung-jae Kim∗1, Dietrich Rebholz-Schuhmann∗2

1 School of Computer Engineering, Nanyang Technological University, Singapore
2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Email: Jung-jae Kim∗ - jungjae.kim@ntu.edu.sg; Dietrich Rebholz-Schuhmann - rebholz@ebi.ac.uk
∗ Corresponding author

Abstract

Background: The extraction of complex events from biomedical text is a challenging task and requires in-depth semantic analysis. Previous approaches associate lexical and syntactic resources with ontologies for the semantic analysis, but fall short of testing the benefits of using domain knowledge.

Results: We developed a system that deduces implicit events from explicitly expressed events by using inference rules that encode domain knowledge. We evaluated the system with the inference module on three tasks: First, when tested against a corpus with manually annotated events, the inference module of our system contributes 53.2% of correct extractions, but does not cause any incorrect results. Second, the system overall reproduces 33.1% of the transcription regulatory events contained in RegulonDB (up to 85.0% precision) and the inference module is required for 93.8% of the reproduced events. Third, we applied the system with minimal adaptations to the identification of cell activity regulation events, confirming that the inference improves the performance of the system also on this task.

Conclusions: Our research shows that inference based on domain knowledge plays a significant role in extracting complex events from text. This approach has great potential in recognizing the complex concepts of such biomedical ontologies as Gene Ontology in the literature.

Background

The task of extracting events from text, called event extraction, is a complex process that requires various semantic resources to decipher the semantic features in the event descriptions. Previous approaches identify and represent the textual semantics of events (e.g. gene regulation, gene-disease relation) by associating lexical and syntactic resources with ontologies [1–5]. We further explore the usage of an ontology for incorporating domain knowledge into an event extraction system.

Events from text that have been hand-curated into relational databases by biologists are actually the products of scientific reasoning supported by the domain knowledge of the biologists. This process of reasoning is based on linguistic evidence of such language patterns as “A regulates B” and “expression of Gene C”, which refer to the basic events of regulation and gene expression. These basic events can be combined into an event with the compositional structure “A regulates (the expression of Gene C)”, where the parentheses enclose the embedded event. In this paper, we call such an event consisting of multiple basic events a complex event and say that it has a compositional structure. We will show that the use of inference based on domain knowledge supports the extraction of complex events from text.

The previous approaches to extracting complex events combine the basic events into compositional structures according to the syntactic structures of source sentences. However, there are two open issues in curating the compositional structures into relational databases. First, the event descriptions in scientific papers are so complicated that it is often required to transform the compositional structures into structures compatible with the semantic templates of the target databases. Second, an event can be represented across sentence boundaries, even in multiple sentences which are not linked via anaphoric expressions (e.g. ‘it’, ‘the gene’).

Biologists with sufficient domain knowledge have little problem in carrying out the two required tasks of structural transformation and evidence combination. Structural transformation is to find an event that has the same meaning as the original event but with a different structure, while evidence combination is to identify a new event that can be deduced from multiple events. We should encode the domain knowledge into a logical form so that our text mining systems can process the compositional structures of events, which are explicitly expressed in text and can be extracted by language patterns, to deduce the events with alternative structures and those implied by a combination of multiple events. We call the explicitly expressed events explicit events and the deduced events implicit events.

Several text mining systems have employed inference based on domain knowledge to fill in event templates [6–8]. They can also go beyond sentence boundaries and combine into an event frame the event attributes collected from different sentences. However, they do not use an ontology for representing the inference rules. Moreover, they primarily deal with flat-structured event frames whose participants are physical entities (e.g. protein, residue). To address these issues, we present a novel approach that represents events and domain knowledge with an ontology and combines basic events into a compositional structure where an event participant can be another simpler event.

We utilize Gene Regulation Ontology (GRO), a conceptual model for the domain of gene regulation [9]. The ontology has been designed for representing the compositional semantics of both biomedical text and the referential databases. GRO provides basic concepts and properties of the domain, which are derived from, and cross-linked to, such biomedical ontologies as Gene Ontology and Sequence Ontology. We use the concepts and properties of GRO to represent the domain knowledge in the form of P→Q implications, which we call inference rules. We also represent explicit events from text with GRO and apply modus ponens to the inference rules and the explicit events to deduce implicit events.

We implemented a system of event extraction with the proposed inference module and evaluated it on three tasks, reporting that the inference significantly improves the system performance.

Results

We performed three evaluations to test our system. Each evaluation takes two steps to answer the following two questions, respectively: 1) How well does the system with the inference module extract events from text and 2) how much does the inference module contribute to the event extraction? First, we ran the system on a manually annotated corpus to estimate the performance of the system. Second, we used the system for a real-world task of populating RegulonDB, the referential database of the E. coli transcription regulatory network, to test the robustness of the system. The first two evaluations are based on the corpora used for our previously reported experiments [10]. Finally, we applied the system to a related task of extracting regulatory events on cell activities and compared the results with the GOA database [11]. While the first two evaluation tasks focus on E. coli, a prokaryotic model organism, the last task deals with human genes and cells.

Table 1 shows the event templates for the evaluations. The first two evaluations are to extract instances of the first three event templates in the table, while the last evaluation is to extract instances of the last two event templates. Our system deals with four properties of events: 1) agents which bind to gene regulatory regions or control gene expression and cell activities; 2) patients which are regulated by the agents; 3) polarity, which tells whether the agent regulates the patient positively or negatively; and 4)
physical contact, which indicates whether the agent regulates the patient directly by binding or indirectly through other agents. Since the three evaluations only consider the agents and patients, the event templates in Table 1 include only the two properties.

[Table 1. Semantic templates for target events. Columns: Semantic template, Gene Ontology concept.]

Evaluation against event annotation

We evaluated our system first against a manually annotated corpus. The corpus consists of 209 MEDLINE abstracts that contain at least one E. coli transcription factor (TF) name. Two curators have annotated E. coli gene regulatory events on the corpus and have agreed on the final release of the annotated corpus, which is available online at http://www.ebi.ac.uk/~kim/eventannotation/ (see [10] for details, including inter-annotator agreement).

We randomly divided the corpus into two sets: One for system development (i.e. training corpus) and the other for system evaluation (i.e. test corpus). The training corpus, consisting of 109 abstracts, has 250 events annotated, while the test corpus, consisting of 100 abstracts, has 375 events annotated. We manually constructed language patterns and inference rules, based on the training corpus and a review paper (see the Methods section for details).

The system successfully extracted 79 events from the test corpus (21.1% recall) and incorrectly produced 15 events (84.0% precision). We consider an extracted event as correct if its two participants and their roles (i.e. agent, patient) are correctly identified, following the evaluation criteria of the previous approaches [3, 12]. Among the 79 events, the system has correctly identified the polarity of 46 events (58.2% precision) and the physical contact of 51 events (64.6% precision), while these two features are not considered for estimating the system [...]

To understand the contribution of the inference on [...] contributes 53.2% of the correct results. In addition, [...] We have further focused on the events whose agents [...] among them (21.6% recall) and incorrectly produced 6 [...]

We analyzed the errors of the system as follows: The false positives, in total 15 errors, are mainly due to the inappropriate application of the loose pattern matching method (7 errors) (see the Methods section for details). The other causes include parse errors (2), the neglect of negation (1), and an error in conversion from predicate argument structure to dependency structure (1). These results of error analysis indicate that the three incorrect events, which were extracted by the system with the inference module, are actually due to the incorrect outputs of the prior modules (e.g. pattern matching) passed to the inference module. In short, the inference module caused no incorrect results.

We also analyzed the false negatives. We found that 29.7% of the missing events (88/296) are due to the deficiency of the gene name dictionary and that 30.0% (68/296) are due to the lack of anaphora resolution. The rest of the missing events (40.3%) are thus dependent upon pattern matching and inference. It is hard to distinguish errors by pattern matching from those by the inference, because the inference module takes into consideration all semantics from an entire document (i.e. MEDLINE abstract) for the evidence combination. Therefore, the inference together with the pattern matching affects at most 40% of the false negatives.

Evaluation against RegulonDB

We tested the system against the real-world task of populating RegulonDB with E. coli transcriptional regulatory events from the literature. We used four corpora that are relevant to E. coli transcription regulation [10]: 1) the regulon.abstract corpus with 2,704 MEDLINE abstracts which are references of RegulonDB, 2) the regulon.fulltext corpus with the fulltexts of 436 references in RegulonDB, 3) the ecoli-tf.abstract corpus with 4,347 MEDLINE abstracts that contain at least one E. coli TF name, and 4) the ecoli-tf.fulltext corpus with the fulltexts of 1,812 papers among those in the ecoli-tf.abstract corpus.

We have measured the performance of the system for this evaluation task as follows: The precision is measured as the percentage of events found in RegulonDB among the unique events extracted by the system, while the recall is the percentage of the successfully extracted events among those curated in RegulonDB. The version of RegulonDB used for the evaluation is 6.2, containing 4,579 E. coli genes, 169 TFs, and 3,590 unique gene regulation events. This evaluation only considers events with TFs as agents because of the purpose of populating RegulonDB. The overall performance is as follows: F-score 0.44, precision 66.6%, and recall 33.1%. Table 2 shows the evaluation results over each test corpus, where the performance of the system without the inference is displayed within pairs of parentheses.

Corpus               Recall          Precision        F-score
ecoli-tf.abstract    22.4% (0.3%)    77.2% (50.0%)    0.35 (0.01)
ecoli-tf.fulltext    24.0% (1.5%)    67.1% (76.1%)    0.35 (0.03)
regulon.abstract     17.1% (0.1%)    85.0% (80.0%)    0.28 (0.00)
regulon.fulltext     14.1% (1.2%)    74.0% (91.7%)    0.24 (0.02)
Total                33.1% (2.1%)    66.6% (79.6%)    0.44 (0.04)

Table 2. Evaluation against RegulonDB

Additionally, we analyzed the effect of event types. The precision for the events of the type “regulation of transcription” is 85%, higher than that of [12] (77% precision), while the overall precision (67%) is predictably lower than that since the system of [12] is developed specifically for extracting regulatory events on gene transcription. We included the events of the other two types, which are hypernyms of “regulation of transcription”, into the result set for the evaluation, because of the low recall for the events of “regulation of transcription” (5%). The overall recall (33%) is still lower than that of [12] (45% recall) because of the small size of the regulon.fulltext corpus (436 fulltexts). Note that [12] extracted 42% of RegulonDB events from 2,475 fulltexts of RegulonDB references. We plan to analyze a larger number of fulltexts in the future.

It is remarkable that the inference is indispensable for extracting 93.8% of the RegulonDB events that are extracted by our system from the corpora. In contrast, the inference module is involved in the extraction of only 3.2% of the false negative events. The percentage 93.8% is much higher than the 53.2% of the first evaluation. The difference may be due to the fact that this second evaluation only counts unique events, while the first evaluation against the event annotations counts all extracted event instances. If so, these results may indicate that only a small number of well-known events are frequently mentioned in papers in concise language forms, and are thus extracted by language patterns even without the help of inference, and that the rest of the events are expressed in papers with the detailed procedures of experiments which led to the discovery of the events.

Adaptation for regulation of cell activities

Rule-based systems are criticized for being too specific to the domains for which they have been developed, so much so that they cannot be straightforwardly adapted for other domains. To test the adaptability of our system, we have applied it to a related topic: Regulation of cell activities.

The goal of this new task is to populate the GOA [11], concerning two Gene Ontology (GO) concepts: Regulation of cell growth (GO:0001558) (RCG for short) and regulation of cell death (GO:0031341) (RCD for short). GOA is a database which provides GO annotations to proteins. In short, the task is to identify the proteins that can be annotated with the two GO concepts. The semantic templates of the two event types are defined in Table 1.

The adaptation included only the following work: We manually collected keywords of the concepts ‘growth’ and ‘death’ from WordNet and constructed 40 patterns for the keywords by using MedEvi [13]. As candidate agents, we collected human gene/protein names from UniProt. We also collected cell type names from MeSH. These are newly built resources that were not required for the first two evaluation tasks. Existing language patterns and inference rules, for example for the concept ‘regulation’, were reused. We have not used any training corpus to further adjust the system to the new task.

We constructed a test corpus consisting of 13,136 abstracts by querying PubMed with two MeSH terms “Cell Death” and “Cell Enlargement”. The system with the inference module extracted 244 unique UniProt proteins associated with RCG events and 266 unique proteins associated with RCD events from the corpus. This evaluation also uses the two measures: Precision, the percentage of unique proteins found in GOA among the extracted proteins, and recall, the percentage of extracted proteins among the protein records in GOA. GOA contains 16 proteins among the 244 proteins of RCG events (6.6% precision) and 100 proteins among the 266 proteins of RCD events (37.6% precision). Currently (2010 July), the GOA has 155 proteins associated with RCG (10.3% recall) and 908 proteins associated with RCD (11.0% recall). These results show that our system can be applied to a related task with minimal adaptations.

We also tested the system without the inference module against the cell corpus. It identifies 193 proteins associated with RCG events and 198 proteins associated with RCD events. GOA contains 13 proteins among the 193 proteins of RCG events (6.7% precision) and 78 proteins among the 198 proteins of RCD events (39.4% precision). The precision hardly changes when the system runs without the inference module, while the recall drops by about 20%. This finding is similar to what we found from the results of the second evaluation, in that the precision is independent of the inference, while the recall drops significantly without the inference module. But the relatively smaller drop of recall for the new task may indicate that the inference rules developed for the first two evaluations have less effect on the third evaluation than on the other two evaluations.

We have manually inspected 20 of the proteins that are extracted by our system but not found in GOA, for each event type. Among the 20 ‘false positive’ proteins of the RCD concepts, we found evidence that can support the association of 15 proteins with RCD concepts (75%). This means that the real precision can go up to 80% and, more importantly, that we can identify new protein instances of GO concepts by using our system. Among the 20 ‘false positive’ proteins of the RCG concepts, we located evidence only for 8 proteins (40%). After careful inspection, we realized that the precision of the RCG-related proteins is much lower than that of the RCD-related proteins because the language patterns for RCG events, which we collected from WordNet, are not specific to cell size growth, but may also refer to cell proliferation and development, which should be linked to the other GO concepts “cell proliferation” (GO:0008283) and “cell development” (GO:0048468). The lack of a training corpus led to this problem, and so we plan to extend the experiment to other GO concepts, establishing training corpora for the concept identification in text.

Discussion

As explained in the Background section, the inference rules we introduce in this paper are to deduce implicit events from explicit events. Note that unless the explicit events contain enough evidence for an implicit event, we cannot deduce the implicit event from the explicit events. In other words, the implicit events are alternative representations of the extracted information, where the implicit events do not convey new information compared to the explicit events. The performance comparison between the system with the inference and that without the inference is, in a sense, to see which representations better fit the target templates, where the inference rules are designed to produce results that better match the target templates.

The previous event extraction systems often utilize rules or models whose semantics directly reflect the target event templates, thus embedding linguistic and domain knowledge together. In contrast, our approach of separating the inference rules from the linguistic resources has the following characteristics: 1) We can represent the semantics of sentences, which are relevant to event extraction, according to the syntactic structures of the sentences, independently of target semantic templates [5]; 2) we can construct language patterns for event extraction without respect to target semantics, considering the compositional aspect of events, which has led to the development of phrase-level patterns rather than sentence-level or clause-level patterns [14]; and 3) we can add or remove language patterns according to their semantic categories, not worrying about the side-effect of domain-specific patterns, which makes the patterns highly reusable, as shown in the third test case.

Conclusions

We proposed a novel approach to event extraction, using an ontology to represent the semantics of lexical, syntactic, and pragmatic resources. We focused on extracting regulatory events on gene expression and cell activities, which are very important to molecular biology and disease studies. Our system shows the full complexity in the identification of such complex events from the literature and may guide the ontology development to innovative ways of integrating various knowledge resources.

Methods

Our system first recognizes mentions of individual GRO instances in text, which can be the event components. It then combines them into compositional structures of explicit events by using language patterns. The system performs inference based on domain knowledge to deduce implicit events from the explicit events. It finally extracts the events that match pre-defined event templates. Both explicit and implicit events may fit the database event templates.

Figures 1a and 1b show examples of the extracted events. Figure 1a depicts the three types of structures from the input text: Dependency structure, explicit event, and implicit event. An arrow between the syntactic and semantic structures indicates a correspondence link between two structures for a phrase. The explicit event is composed from phrasal structures to sentential structures by using the patterns in Table 3. The implicit event is deduced from the explicit events by [...] for TranscriptionFactorBindingSiteOfDNA. Figure 1b shows that the explicit events of the two sentences are [...] 4 is used for the deduction. The overall workflow of the [system is shown in Figure 2].

[Figure 1: Examples of event extraction. (a) Example 1, with panels Dependency Structure, Explicit Event, and Implicit Event (fit for Database Template). Input text: “In addition, both himA and himD lesions caused a sevenfold reduction in expression of a phi(fimA-lacZ) operon fusion in strains in which fimA was locked in the on phase.” (b) Example 2. Input text: “The function of OmpR appears to be the enhancement of a basal level of ompC expression. From the results of our experiments, the site of action of OmpR was [...]”]

[Figure 2: System workflow. Input text; Named entity recognition (Lexicon); Named entity annotated text; Parsing (Parser); Dependency structure; Pattern matching (Syntactic-Semantic Paired Patterns, Table 3; GRO); (Explicit) Textual semantics represented with GRO; Inference (Inference Rules, Table 4); (Explicit+Implicit) Textual semantics; Extraction (Database Template Semantics, Table 1); Events of pre-defined types.]

No.  Syntactic pattern / Semantic pattern
1    (expression Noun (of Prep Object:Gene)) / [...]
2    (reduction Noun (in Prep Object:Patient)) / [...]
3    (lesion Noun Object:Patient) / [...]
4    (cause Verb Subject:Agent Object:Patient) / [...]

Table 3. Example patterns

[Table 4. Example inference rules. Columns: No., Condition(s) ⇒ Conclusion. Rules 1 to 4; the only surviving fragment is hasPhysicalContact=“yes” in the conclusion of Rule 4.]

Named entity recognition

We have adopted a dictionary-based approach for named entity recognition. The dictionary contains 15,881 gene/protein and operon names of E. coli, including the names of 169 E. coli TFs, collected from RegulonDB and SwissProt. The recognized names are grounded with UniProt identifiers and labeled with relevant GRO concepts among the following: Gene, Protein, Operon, and TranscriptionFactor.

Parsing

We have utilized Enju, the HPSG parser [15], for syntactic analysis of sentences. While the Enju parser produces predicate-argument structures, we have developed a module to convert them into dependency structures and selectively merged the predicate-argument structure into the dependency structure. We have identified the dependency structure for the loose matching of language patterns explained below.

Pattern Matching

To identify the explicit events from sentences, the system utilizes syntactic-semantic paired patterns, matching the syntactic patterns to the dependency structures [...] Each pattern is a pair of a syntactic pattern and a [...] surrounded by another pair of parentheses, which forms [...] indicate their roles with respect to their head items (e.g. [...] indicate the semantics of the dependent items. If the se[...] (e.g. Gene), the variable should match a semantic category [...] corresponding syntactic pattern. The semantic pattern is represented with GRO concepts (e.g. RegulatoryProcess, GeneExpression) and properties (e.g. hasAgent, hasPatient).

The system tries to match the syntactic patterns to the dependency structures of sentences in a bottom-up way. For example, it matches from Pattern 1 to Pattern 4 in Table 3 to the dependency structure of the example (1) depicted in Figure 1a. In the process, it considers the syntactic and semantic constraints of the syntactic patterns. For instance, the item ‘cause’ of the fourth pattern in Table 3 should match the verb ‘cause’ that has both a subject and an object.

Once a syntactic pattern is successfully matched to a node of the dependency structure, its corresponding semantic pattern is assigned to the node as one of its semantics. If the syntactic pattern has dependent items with semantic variables (e.g. Subject:Agent, Object:Patient), the variables (e.g. Agent, Patient) are replaced with the semantics of the children of the node that have been matched to the dependent items. In this way, the semantics of multiple phrases is combined into sentential semantics. In Figure 1a, the small boxes with dashed lines show the semantics assigned to the internal nodes of the example (1), which are later combined into the textual sentential semantics.

Note that the node ‘lesions’ is assigned two pieces of semantics for the two gene names that are the children of the node (i.e. himA, himD). The explicit textual semantics of Figure 1a is one of the two, while the other is a duplicate of Sem1 except that the gene name ‘himA’ is replaced with ‘himD’.

One important feature of the pattern matching is that we loosely match the syntactic patterns to the dependency structures. For instance, the gene name ‘fimA’ is not a direct child of the preposition ‘of’, but is matched to the item Object:Gene of the first pattern in Table 3. We have decided to match a dependent item not only to a direct child of the node matched to the head item, but also to any descendant of the node. The feature is based on two reasons: First, it is practically impossible to construct all potential patterns for the event extraction, though a reasonably large number of patterns for gene regulation have been accumulated; and second, the lexical entries not matched to any of the patterns for gene regulation (e.g. ‘sevenfold’, ‘operon’, ‘fusion’) might not affect the extraction of the events.

This loose matching still works under the following strict conditions: 1) An item with a syntactic role (e.g. Subject) can be matched to one of the descendants under the sub-tree with the syntactic role; 2) once an item is matched to a node, it is not further matched to the node’s descendants; and 3) it does not jump over clausal boundaries (e.g. ‘which’) and several exceptional words (e.g. ‘except’).

Inference

The inference step is to transduce explicit textual semantics (or events) into implicit semantics (or events). It deduces a new specific event instance, if possible, by combining any two or more general events. The inference module takes as input the explicit events from a text (i.e. a MEDLINE abstract, a fulltext) identified by the previous module of pattern matching. It applies to the explicit events the inference rules that reflect common sense knowledge and domain knowledge, as exemplified in Table 4.

An inference rule has the propositional logic form of P→Q, where P is a set of conditions and Q is the conclusion. It works with the modus ponens rule (i.e. P, P→Q ⊢ Q). That is, if all the conditions P of a rule match some of the identified events from a text, the conclusion Q is instantiated and then added as an additional event of the text. As the input events are represented with GRO, the inference rules and their resultant events are also represented with GRO.

We have constructed 28 inference rules for dealing with the compositional structures of gene regulation events (e.g. Rules 1, 2) and for deducing biological events from the combination of linguistic events (e.g. Rules 3, 4) by consulting the training corpus and the review paper [16] (see Table 4).

For example, Rules 1 and 2 flatten, if possible, the compositional structure of event descriptions. The explicit event in Figure 1a has a cascaded structure with four basic event instances (i.e. three RegulatoryProcess, one GeneExpression) and is transformed by Rules 1 and 2 to fit the database template that has only two event instances (i.e. RegulationOfGeneExpression, GeneExpression). Rule 3 deduces the specific event type RegulationOfGeneExpression from a general type of event (i.e. RegulatoryProcess). Rule 4 reflects the domain knowledge that if a transcription factor both binds to the regulatory region of a gene and regulates the gene’s expression level, it is the transcriptional regulator of the gene. Note that the two conditions of Rule 4 can be matched to events from any sentences; in other words, Rule 4 can merge multiple pieces of evidence from different sentences into a fact. The function polarity sum works exactly like the NXOR (Not Exclusive OR) operation in Boolean logic. The rules are repeatedly applied over the explicit events from a given text until no additional event is generated.

We have implemented a program that converts the inference rules into Prolog code and a Prolog application that executes the rules over input events. We could not use the OWL-DL reasoners (e.g. Pellet) because of the DL-safe restriction of the reasoners. The DL-safe restriction assumes that all instances of rules, both in conditions and in conclusions, should be available in the knowledge base [17]. Unfortunately, however, the rules for the event extraction generate new instances of events and event attributes in the conclusions. Nonetheless, we can still utilize the reasoners to validate the ontology populated with the extracted events.

Extraction

The system finally selects the events that match given semantic templates among those resulting from either pattern matching or inference. Table 1 shows the event templates. The variables are marked with ‘?’ and are matched to the instances of the concepts referred to by the variables. For example, the variable “?Protein” can be matched to a protein name. Non-variable concepts and properties are used as semantic restrictions on the events to be extracted. For example, the last template in Table 1 can be matched to an instance of NegativeRegulation, which is a child of RegulatoryProcess. In addition, the patient of the instance should be an instance of CellDeath and the agent can be a gene, where Gene is a descendant of MolecularEntity.

Authors' contributions

JJK conceived the study, designed and implemented the system, carried out the evaluations and drafted the manuscript. DRS motivated and coordinated the study and revised the manuscript.

Acknowledgements

We would like to thank Vivian Lee and Ruth Lovering for their contribution to the event annotation, and Nick Luscombe and Aswin Seshasayee for helping us learn the domain knowledge of gene transcription regulation. We would also like to thank the anonymous reviewers for their valuable comments.

References

1. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004, 20(5):604–611.
2. Cimiano P, Reyle U, Saric J: Ontology-based discourse analysis for information extraction. Data & Knowledge Engineering 2005, 55:59–83.
3. Saric J, Jensen LJ, Rojas I: Large-scale extraction of gene regulation for model organisms in an ontological context. In Silico Biology 2005, 5:21–32.
4. Hunter L, Lu Z, Firby J, Baumgartner WA, Johnson HL, Ogren PV, Cohen KB: OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics 2008, 9:78.
5. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J: Overview of BioNLP’09 shared task on event extraction. In Proceedings of the Workshop on BioNLP: Shared Task 2009:1–9.
6. Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P: Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics 2003, 19:135–143.
7. Narayanaswamy M, Ravikumar KE, Vijay-Shanker K: Beyond the clause: extraction of phosphorylation information from medline abstracts. Bioinformatics 2005, 21(Suppl. 1):i319–i327.
8. Culotta A, McCallum A, Betz J: Integrating probabilistic extraction models and data mining to discover relations and patterns in text. In Proceedings of Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics 2006:296–303.
9. Beisswanger E, Lee V, Kim JJ, Rebholz-Schuhmann D, Splendiani A, Dameron O, Schulz S, Hahn U: Gene Regulation Ontology (GRO): Design Principles and Use Cases. Studies in Health Technology and Informatics 2008, 136:9–14.
10. Hahn U, Tomanek K, Buyko E, Kim JJ, Rebholz-Schuhmann D: How Feasible and Robust is the Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab and Real-Life Conditions. In Proceedings of BioNLP 2009 2009:37–45.
11. Barrell D, Dimmer E, Huntley RP, Binns D, O’Donovan C, Apweiler R: The GOA database in 2009 – an integrated Gene Ontology Annotation resource. Nucleic Acids Research 2009, 37:D396–D403.
12. Rodríguez-Penagos C, Salgado H, Martínez-Flores I, Collado-Vides J: Automatic reconstruction of a bacterial regulatory network using Natural Language Processing. BMC Bioinformatics 2007, 8:293.
13. Kim JJ, Pezik P, Rebholz-Schuhmann D: MedEvi: Retrieving textual evidence of relations between biomedical concepts from Medline. Bioinformatics 2008, 24(11):1410–1412.
14. Kim JJ, Chae YS, Choi KS: Phrase-Pattern-based Korean to English Machine Translation using Two Level Translation Pattern Selection. In Proceedings of 38th Association for Computational Linguistics (ACL) 2000:31–36.
15. Sagae K, Miyao Y, Tsujii J: HPSG parsing with shallow dependency constraints. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic 2007:624–631.
16. Browning DF, Busby SJW: The regulation of bacterial transcription initiation. Nature Reviews Microbiology 2004, 2:57–65.
17. Motik B, Sattler U, Studer R: Query answering for OWL-DL with rules. Web Semantics: Science, Ser[...]
vices and Agents on the World Wide Web 2005, 3:41–60. 44