Extracting Evidence Fragments for Distant Supervision of Molecular Interactions

Gully A. Burns1, Pradeep Dasigi2, and Eduard H. Hovy2

1 USC Information Sciences Institute, Marina del Rey, CA 90292, USA
burns@isi.edu
2 Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{pdasigi,hovy}@cs.cmu.edu

Abstract. We describe a methodology for automatically extracting ‘evidence fragments’ from a set of biomedical experimental research articles. These fragments provide the primary description of the evidence presented in the papers’ figures. They elucidate the goals, methods, results, and interpretations of the experiments that support the original scientific contributions of the study being reported. In this paper, we describe our methodology and showcase an example data set based on the European Bioinformatics Institute’s INTACT database (http://www.ebi.ac.uk/intact/). Using figure codes as anchors, we linked evidence fragments to INTACT data records as an example of distant supervision, so that INTACT’s preexisting, manually curated structured interaction data could act as a gold standard for machine reading experiments. We report preliminary baseline event extraction measures on this collection based on a publicly available machine reading system (REACH). We use semantic web standards for our data and provide open access to all source code.

Keywords: Machine Reading, Molecular Interactions, Biomedical Informatics, Discourse Analysis

1 Introduction

The biomedical literature consists of tens of millions of published articles [1], and there are thousands of informatics systems that catalog both published and unpublished scientific work [2]. These databases are typically constructed manually, so there is a very strong need to automate the extraction of information from research articles using machine reading approaches. We explore whether extracting and representing primary experimental evidence provides a more accurate and better-scoped target for machine reading than attempting to read all text in the body of an article with equal priority [3]. This report provides the starting point of our investigation by identifying which fragments of an experimental article’s narrative specifically describe the experimental contribution of that article.

In order to develop machine reading systems, we require training data that links the text of research papers to structured semantic representations of the knowledge content. We describe a general method for creating annotated corpora based on distant supervision, creating links between text describing research evidence and previously curated database records. We use figure references in the text of articles to create this link between text and data (Figure 1).

The European Bioinformatics Institute’s (EBI) INTACT database describes molecular interactions (binding events where two molecules join to form a complex). INTACT links each figure reference (e.g., 1a, 2b, 5f) directly to database records [4]. Figure 1 illustrates how evidence fragments might then be linked to database records via their common figure reference.

[Figure 1 schematic: Evidence Fragment → Figure Reference → Database Record (INTACT) → Semantic Representation (BioPAX)]

Fig. 1. Figure references can link relevant fragments from full-text primary research articles to database records and derived semantic representations.

We automated this linkage between database records and evidence fragments to provide a cost-effective way of creating corpora. We applied an open-source event extraction method for signaling pathway events (REACH) [5] to develop a baseline for detailed semantic extraction of this text.
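To make the linkage in Figure 1 concrete, the following minimal sketch shows one way to represent the join between an evidence fragment and a curated record. It is purely illustrative: the class and field names (e.g., EvidenceFragment, intact_id) are hypothetical placeholders of our own choosing, not the schema used by INTACT or BioPAX.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvidenceFragment:
    """A contiguous text span describing the evidence behind one subfigure."""
    pmc_id: str                 # PubMed Central article identifier
    figure_ref: str             # normalized subfigure label, e.g. "1a"
    clauses: List[str]          # clause texts within the fragment boundaries
    discourse_tags: List[str]   # one discourse tag per clause, e.g. "result"

@dataclass
class DistantLink:
    """Joins a fragment to a curated interaction via the shared figure label."""
    fragment: EvidenceFragment
    intact_id: str              # accession of the INTACT interaction record
```

Because the figure label appears both in the article text and in the database record, the join requires no manual annotation; this is the essence of the distant supervision strategy used here.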
2 Related Work

In biomedicine, distant supervision was originally used to facilitate entity and relation extraction from text using structured data [6]. Previous efforts center on record linkage between domain-specific biomedical entities (such as proteins and residues; see [7]). The method we use to tag discourse elements is simpler than general discourse parsing methods (such as Rhetorical Structure Theory (RST) [8]), which might be applied to open-domain text. More precisely, our work mirrors the “Argumentative Zoning” of Teufel et al., where classifiers act on sentences across the entire narrative scope of a paper [9]. We seek a more restricted focus in order to isolate a paper’s primary experimental contribution for subsequent extraction. Aydin et al. describe a closely related study in which they classify passages describing experimental methods with PSI-MI 2.5 terms (the same terminology used in INTACT) [10]. They focus on methodological text, and the size of their annotated corpus (30 papers) reflects the effort required to construct annotated corpora for information extraction. We suggest that our use of distant supervision could increase the size of their working corpus.

3 Methods

3.1 INTACT Data and Text Preprocessing

We used only INTACT papers that had been designated as part of the open access subset of PubMed Central’s online digital collection. Our INTACT data contains 13,991 papers, of which 1,063 were available for use. The INTACT data was downloaded and cross-referenced to the open access publications by figure reference to yield 899 papers containing 6,320 individually reported molecular interaction reactions. To split sentences into their constituent clauses, we computed dependency parses with the Stanford Lexicalized Parser.
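As a rough illustration of this preprocessing step, the sketch below segments sentences into clauses at verbs heading clausal dependency relations. We use the stanza library here as a stand-in for the Stanford Lexicalized Parser, and the relation set CLAUSAL is an illustrative heuristic rather than the exact rule set used in our pipeline.

```python
# Illustrative clause segmentation from dependency parses (requires a one-time
# model download: stanza.download("en")).
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

# heuristic set of clause-introducing dependency relations (our assumption)
CLAUSAL = {"advcl", "ccomp", "xcomp", "csubj", "conj"}

def split_clauses(text):
    clauses = []
    for sent in nlp(text).sentences:
        # begin a new clause at each verb that heads a clausal dependency
        starts = {0} | {w.id - 1 for w in sent.words
                        if w.deprel in CLAUSAL and w.upos == "VERB"}
        bounds = sorted(starts) + [len(sent.words)]
        for i, j in zip(bounds, bounds[1:]):
            clauses.append(" ".join(w.text for w in sent.words[i:j]))
    return clauses
```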
3.2 Science Discourse Tagger - Neural Net Classifier

We used the Science Discourse Tagger (SciDT) [11] to annotate individual sub-sentence clauses from scientific papers with one of eight discourse tags: ‘fact’, ‘problem’, ‘hypothesis’, ‘goal’, ‘method’, ‘result’, ‘implication’, and ‘none’ [12]. Training data was manually compiled from 20 papers. We ran release v0.0.2 from the SciDT and SciDT Pipeline GitHub repositories.

3.3 Linking Figure References to Surrounding Text

We used a rule-based approach to locate the sentence boundaries of text pertaining to specific subfigures. Figure 2 shows an example from [13]: the delineation of the text passages pertaining to the evidence presented in subfigures 1A and 1B, together with the first sentence of the description of 1C. Color coding of sentences shows the discourse tag associated with each clause.

Fig. 2. Evidence text fragments referring to subfigures 1A, 1B, and 1C of [13].

Informally, the algorithm to extract these fragments is as follows. For each subfigure reference in the text, we first scan backwards from the clause containing the figure reference mention (e.g., ‘Fig. 1 A’) for the start of the evidence fragment. We assert the presence of a fragment start boundary between consecutive sentences S1 and S2 (i.e., S2 is the first sentence of the evidence fragment) if any of the following conditions is met:

a. Sentence S1 contains either (a) clauses tagged as ‘hypothesis’, ‘problem’, or ‘fact’, or (b) clauses tagged as ‘result’ or ‘implication’ that also contain external citations; and sentence S2 contains either (a) clauses tagged as ‘goal’ or ‘method’, or (b) clauses tagged as ‘result’ or ‘implication’ with no external citations.
b. S1 and S2 both contain references to subfigures, and these sets of references are entirely disjoint (e.g., S1 refers to ‘Fig. 1C’ and S2 refers to ‘Fig. 1D, 1E and 1F’).
c. S2 is a section heading, indicating that the S1/S2 boundary marks a transition between sections.

Similarly, we repeated this process by scanning forward from the figure reference mention for the following conditions between consecutive sentences S1 and S2, indicating that S1 was the last sentence of the evidence fragment:

a. Sentence S1 contains only clauses that are tagged as ‘result’ or ‘implication’ without citing external papers, and sentence S2 contains only (a) clauses tagged as ‘goal’, ‘method’, ‘hypothesis’, ‘problem’, or ‘fact’, or (b) clauses tagged as ‘result’ or ‘implication’ with external citations present.

Conditions b. and c. were applied as before to detect the end of evidence fragments.
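The backward scan can be summarized in a few lines of code. The sketch below is a simplified rendering of the rules above, assuming sentences arrive as pre-tagged dicts; the field names are illustrative, the released SciDT pipeline contains the actual implementation, and the forward scan for end boundaries is symmetric.

```python
BACKGROUND = {"fact", "problem", "hypothesis"}
EVIDENCE = {"result", "implication"}

def is_start_boundary(s1, s2):
    """s1, s2: consecutive sentences as dicts with 'tags' (set of discourse
    tags), 'has_citation' (bool), 'fig_refs' (set), and 'is_heading' (bool)."""
    # condition a: S1 reads as background, S2 opens a new experiment
    s1_background = (s1["tags"] & BACKGROUND) or \
                    (s1["tags"] & EVIDENCE and s1["has_citation"])
    s2_experimental = (s2["tags"] & {"goal", "method"}) or \
                      (s2["tags"] & EVIDENCE and not s2["has_citation"])
    if s1_background and s2_experimental:
        return True
    # condition b: the two sentences cite disjoint sets of subfigures
    if s1["fig_refs"] and s2["fig_refs"] and not (s1["fig_refs"] & s2["fig_refs"]):
        return True
    # condition c: a section heading always starts a new fragment
    return s2["is_heading"]

def find_fragment_start(sentences, anchor):
    """Scan backwards from the sentence containing the figure reference."""
    for i in range(anchor, 0, -1):
        if is_start_boundary(sentences[i - 1], sentences[i]):
            return i
    return 0
```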
3.4 Applying the REACH Event Extraction Tool

REACH is an event extraction engine for molecular signaling [5]. We applied REACH to the INTACT open access papers and cross-referenced its output with the specific subfigures referenced by INTACT data records. The only event type in REACH dealing with molecular interactions is ‘Complex Assembly’; we compared these events with the data specified by INTACT records to generate baseline event-extraction statistics.

3.5 Building the Molecular Interaction Evidence Fragment Corpus

We developed an OWL-based implementation of the existing BioC formulation [14] and extended the SciDT pipeline system to export linked data conforming to that model. We also used the ‘Semantic Publishing and Referencing’ (SPAR) ontologies for bibliographic elements and references in both the BioC and BioPAX linked data sets [15]. We used Paxtools [16] to convert INTACT PSI-MI 2.5 data to BioPAX (with a minor adaptation to include figure references in the BioPAX representation of evidence).

4 Results

4.1 Discourse Tagging

In [12], Dasigi et al. reported 5-fold cross-validation accuracy and F-score for SciDT based on a training set of 2,678 clauses over 263 paragraphs from results sections (accuracy = 0.75, F-score = 0.74). We extended this training data over all sections of the paper to yield 654 paragraphs with 6,629 clauses. Of these, 253 paragraphs were from results sections, yielding 2,802 clauses.

4.2 Computing Figure Spans within Documents

Figure 3 illustrates the output of this procedure as a Gantt chart of the spans of subfigures over the clauses in a single paper’s results section. This shows how experiment references punctuate the argument of the paper with factual evidence. It also shows explicitly how a single paper in this domain is structured around a large number of small-scale experiments (23 in this case). We evaluated our methodology on a mixed set of 10 manually annotated open access papers (involving 190 figure references). This evaluation (of correctly identifying the figure reference for a given clause) gave macro-averaged Precision = 0.66 ± 0.02, Recall = 0.87 ± 0.02, and F-score = 0.76 ± 0.01.

Fig. 3. Gantt chart distribution of experimental spans for [13]. Red crosses show positions of subfigure references. Discourse type colors: ‘fact’/‘hypothesis’/‘problem’ = white; ‘goal’ = light gray; ‘method’ = gray; ‘result’ = light blue; ‘implication’ = light green.

4.3 The Molecular Interaction Evidence Fragment Corpus

We have released all data associated with the study on FigShare [17]. The data consists of a compressed archive of individual files for each paper’s evidence fragments and INTACT data records.

4.4 REACH System Output

We ran REACH over all available open access documents in INTACT. Of the 6,320 INTACT records with associated figure references, we identified a ‘Complex Assembly’ event within the sentences our system designated as associated with the record in 2,747 cases (43.47% of records). The most precise measure of event extraction accuracy is based on matching the UniProt identifiers of any proteins described in the extracted REACH event to those of the INTACT data record. REACH was able to precisely reconstruct the INTACT data record to that level of accuracy in only 356 cases (5.6% of records). This provides a baseline measurement for future work.
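This strict-match criterion can be stated compactly. In the sketch below, the field names on the REACH output and the INTACT record are simplified placeholders of our own, not the actual schemas of either resource.

```python
def exact_uniprot_match(reach_event, intact_record):
    """True if the UniProt IDs of the REACH event's participants exactly
    match those curated in the INTACT record (hypothetical field names)."""
    reach_ids = {p["xref_id"] for p in reach_event["participants"]
                 if p.get("xref_namespace") == "uniprot"}
    intact_ids = set(intact_record["participant_uniprot_ids"])
    return bool(reach_ids) and reach_ids == intact_ids
```

Under this criterion a partial match (e.g., only one of two participants correctly grounded) counts as a failure, which partly explains the low 5.6% figure.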
5 Discussion

We have sought to instantiate a novel methodology for distant supervision in biomedical text mining and to provide the community with access to a mid-sized text corpus for future use. Although our event extraction experiments showed poor performance, they provide a baseline for off-the-shelf tools that we expect can be improved upon straightforwardly. We would like to extend this work to argumentation graphs, in which claims may be linked from other parts of papers [18,19]. Developing methods to automatically create such graphs across papers may provide powerful new ways of examining the literature.

Machine reading depends on the natural redundancy of any scientific narrative, in which common assertions are stated and restated in different ways across papers. In aggregate, these systems extract structured data from sentences that cite other work. This is problematic because, when evaluated for correctness, citation statements are often inaccurate [20]. More seriously, citations are both retained and reused within the literature even after the work they cite has been retracted [21]. Thus, a key original focus of this work is on the assertions that summarize the primary findings of a given paper, rather than seeking to use any and all available language for machine reading tasks.

Acknowledgments. This work was funded by the DARPA Big Mechanism program under ARO contract W911NF-14-1-0436. We thank Anita de Waard, Mihai Surdeanu, Clay Morrison, and Hans Chalupsky for their contributions.

References

1. National Library of Medicine (2016). MEDLINE/PubMed Baseline Database Distribution: File Names, Record Counts, and File Size. https://www.nlm.nih.gov/bsd/licensee/2016_stats/baseline_med_filecount.html
2. Galperin, M.Y., Fernandez-Suarez, X.M., and Rigden, D.J. (2017). The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes. Nucleic Acids Res.
3. Burns, G.A.P.C., and Chalupsky, H. (2014). It's All Made Up: Why we should stop building representations based on interpretive models and focus on experimental evidence instead. In Discovery Informatics: Scientific Discoveries Enabled by AI (Quebec City, Quebec).
4. Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., Broackes-Carter, F., Campbell, N.H., Chavali, G., Chen, C., del-Toro, N., et al. (2014). The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358–D363.
5. Valenzuela-Escárcega, M.A., Hahn-Powell, G., Hicks, T., and Surdeanu, M. (2015). A Domain-independent Rule-based Framework for Event Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Software Demonstrations (ACL-IJCNLP 2015), pp. 127–132.
6. Craven, M., and Kumlien, J. (1999). Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (AAAI Press), pp. 77–86.
7. Ravikumar, K., Liu, H., Cohn, J.D., Wall, M.E., and Verspoor, K. (2012). Literature mining of protein-residue associations with graph rules learned through distant supervision. J. Biomed. Semantics 3 (Suppl 3), S2.
8. Mann, W.C., and Thompson, S.A. (1987). Rhetorical Structure Theory: A theory of text organization (USC Information Sciences Institute).
9. Teufel, S., and Kan, M.-Y. (2011). Robust argumentative zoning for sensemaking in scholarly documents. In Advanced Language Technologies for Digital Libraries (Springer), pp. 154–170.
10. Aydin, F., Husunbeyi, Z.M., and Ozgur, A. (2017). Automatic query generation using word embeddings for retrieving passages describing experimental methods. Database (Oxford) 2017.
11. Scientific Discourse Tagger Pipeline Release, https://github.com/BMKEG/sciDT-pipeline/releases/tag/0.0.2
12. Dasigi, P., Burns, G.A.P.C., Hovy, E., and de Waard, A. (2017). Experiment Segmentation in Scientific Discourse as Clause-level Structured Prediction using Recurrent Neural Networks. arXiv:1702.05398, https://arxiv.org/abs/1702.05398
13. Innocenti, M., Tenca, P., Frittoli, E., Faretta, M., Tocchetti, A., Di Fiore, P.P., and Scita, G. (2002). Mechanisms through which Sos-1 coordinates the activation of Ras and Rac. J. Cell Biol. 156, 125–136.
14. BioC Linked Data, http://purl.org/bioc
15. Peroni, S. (2014). The Semantic Publishing and Referencing Ontologies. In Semantic Web Technologies and Legal Scholarly Publishing (Cham: Springer International Publishing), pp. 121–193.
16. Demir, E., et al. (2013). Using biological pathway data with Paxtools. PLoS Comput. Biol. 9, e1003194.
17. Burns, G., Hovy, E.H., and Dasigi, P. (2017). Molecular Interaction Evidence Fragment Corpus. https://doi.org/10.6084/m9.figshare.5007992.v4
18. Clark, T., Ciccarese, P.N., and Goble, C.A. (2014). Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. J. Biomed. Semantics 5, 28.
19. Bolling, C., Weidlich, M., and Holzhütter, H.-G. (2014). SEE: structured representation of scientific evidence in the biomedical domain using Semantic Web techniques. J. Biomed. Semantics 5, S1.
20. Lopresti, R. (2010). Citation accuracy in environmental science journals. Scientometrics 85, 647–655.
21. Bustin, S.A. (2014). The reproducibility of biomedical research: Sleepers awake! Biomolecular Detection and Quantification 2, 35–42.