Cycles of Scientific Investigation in Discourse
Machine Reading Methods for the Primary Research Contributions of a Paper

Gully A. Burns, ISI, USC, Marina del Rey, CA, USA, gully@usc.edu
Anita de Waard, Elsevier Research Data Services, Jericho, VT, USA, a.dewaard@elsevier.com
Pradeep Dasigi, LTI, CMU, Pittsburgh, PA, USA, pdasigi@cs.cmu.edu
Eduard H. Hovy, LTI, CMU, Pittsburgh, USA, hovy@cmu.edu

Abstract: We describe a novel approach to machine reading of the primary scientific literature. We treat a description of an experiment as a discourse, viewing a scientific corpus not merely as a collection of documents, but also as an extended conversation formed by the collective set of experiments, their introductions and interpretations. This paper introduces this approach as a methodology called 'Cycles of Scientific Investigation in Discourse' (CoSID). In CoSID, we capture the central conceptual structure of a paper as a series of nested reasoning loops, composed of passages in results sections, which describe individual research findings. We ground our work with a number of worked examples based on data from the MIntAct and Pathway Logic databases, and illustrate the idea in the context of machine-enabled biocuration (footnote 1).

Keywords: interpretive framework for experiments, experiment description as discourse, computational language technology

I. THEORETICAL BACKGROUND

All experiments consist of a series of actions performed upon entities, conducted for a reason, ending with a measurement/evaluation of something and an interpretation as conclusion. But people do not conduct experiments in a vacuum. Experiments are formulated to explore possibilities within a larger encapsulating theory, and their conclusions are intended to flesh out the unknown parts of the theory. They can therefore be viewed as 'interaction turns' in an ongoing discourse, with internal linkage among corresponding portions (specific goals, hypotheses, conclusions, etc.).

Experiments, by their nature, are specific: actions situated in time and space, performed with physical objects. Theories, in contrast, are by their nature general, intended to apply beyond the particular time and place of the experiment. They employ abstractions that any particular experiment has to instantiate as its artifacts and activities. Since theories are 'conceptual' while experiments are 'practical' in nature, it may be very difficult for an experiment to serve as an absolute proof for any theory for all time and space.

CoSID (Cycles of Scientific Investigation in Discourse) is a model of text describing experiments that takes both of these points of view into account. If we postulate that scientific investigation proceeds in cycles of increasing theoretical specificity (each round of experiments serving to inform the next round of conceptual expansion), the CoSID model provides a formalization that abstracts from the text of scientific papers to a set of representations that support cross-paper tracking, comparison of ideas, hypothesis evolution, etc. This facilitates the understanding of how experimentally founded knowledge is created and developed over time and space by a disjointed scholarly community, through processes of reading, writing, and experimentation.

To capture how experiments are presented in technical publications, we create in CoSID three layers of representation, each being a frame with associated properties:

1. Context: the conceptual framework about some phenomenon. In principle this exists 'outside' any particular paper, but for any paper, it provides the framework for all experiments within it (and also forms a localized context for experiments from a single section of a paper). We model this with a computational frame structure that includes slots for hypotheses, pointers to experiments, a description of the overall interrelation of experiments and interpretation, etc.

2. Experiment: a series of physically instantiated activities governed by a goal and hypothesis, resulting in observations and measurements. Generally a technical paper contains many experiments (each possibly only briefly described). Each one explores some specific combination of parameter values, and is modeled by a frame whose slots provide the goal, method, observed results, specific experimental implications, etc.

3. Interpretation: the interpretations drawn from one or more experiments, leading back to the overall interpretation in the Context (above). Each experiment's local hypothesis makes up a part of the global hypothesis of the Context.

We represent a CoSID frame as a nested structure where a single Context associates with multiple Experiments and is concluded with a single Interpretation. Each CoSID frame is derived from a passage in the results section that points to subfigures that each report individual experiments. Figure 1 shows the application of CoSID to a sample article (pmid: 10533201) where the discourse structure of a single frame (Fig1AB) is explored. This frame consists of 12 clauses moving from facts to methods, results, and interpretation, to inform the frame structure as described above.

[Figure 1: Applying CoSID frames to a research article. A: Overall structure of frames within the textual narrative. B: Discourse structure within a frame showing transition between discourse types.
Panel A, Whole Article (e.g., Zhang et al 1999, pmid10544201): Introduction (Context); Methods; Results (Experiment), holding one CoSID frame (Context, Experiments, Interpretation) per figure passage, e.g., Fig 1A+B and Fig 1C, with 20+ CoSID frames for all experiments; Discussion (Interpretation).
Panel B, Preliminary CoSID Frame (e.g., 10544201, Fig 1AB; Experimental Type: Northern Blot, Total RNA analysis): 12 natural language clauses tagged fact, fact, fact, method, fact, method, result, result, result, result, interpretation, interpretation.]

II. CORPORA AND DATA

Our overall goal is to produce automatically, for a given scientific paper, a set of instantiated CoSID frames, all properly connected, that completely and accurately reflect its contents. To this end, we have to perform multiple quite distinct tasks, including determining the overall goals, background, and hypotheses of the paper, identifying where individual experiment boundaries lie, understanding each of them individually, connecting everything together, and then creating the appropriate interlinked frame structures.

This work is performed in the RUBICON project, funded by DARPA's Big Mechanism program, which is extracting relevant facts from a vast collection of papers about Ras cancers and formulating them to support theoretical model builders, automated reasoners, and actual experimenters [1]. Our contribution is to provide rich contexts in which individual atomic statements about biological entities, extracted by others, can be properly interpreted (for example, as hypothetical or as actual, or as a local interpretation drawn from an experiment, or one drawn from some other work reported).

We focus on the text associated with each subfigure (e.g., Fig. 1A, 3C, etc.) and develop classifiers for the type of experiments performed. To test our work we compare against two manually curated models of the data. The Pathway Logic database, curated by the group at SRI International, covers approximately 2,000 papers, of which 76 are open access. Each data record is assigned one of 33 separate 'assay types' (footnote 2), such as 'coprecipitation', 'phosphorylation', etc. Similarly, the MIntAct database provides hand-curated records of 37,268 experiments from 14,009 papers, of which 1,063 are available as open access papers [2].

III. WORK TO DATE

Our first step is to delimit each experiment. We accomplish this by processing the caption of each Results section figure. Accuracy within captions is essentially perfect, given helpful phrases like "Figure 2(a) depicts…". Using this, we search within the Results section text to find a reference to the corresponding portion of the figure, as in "As shown in Fig 2(a),…". This forms the anchor for a span of text that, we assume, provides details about a single experiment. We trained a Conditional Random Field (CRF) model to assign types to these experiments (from either the PSI-MI2.5 (footnote 3) or the Pathway Logic typology). Separately, we parse the text into discourse segments, each roughly a clause, and identify for each one a Discourse Segment type, along the lines of [3]. These types include the labels 'fact', 'hypothesis', 'problem', 'goal', 'method', 'result', and 'interpretation'. As a third component, we are working to identify the theoretical model that underlies each paper, which will form part of the Context frame.

IV. EARLY RESULTS

To date, we have implemented several modules, including:

(A) A caption splitter that uses rules to identify individual experiments inside captions. Performance is >95%.
(B) An experiment delimiter that uses rules to delimit the extent of each experiment description in the Results section of the paper.
(C) An experiment type tagger. We experimented with different numbers of types, sometimes condensing the less-frequent ones together (F1-score: 71%).
(D) A discourse segment type tagger, a trained CRF model to assign a discourse segment tag to each clause (F1-score: 66%).

V. NEXT STEPS

This work is ongoing. After completing the missing components we plan to make available a collection of Ras cancer papers with associated CoSID frames, which we believe will be useful within the FRIES consortium in the Big Mechanism program. FRIES includes groups building systems that extract atomic information about entities and relations from papers about Ras cancer research, individuals creating models of Ras cancer and associated experiments, and groups building automated modeling and reasoning systems.

Some uses for our work include: downweighting the certainty score for assertions that have been tagged as hypotheses, compared to facts; downweighting the assertions from high-level conclusions, as compared to direct experimental findings, since the former may suffer from misconstrual; allowing models to cross-link experiments from different papers when their Experiment frames are similar enough (i.e., they apply the same experimental techniques in the same settings to the same materials); and more.

We welcome suggestions for additional uses and extensions of the CoSID model.

REFERENCES

[1] Cohen, P.R. DARPA's Big Mechanism program. Phys Biol 12, 045008 (2015).
[2] Orchard, S. et al. The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358-363 (2014).
[3] de Waard, A. and Pander Maat, H.L.W. Verb form indicates discourse segment type in biological research papers: Experimental evidence. Journal of English for Academic Purposes 11(4), 357-366 (2012). doi:10.1016/j.jeap.2012.06.002

Footnotes

1 This work was funded by the DARPA Big Mechanism program under ARO contract W911NF-14-1-0436.
2 http://pl.csl.sri.com/CurationNotebook/pages/Assays.html
3 http://www.psidev.info/node/60
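
ILLUSTRATIVE SKETCH OF A COSID FRAME

The nested frame described in Section I (a single Context, multiple Experiments, a single Interpretation) can be sketched as a small data structure. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class names (Segment, Experiment, CoSIDFrame), the field names, and the split_passage heuristic are all hypothetical. The heuristic assumes that the Context is everything before the first 'method' clause and that the Interpretation is the trailing run of 'interpretation' clauses. The 12 clause tags and the experiment type are the ones shown for Fig 1AB in Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Discourse segment labels listed in Section III.
SEGMENT_TYPES = {"fact", "hypothesis", "problem", "goal",
                 "method", "result", "interpretation"}


@dataclass
class Segment:
    """One clause-level discourse segment with its type tag (hypothetical)."""
    index: int  # 1-based position of the clause within the passage
    tag: str    # one of SEGMENT_TYPES

    def __post_init__(self):
        if self.tag not in SEGMENT_TYPES:
            raise ValueError(f"unknown segment type: {self.tag}")


@dataclass
class Experiment:
    """Hypothetical Experiment frame: a figure reference, a type, its clauses."""
    figure_ref: str
    experiment_type: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class CoSIDFrame:
    """Hypothetical nested frame: one Context, many Experiments, one Interpretation."""
    context: List[Segment]
    experiments: List[Experiment]
    interpretation: List[Segment]


def split_passage(tags: List[str]) -> Tuple[List[Segment], List[Segment], List[Segment]]:
    """Split tagged clauses into (context, experiment body, interpretation).

    Assumed heuristic, not taken from the paper: the context ends at the first
    'method' clause; the interpretation is the trailing 'interpretation' run.
    """
    segments = [Segment(i, t) for i, t in enumerate(tags, start=1)]
    start = next((k for k, s in enumerate(segments) if s.tag == "method"),
                 len(segments))
    end = len(segments)
    while end > start and segments[end - 1].tag == "interpretation":
        end -= 1
    return segments[:start], segments[start:end], segments[end:]


# Clause tags for the Fig 1AB passage, as listed in Figure 1.
FIG_1AB_TAGS = ["fact", "fact", "fact", "method", "fact", "method",
                "result", "result", "result", "result",
                "interpretation", "interpretation"]

context, body, interpretation = split_passage(FIG_1AB_TAGS)

# The sketch does not attempt to divide the body between subfigures 1A and 1B,
# so the whole body is kept as a single Experiment entry.
frame = CoSIDFrame(
    context=context,
    experiments=[Experiment("Fig 1A+B", "Northern Blot", body)],
    interpretation=interpretation,
)
```

Under this heuristic the Fig 1AB passage yields clauses 1-3 as Context, clauses 4-10 as the experiment body, and clauses 11-12 as Interpretation. A real system would need the experiment delimiter of Section IV to separate the individual subfigure experiments.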