=Paper= {{Paper |id=Vol-1747/BT102_ICBO2016 |storemode=property |title=Cycles of Scientific Investigation in Discourse - Machine Reading Methods for the Primary Research Contributions of a Paper |pdfUrl=https://ceur-ws.org/Vol-1747/BT102_ICBO2016.pdf |volume=Vol-1747 |authors=Gully A. Burns,Anita de Waard,Pradeep Dasigi,Eduard H. Hovy |dblpUrl=https://dblp.org/rec/conf/icbo/BurnsWDH16 }} ==Cycles of Scientific Investigation in Discourse - Machine Reading Methods for the Primary Research Contributions of a Paper == https://ceur-ws.org/Vol-1747/BT102_ICBO2016.pdf
          Cycles of Scientific Investigation in Discourse
           Machine Reading Methods for the Primary Research Contributions of a Paper
         Gully A. Burns                  Anita de Waard                   Pradeep Dasigi                    Eduard H. Hovy
            ISI, USC              Elsevier Research Data Services            LTI, CMU                          LTI, CMU
    Marina del Rey, CA, USA             Jericho, VT, USA               Pittsburgh, PA, USA                  Pittsburgh, USA
         gully@usc.edu              a.dewaard@elsevier.com             pdasigi@cs.cmu.edu                   hovy@cmu.edu


Abstract: We describe a novel approach to machine reading of the primary scientific literature. We treat a description of an experiment as a discourse, viewing a scientific corpus not merely as a collection of documents, but also as an extended conversation formed by the collective set of experiments, their introductions and interpretations. This paper introduces this approach as a methodology called ‘Cycles of Scientific Investigation in Discourse’ (CoSID). In CoSID, we capture the central conceptual structure of a paper as a series of nested reasoning loops, composed of passages in results sections, which describe individual research findings. We ground our work with a number of worked examples based on data from the MIntAct and Pathway Logic databases, and illustrate the idea in the context of machine-enabled biocuration.¹

Keywords: interpretive framework for experiments, experiment description as discourse, computational language technology

I. THEORETICAL BACKGROUND

All experiments consist of a series of actions performed upon entities, conducted for a reason, ending with a measurement/evaluation of something and an interpretation as conclusion. But people do not conduct experiments in a vacuum. Experiments are formulated to explore possibilities within a larger encapsulating theory, and their conclusions are intended to flesh out the unknown parts of the theory. They can therefore be viewed as ‘interaction turns’ in an ongoing discourse, with internal linkage among corresponding portions (specific goals, hypotheses, conclusions, etc.).

Experiments, by their nature, are specific: actions situated in time and space, performed with physical objects. Theories, in contrast, are by their nature general, intended to apply beyond the particular time and place of the experiment. They employ abstractions that any particular experiment has to instantiate as its artifacts and activities. Since theories are ‘conceptual’ while experiments are ‘practical’ in nature, it may be very difficult for an experiment to serve as an absolute proof for any theory for all time and space.

CoSID (Cycles of Scientific Investigation in Discourse) is a model of experimentational text that takes into account these two points of view. If we postulate that scientific investigation proceeds in cycles of increasing theoretical specificity (each round of experiments serving to inform the next round of conceptual expansion), the CoSID model provides a formalization that abstracts from the text of scientific papers to a set of representations that support cross-paper tracking, comparison of ideas, hypothesis evolution, etc. This facilitates the understanding of how experimentally-founded knowledge is created and developed over time and space by a disjointed scholarly community, through processes of reading, writing, and experimentation.

To capture how experiments are presented in technical publications, we create in CoSID three layers of representation, each being a frame with associated properties:

1. Context: the conceptual framework about some phenomenon. In principle this exists ‘outside’ any particular paper, but for any paper, it provides the framework for all experiments within it (and also forms a localized context for experiments from a single section of a paper). We model this with a computational frame structure that includes slots for hypotheses, pointers to experiments, a description of the overall interrelation of experiments and interpretation, etc.

2. Experiment: a series of physically instantiated activities governed by a goal and hypothesis, resulting in observations and measurements. Generally a technical paper contains many experiments (each possibly only briefly described). Each one explores some specific combination of parameter values, and is modeled by a frame whose slots provide the goal, method, observed results, specific experimental implications, etc.

3. Interpretation: the interpretations drawn from one or more experiments, leading back to the overall interpretation in the Context (above). Each experiment’s local hypothesis makes up a part of the global hypothesis of the Context.

We represent a CoSID frame as a nested structure where a single Context associates with multiple Experiments and is concluded with a single Interpretation. Each CoSID frame is derived from a passage in the results section that points to subfigures that each report individual experiments. Figure 1 shows the application of CoSID to a sample article (pmid: 10533201) where the discourse structure of a single frame (Fig1AB) is explored. This frame consists of 12 clauses moving from facts to methods, results, and interpretation, to inform the frame structure as described above.

II. CORPORA AND DATA

Our overall goal is to produce automatically for a given scientific paper a set of instantiated CoSID frames, all properly connected, that completely and accurately reflect its contents.
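As a rough illustration of what one instantiated frame might look like, the following Python sketch encodes the nesting described above (one Context, several Experiments, one concluding Interpretation). The class names, field names, and the subfigure identifier format are illustrative assumptions based on the slot descriptions in the text, not an actual CoSID schema, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experiment:
    """One experiment, typically reported in a single subfigure."""
    subfigure: str                                   # e.g. "1A" (assumed id format)
    goal: str = ""
    method: str = ""
    results: List[str] = field(default_factory=list)
    implications: List[str] = field(default_factory=list)


@dataclass
class Interpretation:
    """Conclusion drawn from one or more experiments."""
    statement: str
    supported_by: List[str] = field(default_factory=list)  # subfigure ids


@dataclass
class Context:
    """Conceptual framework shared by the experiments of a frame."""
    hypotheses: List[str] = field(default_factory=list)
    description: str = ""


@dataclass
class CoSIDFrame:
    """A single Context, several Experiments, one concluding Interpretation."""
    context: Context
    experiments: List[Experiment]
    interpretation: Interpretation

    def subfigures(self) -> List[str]:
        """Subfigure ids of all experiments in this frame, in order."""
        return [e.subfigure for e in self.experiments]


# Placeholder values only; no content is taken from a real paper.
frame = CoSIDFrame(
    context=Context(hypotheses=["<global hypothesis>"]),
    experiments=[
        Experiment("1A", method="<method>"),
        Experiment("1B", method="<method>"),
    ],
    interpretation=Interpretation("<interpretation>", supported_by=["1A", "1B"]),
)
print(frame.subfigures())
```

Keeping the subfigure identifier on each Experiment is one possible way to preserve the link between a frame and the passage it was derived from.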
¹ This work was funded by the DARPA Big Mechanism program under ARO contract W911NF-14-1-0436.

To this end, we have to perform multiple quite distinct tasks,
including determining the overall goals, background, and hypotheses of the paper, identifying where individual experiment boundaries lie, understanding each of them individually, connecting everything together, and then creating the appropriate interlinked frame structures.

[Figure 1, panel A: Whole Article (e.g., Zhang et al 1999, pmid10544201). Introduction (Context); Methods; Results (Experiment), containing one CoSID frame per figure passage (Fig 1A+B CoSID Frame, Fig 1C CoSID Frame, ..., 20+ CoSID Frames for all experiments), each frame made up of a Context, several Experiments, and an Interpretation; Discussion (Interpretation).]
[Figure 1, panel B: Preliminary CoSID Frame (e.g., 10544201, Fig 1AB). 12 natural language clauses: fact, fact, fact, method, fact, method, result, result, result, result, interpretation, interpretation. Experimental Type: Northern Blot, Total RNA analysis. The diagram connects the numbered clauses 1-12 to the nodes Fact, Method, Result, and Interpretation.]

Figure 1: Applying CoSID frames to a research article. A: Overall structure of frames within the textual narrative, B: Discourse structure within a frame showing transition between discourse types.

This work is performed in the RUBICON project, funded by DARPA’s Big Mechanism program, which is extracting relevant facts from a vast collection of papers about Ras cancers and formulating them to support theoretical model builders, automated reasoners, and actual experimenters [1]. Our contribution is to provide rich contexts in which individual atomic statements about biological entities, extracted by others, can be properly interpreted (for example, as hypothetical or as actual, or as a local interpretation drawn from an experiment, or one drawn from some other work reported).

We focus on the text associated with each subfigure (e.g., Fig. 1A, 3C) and develop classifiers for the type of experiments performed. To test our work we compare to two manually curated models of the data. The Pathway Logic database, curated by the group at SRI International, contains approximately 2,000 papers, of which 76 are open-access. Each data record is assigned one of 33 separate ‘assay types’² (such as ‘coprecipitation’, ‘phosphorylation’, etc.). Similarly, the MIntAct database provides hand-curated records of 37,268 experiments from 14,009 papers, of which 1,063 are available as open access papers [2].

III. WORK TO DATE

Our first step is to delimit each experiment. We accomplish this by processing the caption of each Results section figure. Accuracy within captions is essentially perfect, given helpful phrases like “Figure 2(a) depicts…”. Using this, we search within the Results section text to find a reference to the corresponding portion of the figure, as in “As shown in Fig 2(a),…”. This forms the anchor for a span of text that, we assume, provides details about a single experiment. We trained a Conditional Random Field (CRF) model to assign types to these experiments (from either the PSI-MI 2.5³ or the Pathway Logic typology). Separately, we parse the text into discourse segments, each roughly a clause, and identify for each one a Discourse Segment type, along the lines of [3]. These types include the labels ‘fact’, ‘hypothesis’, ‘problem’, ‘goal’, ‘method’, ‘result’, and ‘interpretation’. As a third component, we are working to identify the theoretical model that underlies each paper, which will form part of the Context frame.

IV. EARLY RESULTS

To date, we have implemented several modules, including:

(A) A caption splitter that uses rules to identify individual experiments inside captions. Performance is >95%.
(B) An experiment delimiter that uses rules to delimit the extent of each experiment description in the Results section of the paper.
(C) An experiment type tagger. We experimented with different numbers of types, sometimes condensing the less-frequent ones together (F1-score: 71%).
(D) A discourse segment type tagger, a trained CRF model to assign a discourse segment tag to each clause (F1-score: 66%).

V. NEXT STEPS

This work is ongoing. After completing the missing components we plan to make available a collection of Ras cancer papers with associated CoSID frames, which we believe will be useful within the FRIES consortium in the Big Mechanism program. FRIES includes groups building systems that extract atomic information about entities and relations from papers about Ras cancer research, individuals creating models of Ras cancer and associated experiments, and groups building automated modeling and reasoning systems.

Some uses for our work include: downweighting the certainty score for assertions that have been tagged as hypotheses, compared to facts; downweighting the assertions from high-level conclusions, as compared to direct experimental findings, since the former may suffer from misconstrual; allowing models to cross-link experiments from different papers when their Experiment frames are similar enough (i.e., they apply the same experimental techniques in the same settings to the same materials); and more.

We welcome suggestions for additional uses and extensions of the CoSID model.

REFERENCES

[1] Cohen, P.R. DARPA’s Big Mechanism program. Phys Biol 12, 045008 (2015).
[2] Orchard, S. et al. The MIntAct project - IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358–363 (2014).
[3] de Waard, A. and Pander Maat, H.L.W. Verb form indicates discourse segment type in biological research papers: Experimental evidence. Journal of English for Academic Purposes 11 (4), pp. 357-366, doi:10.1016/j.jeap.2012.06.002

² http://pl.csl.sri.com/CurationNotebook/pages/Assays.html
³ http://www.psidev.info/node/60
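The caption-to-text anchoring step described in Section III (find the subfigures named in a caption, then locate the first mention of each in the Results text) can be sketched with a single regular expression. This is a minimal illustration under stated assumptions: the pattern, the function names `subfigure_labels` and `find_anchors`, and the sample strings are hypothetical, and only labels written like “Figure 2(a)”, “Fig 2(a)”, or “Fig. 1A” are handled. The rules used in the actual system are not given in the paper.

```python
import re
from typing import Dict, List

# Assumed pattern: "Figure 2(a)", "Fig 2(a)", "Fig. 2(b)", "Fig. 1A".
# Group 1 is the figure number, group 2 the subfigure letter.
SUBFIG = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)\(?([A-Za-z])\)?(?![A-Za-z])")


def subfigure_labels(text: str) -> List[str]:
    """Return normalized labels such as '2A', in order of first mention."""
    labels: List[str] = []
    for m in SUBFIG.finditer(text):
        label = m.group(1) + m.group(2).upper()
        if label not in labels:
            labels.append(label)
    return labels


def find_anchors(caption: str, results_text: str) -> Dict[str, int]:
    """Map each subfigure named in the caption to the character offset
    of its first mention in the Results text."""
    wanted = subfigure_labels(caption)
    anchors: Dict[str, int] = {}
    for m in SUBFIG.finditer(results_text):
        label = m.group(1) + m.group(2).upper()
        if label in wanted and label not in anchors:
            anchors[label] = m.start()
    return anchors


# Hypothetical sample strings, not taken from any real paper.
caption = "Figure 2(a) depicts a Northern blot. Figure 2(b) depicts total RNA."
results = "As shown in Fig 2(a), expression increased. In contrast (Fig. 2(b)), levels were flat."
anchors = find_anchors(caption, results)
print(sorted(anchors))
```

Each returned offset marks where a span describing a single experiment is assumed to be anchored. Deciding where that span ends, which is the job of the experiment delimiter in module (B), is not attempted here.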