<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Mining bridging the gap between knowledge and text (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2016)</institution>
          ,
          <addr-line>Ershovo</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sophia Ananiadou, NaCTeM, University of Manchester</institution>
        </aff>
      </contrib-group>
      <fpage>140</fpage>
      <lpage>141</lpage>
      <abstract>
        <p>Useful pathway models require a complete and accurate representation of the system, which means that all relevant molecular species must be captured, together with their physical interactions and chemical reactions. Pathway model reconstruction is currently carried out largely by hand by domain experts, who must carefully read the scientific literature in order to retrieve, evaluate, interpret and distil relevant fine-grained statements. Moreover, due to the proliferation of scientific databases and ontologies, discovery of previously unknown knowledge demands that scientists take into account information from many different resources, covering different levels of contextual information (e.g., degree of confidence or certainty expressed towards a finding). Thus, given the highly complex mechanisms involved in pathway models, whose detailed description can only be derived from analysis of heterogeneous, fragmented and incomplete sources, reconstructing pathway models is a slow, difficult and laborious process. Accordingly, there is a need to develop methods that help experts to make sense of the continuously growing body of literature, in order to increase the speed and reliability of knowledge discovery.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>In response, text mining (TM) aims to
automate this process by finding relations (such as
interactions) that hold between concepts of different
types (e.g., genes/proteins, chemical compounds,
metabolites, subcellular components, anatomical entities,
organisms, cell lines, strains, diseases). A large number
of TM methods aim to extract simple binary relations
from text, e.g., A binds B. This is mainly achieved by focusing
on textual co-occurrences, using bag-of-words
approaches, analysis of controlled vocabulary metadata,
and other shallow techniques. However, these
approaches have several disadvantages, including the
identification of many false positive relations.
Additionally, they fail to take into account contextual
information about relations, e.g., the cellular context of a
signaling event, such as cell type and localization.</p>
    </sec>
    <sec id="sec-2">
      <p>In contrast, our work involves the development of more sophisticated TM techniques to extract events, which encapsulate typed n-ary relationships, i.e., interactions between any number of concepts. Events are
able to capture detailed information about mechanisms
of biological pertinence (e.g., reactions such as negative
regulation, phosphorylation, carboxylation) by linking
together interacting participants, which play specific
roles (e.g., modifier, reactant, product, cause, location).
As such, they are able to encode several types of
contextual information that are frequently missing when
only binary relations are considered.</p>
      <p>Consider an intuitive example from the literature to
explain our goal: <italic>The results suggest that the narL gene
product activates the nitrate reductase operon</italic> (PMID:
3035558). This sentence provides interpretative
information about the reaction between the narL gene
product and the nitrate reductase operon, namely that the
information stated is based on an analysis/interpretation
of experimental results, and that there is a certain amount
of speculation expressed towards the reaction (according
to the use of the verb <italic>suggest</italic>, rather than a more definite
verb, such as <italic>demonstrate</italic>). Next, consider a more
complex example: <italic>The analysis showed that IEXC29S
was unable to significantly transactivate the
c-sis/PDGF-B promoter</italic>. Whilst a conventional TM analysis to find
binary relationships would simply discover that some
type of interaction occurs between IEXC29S and
c-sis/PDGF-B, a more detailed contextual analysis would
allow the construction of a representation that encodes
the complex details of the interaction, e.g., that the
information is stated based on an experimental analysis,
and that the interaction has been shown to occur with a
low level of intensity.</p>
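      <p>As a rough illustration only, the narL sentence above could be held in a data structure along the following lines. This is a minimal Python sketch and not EventMine's actual output format: the class names, the role label "theme", the event type label and the attribute values are all assumptions introduced for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Argument:
    """One participant of an event together with the role it plays."""
    role: str  # e.g. "cause" or "location"; "theme" below is an assumed label
    text: str  # surface form of the participant in the sentence


@dataclass
class Event:
    """A typed n-ary relation anchored on a trigger word (hypothetical schema)."""
    event_type: str
    trigger: str
    arguments: list[Argument] = field(default_factory=list)
    attributes: dict[str, str] = field(default_factory=dict)

    def as_binary_relations(self) -> set[frozenset[str]]:
        """Flatten to untyped participant pairs, discarding roles and attributes."""
        names = [a.text for a in self.arguments]
        return {
            frozenset((x, y))
            for i, x in enumerate(names)
            for y in names[i + 1:]
        }


# "The results suggest that the narL gene product activates the
# nitrate reductase operon." Labels and values are illustrative assumptions.
narl_event = Event(
    event_type="Positive_regulation",
    trigger="activates",
    arguments=[
        Argument(role="cause", text="narL gene product"),
        Argument(role="theme", text="nitrate reductase operon"),
    ],
    attributes={"knowledge_type": "analysis", "certainty": "speculated"},
)

# The flattened view keeps only the fact that the two participants are related.
print(narl_event.as_binary_relations())
```
]]></preformat>
      <p>The flattened view returned by as_binary_relations keeps only the pairing of the two participants. The trigger, the roles and the interpretative attributes are exactly the information that a purely binary representation loses.</p>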
      <p>
        In order to extract such complex events
automatically, we have developed a pipeline-based event
extraction system, EventMine [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which employs a
series of classifier modules to capture core event
elements: detection of triggers (words or phrases that
characterise the event, typically verbs or their
nominalisations), detection of edges (finding links
between pairs of concepts), and complex event detection
(combining multiple edges into complex n-ary relations).
      </p>
      <p>
        EventMine utilises a rich set of features including
those obtained from dependency parse trees supplied by
the GENIA Dependency Parser [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as well as from
predicate-argument structures determined by Enju [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
which has been adapted for application to biomedical
text. EventMine can extract interactions
across different sentences, because it can
incorporate results from a pre-executed coreference
resolution method [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this way, event participants
which are semantically empty (e.g., expressions such as
it, that) are resolved to their referents and thus become
more informative. In addition, the system can be adapted
to different tasks without the need for task-specific
tuning [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Finally, EventMine also facilitates the
extraction of interpretative context by detecting various
event attributes, e.g., polarity, certainty, manner,
knowledge type and source [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As with its other
classifier modules, EventMine uses SVMs for this task,
which are trained on the GENIA
Metaknowledge corpus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The automatic extraction of events from biomedical
text has a broad range of applications, which include not
only support for the creation and annotation of pathways
[8], but also automatic population/enrichment of
databases and semantic search systems. To develop
systems that are customised for different tasks such as
these, a text mining infrastructure is needed that
can support the curation and maintenance of
pathways, the sharing and re-use of knowledge distributed
over thousands of scientific publications, and the monitoring
of recent publications to maintain relevance.
To foster adaptability of TM solutions, we are using our
UIMA-based Argo platform [9], which enables the
development of highly customisable solutions in the
form of reconfigurable modular text mining pipelines
(workflows). Apart from supporting the straightforward
integration of application-specific components,
reconfigurable workflows allow for discovery of optimal
solutions [10] owing to their interchangeable underlying
components. For components to be interoperable (i.e., for
one component to build on the text annotations created
by another), the outputs of a predecessor component
must be type-compatible with the inputs expected by a
successor. In Argo, this is ensured by mechanisms that
support mapping between similar semantic types and
conversion of annotations [11].</p>
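      <p>The type-compatibility requirement can be made concrete with a small sketch. The Python below is not Argo's API. The Component class, the type names and the TYPE_MAPPING table are hypothetical, and the check simply walks a linear workflow and reports any input type that no earlier component has produced, after mapping similar types onto a common name.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A workflow step declaring the annotation types it consumes and produces."""
    name: str
    inputs: frozenset[str]
    outputs: frozenset[str]


# Hypothetical mapping between similar semantic types from different type systems.
TYPE_MAPPING = {"genia.Protein": "bio.GeneOrProtein"}


def normalise(types):
    """Rewrite each type name to its mapped equivalent, if one is defined."""
    return frozenset(TYPE_MAPPING.get(t, t) for t in types)


def missing_inputs(workflow):
    """Return (component name, unmet input types) for every incompatible step."""
    available = set()
    problems = []
    for component in workflow:
        unmet = normalise(component.inputs) - available
        if unmet:
            problems.append((component.name, sorted(unmet)))
        available |= normalise(component.outputs)
    return problems


reader = Component("reader", frozenset(), frozenset({"text.Sentence"}))
ner = Component("ner", frozenset({"text.Sentence"}), frozenset({"genia.Protein"}))
extractor = Component(
    "event_extractor",
    frozenset({"text.Sentence", "bio.GeneOrProtein"}),
    frozenset({"bio.Event"}),
)

complete = missing_inputs([reader, ner, extractor])
incomplete = missing_inputs([reader, extractor])
print(complete)
print(incomplete)
```
]]></preformat>
      <p>In this sketch the first workflow passes only because the mapping table treats the two protein types as equivalent. Dropping the named entity recogniser leaves the event extractor without one of its inputs.</p>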
      <p>Argo has supported the development of systems such
as PathText2<sup>1</sup>, an integrated search system that links
biological pathways with supporting knowledge in the
literature [8]. It reads formal pathway models
(represented in the Systems Biology Markup Language,
SBML) and converts them into queries that are
submitted to three semantic search systems operating
over MEDLINE: KLEIO, which improves and
expands on standard literature querying with semantic
categories and faceted search; FACTA+ (see below); and
MEDIE (http://www.nactem.ac.uk/medie/), which
extracts events. MEDIE has been found to achieve the
highest hit ratio, which demonstrates the advantage of
events for finding relevant interactions.</p>
      <p>FACTA+<sup>2</sup> [12] discovers hidden, previously
unknown associations between both concepts and
complex events from the literature (such as Gene
expression, Positive regulation, Binding and Regulation).
Such associations can often only be found by
linking information that may be dispersed across many
documents, and which might therefore be missed using
manual search methods. This facilitates hypothesis
generation, which is directly relevant to pathway
construction. FACTA+ approaches the problem of
automatic discovery of useful hypotheses by combining
two (or more) known associations, i.e., if concept X is
associated with concept Y and concept Y is associated
with concept Z, then the potential indirect association
between X and Z is considered as a useful hypothesis
unless there is already a known association between X
and Z. Such indirect associations can involve complex
events as well as concepts.</p>
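      <p>The X, Y, Z rule described above can be sketched in a few lines of Python. This is an illustrative toy and not the FACTA+ implementation: it assumes that associations are symmetric and unweighted, and the function name and the placeholder concept names are invented for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def indirect_hypotheses(known_pairs):
    """Propose X-Z links supported by a shared neighbour Y.

    Pairs that are already directly associated are skipped. The result maps
    each hypothesised (X, Z) pair to the set of linking concepts Y.
    """
    neighbours = defaultdict(set)
    for a, b in known_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    hypotheses = defaultdict(set)
    for y, linked in neighbours.items():
        for x in linked:
            for z in linked:
                # x < z visits each unordered pair once and excludes x == z.
                if x < z and z not in neighbours[x]:
                    hypotheses[(x, z)].add(y)
    return dict(hypotheses)


# Placeholder concepts; in practice these would be genes, diseases, events, etc.
known = [("X", "Y"), ("Y", "Z"), ("X", "W"), ("W", "Z"), ("Y", "W")]
result = indirect_hypotheses(known)
print(result)
```
]]></preformat>
      <p>Here X and Z are never directly associated but share the neighbours Y and W, so the pair is returned as a hypothesis together with its supporting concepts. A realistic system would additionally rank such candidates, which this sketch does not attempt.</p>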
    </sec>
    <sec id="sec-3">
      <p>Advanced TM methods such as those described here
support pathway curation, validation and maintenance.
Their employment yields improved coverage, faster
acquisition and throughput, combined with easier
identification and normalisation of duplicates, greater
consistency, completeness and accuracy in description,
and reduced curator burden. This helps to free experts
from mundane and tedious tasks while assisting them with more
intellectually challenging ones.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Miwa</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saetre</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>JD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Event extraction with complex event classification using rich features</article-title>
          .
          <source>J Bioinform Comput Biol</source>
          .
          <year>2010</year>
          ;
          <volume>8</volume>
          (
          <issue>1</issue>
          ):
          <fpage>131</fpage>
          -
          <lpage>46</lpage>
          . Epub 2010/02/26. PII: S0219720010004586. PubMed PMID:
          <pub-id pub-id-type="pmid">20183879</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Sagae</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Dependency parsing and domain adaptation with LR models and parser ensembles</article-title>
          .
          <source>Proceedings of the CoNLL 2007 Shared Task</source>
          ;
          <year>2007</year>
          . p.
          <fpage>1044</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Miyao</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagae</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saetre</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsuzaki</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Evaluating contributions of natural language parsers to protein-protein interaction extraction</article-title>
          .
          <source>Bioinformatics</source>
          .
          <year>2009</year>
          ;
          <volume>25</volume>
          (
          <issue>3</issue>
          ):
          <fpage>394</fpage>
          -
          <lpage>400</lpage>
          . PubMed PMID:
          <pub-id pub-id-type="pmid">19073593</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Miwa</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <article-title>Boosting automatic event extraction from the literature using domain adaptation and coreference resolution</article-title>
          .
          <source>Bioinformatics</source>
          .
          <year>2012</year>
          ;
          <volume>28</volume>
          (
          <issue>13</issue>
          ):
          <fpage>1759</fpage>
          -
          <lpage>65</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/bioinformatics/bts237</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Miwa</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <article-title>Adaptable, high recall, event extraction system with minimal configuration</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <year>2015</year>
          ;
          <volume>16</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1186/1471-2105-16-s10-s7</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Miwa</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNaught</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kell</surname>
            <given-names>DB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <article-title>Extracting semantically enriched events from biomedical literature</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <year>2012</year>
          ;
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <fpage>108</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Thompson</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nawaz</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNaught</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <article-title>Enriching a biomedical event corpus with meta-knowledge annotation</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <year>2011</year>
          ;
          <volume>12</volume>
          :
          <fpage>393</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>1 http://www.nactem.ac.uk/pathtext2/demo/</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>2 http://www.nactem.ac.uk/facta/</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>