<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Coreference Resolution for Biomedical Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miji Choi</string-name>
          <email>jooc1@student.unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karin Verspoor</string-name>
          <email>karin.verspoor@unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justin Zobel</string-name>
          <email>jzobel@unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Melbourne</institution>
          ,
          <addr-line>Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
          ,
          <institution>National ICT Australia</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Algorithms</institution>
          ,
          <addr-line>Performance, Reliability.</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of Melbourne</institution>
          ,
          <addr-line>Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>The accuracy of document processing activities such as retrieval or event extraction can be improved by resolution of lexical ambiguities. In this brief paper we investigate coreference resolution in biomedical texts, reporting on an experiment that shows the benefit of domain-specific knowledge. Comparison of a state-of-the-art general system with a purpose-built system shows that the latter is a dramatic improvement.</p>
      </abstract>
      <kwd-group>
        <kwd>Coreference resolution</kwd>
        <kwd>domain-specific knowledge</kwd>
        <kwd>named entity recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The peer-reviewed scientific literature is a vast repository of
authoritative knowledge. The life sciences literature is the basis of
biomedical research and clinical practice, and must be searchable
to be of value. However, with around 40,000 new journal papers
every month, manual discovery or annotation is infeasible, and
thus it is critical that document processing techniques be robust
and accurate, to enable not only conventional search, but
automated discovery and assessment of knowledge such as
interacting relationships (events and facts) between biomolecules
such as proteins, genes, chemical compounds and drugs.
In biomedical and pharmaceutical research, for example, biological
molecular pathways, integrated with knowledge of relevant
protein-protein interactions or chemical reactions, are used to
understand complex biological processes that could explain
specific health conditions in the human body.</p>
      <p>
        A particular challenge is the need for lexical ambiguity resolution
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Lexical ambiguity is a general problem for text processing –
such as for search or for event extraction – but is particularly
acute in this domain, which has a vast but inconsistent technical
lexicon; the domain also presents particular opportunities,
because many technical terms are constructed in accordance with
a set of highly standardized rules. Thus while there are particular
kinds of ambiguity (genes and proteins may share names, for
example) there are also deductions that can be made from name
structure (for example, that a certain name must be a chemical).
A key obstacle is the low detection reliability of hidden or
complex mentions of entities involving coreference expressions in
natural language texts [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Thus, coreference resolution is an
essential task in information extraction, because it can
automatically provide links between entities, and can also
facilitate better indexing for medical information search with rich
semantic information.
      </p>
      <p>For example, the following passage includes an interacting
relation: a binding event between the anaphoric mention <italic>the
protein</italic> and a cell entity, CD40, is implied in the text. The mention
<italic>the protein</italic> refers to the specific protein name, TRAF2, previously
mentioned in the same discourse.</p>
      <p>… The phosphorylation appears to be related to the
signalling events that are activated by TRAF2 under
these circumstances, since two non-functional mutants
were found to be phosphorylated significantly less than
the wild-type protein. Furthermore, the phosphorylation
status of TRAF2 had significant effects on the ability of
the protein to bind to CD40, as evidenced by our
observations …</p>
      <p>Such anaphoric mentions, or pronouns in texts, are mostly ignored
by event extraction systems, and are not considered as term
occurrences in information retrieval systems. In this brief paper,
we report an initial investigation of the challenges of biomedical
coreference resolution, test an existing general-domain
coreference resolution system on biomedical texts, and
demonstrate that domain-specific knowledge can be helpful for
coreference resolution for the biomedical domain.</p>
    </sec>
    <sec id="sec-2">
      <title>2. EXPERIMENT</title>
      <p>
        To evaluate the importance of domain-specific knowledge, we
compare an existing coreference resolution system, TEES, that
uses a domain-specific named entity recognition (NER) module
with an existing general system, CoreNLP, that does not use a
domain-specific NER. The aim is to explore how domain-specific
information impacts on performance for coreference resolution
involving protein and gene entities. The TEES system, which
includes a biomedical domain-specific NER component for
protein and gene mentions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and the Stanford CoreNLP system,
which uses syntactic and discourse information but no NER
outputs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], are evaluated on a domain-specific annotated corpus.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Sets</title>
      <p>
        We use the training dataset from the Protein Coreference Shared
task at BioNLP 2011 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for our evaluation of existing
coreference resolution systems. The annotated corpus includes
2,313 coreference relations, which are pairs of anaphors and
antecedents related to protein and gene entities, from 800 PubMed
journal abstracts. As shown in Table 1, this gold standard dataset
consists of coreference relations involving relative pronouns such
as which, that, or who, or pronouns such as it, its, or they. Among
the 2,313 coreference relations, 560 embed one or more
specific protein or gene names.
      </p>
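      <p>As a rough illustration of how such relations might be grouped by anaphor type, the following Python sketch sorts anaphors into the categories named above. The function names, the "other" category, and the example pairs are assumptions made for illustration; they are not the annotation scheme of the corpus, and the sketch ignores ambiguity (for instance, "that" used as a demonstrative).</p>
      <preformat preformat-type="code">
```python
from collections import Counter

# Word lists taken from the examples given in the text; a real
# categorisation would need a fuller lexicon.
RELATIVE_PRONOUNS = {"which", "that", "who"}
PRONOUNS = {"it", "its", "they"}


def anaphor_type(anaphor: str) -> str:
    """Assign a coarse type to an anaphor string (illustrative only)."""
    token = anaphor.strip().lower()
    if token in RELATIVE_PRONOUNS:
        return "relative pronoun"
    if token in PRONOUNS:
        return "pronoun"
    # Anything else, e.g. a definite noun phrase such as "the protein".
    return "other"


def count_anaphor_types(relations):
    """Count anaphor types over (anaphor, antecedent) pairs."""
    return Counter(anaphor_type(anaphor) for anaphor, _antecedent in relations)


# Hypothetical (anaphor, antecedent) pairs, not drawn from the corpus.
example_relations = [
    ("which", "TRAF2"),
    ("its", "CD40"),
    ("the protein", "TRAF2"),
]

if __name__ == "__main__":
    for kind, count in count_anaphor_types(example_relations).items():
        print(kind, count)
```
      </preformat>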
    </sec>
    <sec id="sec-4">
      <title>2.2 Results</title>
      <p>Performance for identification of coreference mentions and
relations of each system evaluated on the annotated corpus is
compared in Table 2. The Stanford system achieved low
performance, with F-scores of 12% and 2% for the detection of
coreference mentions and relations respectively, although it produced a
greater number of detected mentions. The TEES system
achieved better performance, with F-scores of 69% and 37% at the
coreference mention and relation levels respectively, but produced
a smaller number of detections, which reduced system recall. Both
systems show a large drop from mention detection to relation
detection: the number of exact matches falls from 1,006 at the
mention level to 112 at the relation level for the Stanford system,
and from 2,466 to 546 for the TEES system.</p>
      <p>The causes of low performance by each system at the
coreference relation level are analysed in detail in Figure 1.
Several factors have been observed: lack of domain-specific knowledge (A),
bias towards selection of the closest antecedent candidate (B),
limiting analysis to within-sentence relations (C), syntactic
parsing errors (D), and disregard of definite noun phrases (E).
The main cause, lack of domain-specific
knowledge, is explored below.</p>
      <p>The annotated corpus contains 560 coreference relations in which
anaphoric mentions refer to protein or gene entities previously
mentioned in a text. For those coreference relations, the TEES
system outperformed the Stanford system by identifying 155 true
positives – far more than the 38 identified by the Stanford system,
as shown in Table 3.</p>
      <p>The Stanford system also produces a large number of false
positives. Although half of the false positives are relations
where anaphors are unclassified, the system also links
mentions where an anaphor and an antecedent are identical, or
have a common head word (the main noun of the phrase). This is
because coreference resolution systems in general domains aim to
identify all mentions that refer to the same entity in a text, rather
than to resolve only specifically anaphoric mentions. Considering
those anaphoric mentions, inspection of individual instances (as
illustrated in Figure 2) strongly suggests that lack of
domain-specific knowledge is the main cause of failure.</p>
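      <p>The contrast between head-word matching and selection informed by named entity recognition can be sketched as follows. This is a toy illustration only, not the algorithm of either the Stanford or the TEES system: the Mention type, the is_protein flag (standing in for the output of a protein NER component), the last-token head-word heuristic, and the hand-assigned positions are all hypothetical. The mentions are taken from the TRAF2 passage in Section 1.</p>
      <preformat preformat-type="code">
```python
from typing import List, NamedTuple, Optional


class Mention(NamedTuple):
    text: str
    position: int      # order of appearance in the passage (hand-assigned)
    is_protein: bool   # hypothetical stand-in for a protein NER decision


def head_word(phrase: str) -> str:
    """Crude head-word guess: the last token of the phrase."""
    return phrase.split()[-1].lower()


def head_match_antecedent(anaphor: Mention,
                          candidates: List[Mention]) -> Optional[Mention]:
    """Nearest preceding mention that shares the anaphor's head word."""
    preceding = [m for m in candidates
                 if anaphor.position > m.position
                 and head_word(m.text) == head_word(anaphor.text)]
    return max(preceding, key=lambda m: m.position, default=None)


def ner_aware_antecedent(anaphor: Mention,
                         candidates: List[Mention]) -> Optional[Mention]:
    """Nearest preceding mention flagged as a protein name."""
    preceding = [m for m in candidates
                 if anaphor.position > m.position and m.is_protein]
    return max(preceding, key=lambda m: m.position, default=None)


# Mentions from the example passage, labelled by hand for this sketch.
candidates = [
    Mention("TRAF2", 0, True),
    Mention("the wild-type protein", 1, False),
    Mention("TRAF2", 2, True),
]
anaphor = Mention("the protein", 3, False)

if __name__ == "__main__":
    print("head match:", head_match_antecedent(anaphor, candidates).text)
    print("NER-aware: ", ner_aware_antecedent(anaphor, candidates).text)
```
      </preformat>
      <p>With these hand-labelled mentions, the head-word matcher selects <italic>the wild-type protein</italic>, whereas the NER-aware selector returns the nearest recognised protein name, TRAF2.</p>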
      <p>On the other hand, the TEES system achieved 77% precision, but
still only 28% recall. The main reason for the low recall is that the
system is limited to identification of coreference relations where
anaphors and antecedents corefer within a single sentence. Even
though anaphoric coreference mentions mostly link to their
antecedents across sentences, the system still identified 155
correct coreference relations by taking advantage of
domain-specific information provided through recognition of proteins.
For example, the anaphoric mention <italic>the protein</italic> is correctly identified
as referring to TRAF2 by the TEES system, but the Stanford
system links it to the incorrect antecedent <italic>the wild-type protein</italic>.</p>
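      <p>The recall figure follows directly from the counts reported above, and an F-score can be derived from precision and recall in the usual way. The sketch below assumes the standard balanced F-score (the harmonic mean of precision and recall); the function names are illustrative, and the derived F-score for the 560-relation subset is a worked calculation rather than a figure reported in this paper.</p>
      <preformat preformat-type="code">
```python
def recall(true_positives: int, gold_total: int) -> float:
    """Fraction of gold-standard relations that the system recovered."""
    return true_positives / gold_total


def f_score(precision: float, recall_value: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)


# Counts and precision as stated in the text for the protein/gene subset.
GOLD_RELATIONS = 560
TEES_TRUE_POSITIVES = 155
STANFORD_TRUE_POSITIVES = 38
TEES_PRECISION = 0.77

tees_recall = recall(TEES_TRUE_POSITIVES, GOLD_RELATIONS)
stanford_recall = recall(STANFORD_TRUE_POSITIVES, GOLD_RELATIONS)
# Derived value, assuming a balanced F-score; not reported in the text.
tees_f = f_score(TEES_PRECISION, tees_recall)

if __name__ == "__main__":
    print(f"TEES recall: {tees_recall:.2f}")          # prints 0.28
    print(f"Stanford recall: {stanford_recall:.2f}")  # prints 0.07
    print(f"TEES F-score (derived): {tees_f:.2f}")
```
      </preformat>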
    </sec>
    <sec id="sec-5">
      <title>3. CONCLUSIONS</title>
      <p>In this study, we have explored how domain-specific knowledge
can be helpful for resolving coreferring expressions in the
biomedical domain. The performance difference between a system
using a domain-specific NER approach and a general system is
substantial. In a detailed analysis of individual cases of failure (not
reported here) we have observed that domain knowledge,
rather than variation in methods, is the main explanation for the
success of the domain-specific approach.</p>
    </sec>
    <sec id="sec-6">
      <title>4. ACKNOWLEDGMENTS</title>
      <p>This work was supported by the University of Melbourne, and by
the Australian Federal and Victorian State governments and the
Australian Research Council through the ICT Centre of
Excellence program, National ICT Australia (NICTA).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Krovetz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>Homonymy and polysemy in information retrieval</article-title>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.-D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J. i.</given-names>
          </string-name>
          <article-title>Overview of the protein coreference task in BioNLP shared task 2011</article-title>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Miwa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saetre</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.-D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J. i.</given-names>
          </string-name>
          <article-title>Event extraction with complex event classification using rich features</article-title>
          .
          <source>Journal of bioinformatics and computational biology</source>
          ,
          <volume>8</volume>
          ,
          <issue>01</issue>
          (
          <year>2010</year>
          ),
          <fpage>131</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Björne</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Salakoski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>Generalizing biomedical event extraction</article-title>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peirsman</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task</article-title>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>