<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a methodology for entity error analysis in annotated corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qi Wei</string-name>
          <email>qiwei@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuval Krymolowski</string-name>
          <email>yuval@cl.haifa.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nigel Collier</string-name>
          <email>collier@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer</institution>
          ,
          <addr-line>Science</addr-line>
          ,
          <institution>University of Haifa</institution>
          ,
          <addr-line>Haifa 31905</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of</institution>
          ,
          <addr-line>Informatics, 2-1-2, Chiyoda-ku, Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a methodology for error analysis in entity annotation. To increase the accuracy of annotated corpora, analysis methods are needed for detecting human annotation errors and schema errors. We use easiness statistics and information gain to obtain insights into possible causes of error in the GENIA corpus of MEDLINE abstracts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>General Terms</title>
      <p>error analysis, algorithms</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>With the rapid expansion of biomedical research, an
overwhelming number of research publications are being
produced which require searching. In order to help with this
task, text mining has been applied in areas ranging from the
extraction of signal transduction pathways to the analysis of
infectious disease outbreaks. Within text mining, named
entity recognition (NER), which seeks to identify and classify
terms into predefined target classes, is regarded as the first
key stage in mapping to a computable semantic
representation.</p>
      <p>NER originated from the Message Understanding
Conferences (MUC) in the 1990s. The task in MUC is to identify
terms such as person names, organization names, etc., in the
newswire domain. During the last few years, NER in the
biological domain has improved rapidly. The task in
biological named entity recognition (BioNER) is to identify and
label entities such as DNA and other biological products. The
accuracy for BioNER (about 70%) is much lower than the average
90% accuracy for the MUC task. Compared with the newswire
domain, entities in the biomedical domain tend to be more
complex due to factors such as long and descriptive naming
conventions and conjunctive and disjunctive structures.</p>
      <p>
        In most of the current error analyses [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ], one selects a
fixed number of errors and classifies them manually. In such
cases, there is a critical need for analysis tools and methods
for detecting human annotation errors and schema
inconsistencies.
      </p>
      <p>In this paper, we present a general method for error analysis
on annotated corpora. By applying this method, we can
inspect every error in our test data and obtain more detailed
information about the errors.</p>
    </sec>
    <sec id="sec-3">
      <title>2. METHOD</title>
      <p>
        After obtaining the test results from 400 models, we applied
easiness and hardness statistics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to each instance. Then
we constructed a confusion matrix from the hard instances.
In addition, we used the information gain derived from the
easiness and hardness statistics to calculate the contribution
of each feature used in the NER system.
      </p>
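<p>As a sketch, the confusion-matrix step can be implemented by counting (gold, predicted) label pairs over the hard instances. The labels and majority-vote predictions below are invented for illustration, not taken from the GENIA experiment.</p>

```python
from collections import Counter

def confusion_matrix(pairs):
    """Count (gold label, predicted label) pairs over the hard instances."""
    return Counter(pairs)

# Hypothetical (gold, predicted) labels for three hard instances, using
# each instance's majority prediction across the collection of models.
hard_instances = [
    ("cell_type", "protein"),
    ("cell_type", "protein"),
    ("protein", "dna"),
]
cm = confusion_matrix(hard_instances)
```

<p>Large off-diagonal counts then point at class pairs that the models, and possibly the annotation schema, tend to conflate.</p>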
    </sec>
    <sec id="sec-4">
      <title>2.1 Easiness and hardness statistics</title>
      <p>
        Easiness and hardness statistics were first introduced by
Krymolowski [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Consider a collection of models with similar
recall and precision; the words each model classifies correctly
may nevertheless differ. If a word is classified correctly by all
models, it is treated as easy, and if it is classified wrongly by
all models, it is treated as hard. The definition of easiness and
hardness comes from this idea. Let L denote a set of supervised
learning models and T the set of test data. Each instance t ∈ T
can be characterized by a bit-vector:
      </p>
      <p>v(t) = (v1(t), ..., vn(t)),
where
vi(t) = 1 if t was labeled correctly by model i,
vi(t) = 0 if t was labeled wrongly by model i.
Easiness is defined according to the vector v(t):</p>
      <p>easiness(t) = (1/n) ∑_{i=1}^{n} vi(t),</p>
      <p>which is the probability of t being labeled correctly by one of the
classification models. The value of easiness(t) lies between
0 and 1. Here, we call an instance hard if its easiness is
between 0 and 0.1, and easy if its easiness is between 0.9 and 1.</p>
      <p>Most of the errors were caused by inconsistent annotations.
Hard and easy instances can be further divided. We focus on
the hard instances, which most models cannot recognize correctly.</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Information Gain</title>
      <p>
        Information gain [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is used to calculate the contribution of
each feature used in the NER system. The entropy for the NE
classes H(C) is defined by
      </p>
      <p>H(C) = − ∑_{c ∈ C} p(c) log2 p(c)</p>
      <p>where p(c) = n(c)/N, n(c) stands for the number of words in
class c, and N stands for the total number of words in the data
pool. When a feature F is given, the conditional entropy for the
NE classes H(C|F) is defined by</p>
      <p>H(C|F) = − ∑_{c ∈ C} ∑_{f ∈ F} p(c, f) log2 p(c|f)</p>
      <p>where p(c, f) = n(c, f)/N, p(c|f) = n(c, f)/n(f), n(c, f) stands for
the number of words in class c with the feature value f, and n(f)
stands for the number of words with the feature value f.
The information gain for the NE classes and a feature, I(C; F),
can be calculated as:</p>
      <p>I(C; F) = H(C) − H(C|F)
The information gain shows how much the feature F contributes
to the classification. I(C; F) equals 0 if feature F is
completely independent of C, and reaches its maximum H(C) if F
gives sufficient information to label the named entities.</p>
      <p>To compare different features, the information gain has to
be normalized as a ratio:</p>
      <p>GR(C; F) = I(C; F) / H(C)</p>
      <p>GR(C; F) lies between 0 and 1, so features can be compared
even if the class entropies are different.</p>
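<p>The entropy, conditional entropy, and gain-ratio formulas above can be sketched directly from their definitions; the toy classes and feature values below are hypothetical.</p>

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(C) = - sum over c of p(c) * log2 p(c)."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def conditional_entropy(labels, feature_values):
    """H(C|F) = - sum over (c, f) of p(c, f) * log2 p(c|f)."""
    n = len(labels)
    joint = Counter(zip(labels, feature_values))
    marginal = Counter(feature_values)
    return -sum((k / n) * log2(k / marginal[f])
                for (_, f), k in joint.items())

def gain_ratio(labels, feature_values):
    """GR(C, F) = I(C; F) / H(C), where I(C; F) = H(C) - H(C|F)."""
    h_c = entropy(labels)
    return (h_c - conditional_entropy(labels, feature_values)) / h_c

# Toy data: a feature that determines the class has GR = 1,
# a feature independent of the class has GR = 0.
classes = ["protein", "protein", "dna", "dna"]
perfect = ["p", "p", "d", "d"]
useless = ["x", "y", "x", "y"]
```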
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENT</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Data set and models</title>
      <p>
        GENIA corpus version 3.02 was used in this experiment,
annotated with 36 classes. SVM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was
selected as the supervised model in the test, and 400 different
models were used. The first 40% of the corpus was used for
testing, and 24% of the corpus (randomly sampled) was used to
train each of the 400 different models. No cascaded entities
existed in this experiment; only the longest entity was
annotated.
      </p>
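<p>One plausible reading of this setup, sketched below with a hypothetical document count: the test block is fixed, and each of the 400 models gets its own random 24% training sample. The sampling details (seed, overlap handling) are assumptions, not stated in the paper.</p>

```python
import random

def draw_training_samples(n_docs, n_models=400, fraction=0.24, seed=0):
    """Draw one random training sample (list of document indices) per model."""
    rng = random.Random(seed)
    pool = list(range(n_docs))
    k = int(n_docs * fraction)  # sample size per model
    return [rng.sample(pool, k) for _ in range(n_models)]

# Hypothetical corpus of 100 documents:
samples = draw_training_samples(n_docs=100)
```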
    </sec>
    <sec id="sec-8">
      <title>3.2 Results</title>
      <p>Using the method described above, errors were successfully
classified into three types: (1) boundary errors with no
classification errors; (2) boundary errors with classification
errors; (3) classification errors with no boundary errors.
For example:
1. ... in normal T cells in which IL-2R alpha expression
has been induced.
In the first sentence, "T cells" without "normal" was
annotated as a cell type, while in the second sentence, "normal
T cells" was annotated as a cell type in the original corpus.
In the results, a kind of error was found which we call
incomplete forms. For example:
1. &lt;proteinmolecule&gt; protein kinase C-alpha, -epsilon,
and -zeta &lt;/proteinmolecule&gt;
2. &lt;proteinmolecule&gt; LMP1 and 2 &lt;/proteinmolecule&gt;
Forms like '-epsilon' and '-zeta' are incomplete, and they need
to be recovered to their full terms 'C-epsilon' and 'C-zeta'.</p>
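<p>The three error types can be detected with a simple comparison of gold and predicted spans and labels; the character offsets below are hypothetical.</p>

```python
def error_type(gold_span, pred_span, gold_label, pred_label):
    """Classify an NER error by whether the span and/or the class disagree."""
    boundary_err = gold_span != pred_span
    class_err = gold_label != pred_label
    if boundary_err and class_err:
        return "boundary + classification"
    if boundary_err:
        return "boundary only"
    if class_err:
        return "classification only"
    return "correct"

# "normal T cells" (gold) vs. "T cells" (predicted): the boundaries
# differ while the class agrees, i.e. error type 1 from Section 3.2.
t1 = error_type((0, 14), (7, 14), "cell_type", "cell_type")
```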
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSIONS</title>
      <p>Corpus error analysis is an important step in improving the
accuracy of BioNER. The easiness and hardness statistics
used here are effective in measuring the degree of difficulty
that a model has in recognizing an entity. We focused on
the hard entities, and this made it easy to collect all errors in
the experimental results. It also allowed us to select error
categories for drill-down analysis. The importance of a
feature can be learned by using the information gain, and from
the important features, evidence can be found to strengthen the
results. Using these two methods together helped us to find
inconsistent annotations in the GENIA corpus.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Olshen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Stone</surname>
          </string-name>
          .
          <article-title>Classification and regression trees</article-title>
          . Wadsworth International Group, Belmont, CA,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Cristianini</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          .
          <article-title>An introduction to support vector machines and other kernel-based learning methods</article-title>
          . Cambridge University Press, New York, NY,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dingare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Grover</surname>
          </string-name>
          .
          <article-title>A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and the evaluations</article-title>
          .
          <source>Comparative and Functional Genomics</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Krymolowski</surname>
          </string-name>
          .
          <article-title>Distinguishing easy and hard instances</article-title>
          . In
          <source>International Conference on Computational Linguistics</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Recognizing names in biomedical texts using hidden Markov model and SVM plus sigmoid</article-title>
          .
          <source>In International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA)</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>