=Paper=
{{Paper
|id=Vol-289/paper-9
|storemode=property
|title=Towards a methodology for entity error analysis in annotated corpora
|pdfUrl=https://ceur-ws.org/Vol-289/po01.pdf
|volume=Vol-289
|dblpUrl=https://dblp.org/rec/conf/kcap/WeiKC07
}}
==Towards a methodology for entity error analysis in annotated corpora==
Qi Wei (National Institute of Informatics, 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan; qiwei@nii.ac.jp)
Yuval Krymolowski (Department of Computer Science, University of Haifa, Haifa 31905, Israel; yuval@cl.haifa.ac.il)
Nigel Collier (National Institute of Informatics, 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan; collier@nii.ac.jp)
ABSTRACT

We present a methodology for error analysis in entity annotation. To increase the accuracy of corpora, there is a need for an analysis method that detects human annotation and schema errors. We use easiness statistics and information gain to gain insight into possible causes of error in the GENIA corpus of MEDLINE abstracts.

Categories and Subject Descriptors

I.2.7 [Artificial Intelligence]: Natural language processing

General Terms

Error analysis algorithms

1. INTRODUCTION

With the rapid expansion of biomedical research, an overwhelming number of research publications are being produced which require searching. To help with this task, text mining has been applied in areas ranging from the extraction of signal transduction pathways to the analysis of infectious disease outbreaks. Within text mining, named entity recognition (NER), which seeks to identify and classify terms into predefined target classes, is regarded as the first key stage in mapping to a computable semantic representation.

NER originated from the Message Understanding Conferences (MUC) in the 1990s. The task in MUC is to identify terms such as person names, organization names, etc., in the newswire domain. During the last few years, NER in the biological domain has improved rapidly. The task in biological named entity recognition (BioNER) is to identify and label DNA and other products. The accuracy for BioNER (about 70%) is much lower than the average 90% accuracy for the MUC task. Compared with the newswire domain, entities in the biomedical domain tend to be more complex due to factors such as long and descriptive naming conventions and conjunctive and disjunctive structures.

In most current error analyses [3, 5], one selects a fixed number of errors and classifies them manually. In such cases, there is a critical need for analysis tools and methods for detecting human annotation errors and schema inconsistencies.

In this paper, we present a general method for error analysis on annotated corpora. By applying this method, we can access every error in our test data and obtain more detailed information on the errors.

2. METHOD

After obtaining the test results from 400 models, we applied easiness and hardness statistics [4] to each instance. Then we constructed a confusion matrix from the hard instances. In addition, we used the information gain derived from the easiness and hardness statistic to calculate the contribution of each feature used in the NER system.

2.1 Easiness and hardness statistics

Easiness and hardness statistics were first introduced by Krymolowski [4]. Consider a collection of models with similar recall and precision; the words each model classifies correctly may nevertheless differ. If a word is classified correctly by all models, it is treated as easy, and if it is classified wrongly by all models, it is treated as hard. The definition of easiness and hardness comes from this idea. Let L denote a set of supervised learning models and T the set of test data. Each instance t ∈ T can be characterized by a bit-vector

    v(t) = (v_1(t), ..., v_n(t)),

where

    v_i(t) = { 1 if t was labeled correctly by model i,
             { 0 if t was labeled wrongly by model i.

Easiness is defined from the vector v(t) as

    easiness(t) = (1/n) Σ_{i=1}^{n} v_i(t),

which is the probability of correctly labeling t by one of the classification models. The value of easiness(t) lies between 0 and 1. Here, we define an instance whose easiness is between 0 and 0.1 as hard, and an instance whose easiness is between 0.9 and 1 as easy.
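As an illustrative sketch (not the authors' code), the easiness statistic and the hard/easy categorization can be computed from a per-instance correctness bit-vector; the model outputs below are hypothetical:

```python
# Sketch of the easiness/hardness statistic from Section 2.1.
# Each instance is represented by a bit-vector v(t): v_i(t) = 1 if model i
# labeled the instance correctly, 0 otherwise. All data here is illustrative.

def easiness(bit_vector):
    """Fraction of models that labeled the instance correctly."""
    return sum(bit_vector) / len(bit_vector)

def categorize(bit_vector, hard_max=0.1, easy_min=0.9):
    """Label an instance 'hard', 'easy', or 'medium' by its easiness."""
    e = easiness(bit_vector)
    if e <= hard_max:
        return "hard"
    if e >= easy_min:
        return "easy"
    return "medium"

# Example with 10 hypothetical models:
print(categorize([1] * 10))        # every model correct -> easy
print(categorize([0] * 9 + [1]))   # easiness = 0.1 -> hard
print(categorize([1, 0] * 5))      # easiness = 0.5 -> medium
```

With 400 models, as in the experiment, the bit-vectors would simply be 400 elements long; the thresholds 0.1 and 0.9 are those given in the text.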
Hard and easy instances can be further divided. We focus on hard instances, which most models cannot recognize correctly.

2.2 Information Gain

Information gain [1] is used to calculate the contribution of each feature used in the NER system. The entropy of the NE classes, H(C), is defined by

    H(C) = - Σ_{c ∈ C} p(c) log2 p(c),

where p(c) = n(c)/N, n(c) stands for the number of words in class c, and N stands for the total number of words in the data pool.

When a feature F is given, the conditional entropy of the NE classes, H(C|F), is defined by

    H(C|F) = - Σ_{c ∈ C} Σ_{f ∈ F} p(c, f) log2 p(c|f),

where p(c, f) = n(c, f)/N and p(c|f) = n(c, f)/n(f); n(c, f) stands for the number of words in class c with feature value f, and n(f) stands for the number of words with feature value f.

The information gain for the NE classes and a feature F, I(C;F), can be calculated as

    I(C;F) = H(C) - H(C|F).

The information gain shows how the feature F contributes to the classification. I(C;F) equals 0 if feature F is completely independent of C, and equals 1 if F gives sufficient information to label named entities.

To compare different features, the information gain has to be normalized as the information ratio:

    GR(C;F) = I(C;F) / H(C).

GR(C;F) ratios lie between 0 and 1 and can be compared even if the class entropies are different.

3. EXPERIMENT

3.1 Data set and models

GENIA corpus version 3.02 was used in this experiment. 36 classes were used to annotate the corpus. SVM [2] was selected as the supervised model in the test, and 400 different models were used. 40% of the corpus, taken from the beginning, was used for testing. 24% of the corpus (randomly sampled) was used to train each of the 400 different models. No cascaded entities existed in this experiment; only the longest entity was annotated.

3.2 Results

Using the method described above, errors were successfully classified into three types: 1. boundary errors without classification errors; 2. boundary errors with classification errors; 3. classification errors without boundary errors.

Most of the errors were caused by inconsistent annotations. For example:

1. .. in normal T cells in which IL-2R alpha expression has been induced.

2. .. are activated in normal T cells in response to IL-2.

In the first sentence, "T cells" without "normal" was annotated as a cell type, while in the second sentence, "normal T cells" was annotated as a cell type in the original corpus.

In the results, a kind of error was found which we call incomplete forms. For example:

1. protein kinase C-alpha, -epsilon, and -zeta

2. <protein_molecule> LMP1 and 2 </protein_molecule>

Forms like '-epsilon' and '-zeta' are incomplete, and they need to be recovered to their full terms, 'C-epsilon' and 'C-zeta'.

4. CONCLUSIONS

Corpus error analysis is an important step in improving the accuracy of BioNER. The easiness and hardness statistics used here are effective in measuring the degree of hardness that a model has in recognizing an entity. We focused on the hard entities, and this made it easy to collect all the errors in the experimental results. It also allowed us to select error categories for drill-down analysis. The importance of a feature can be learned by using the information gain, and from the important features, evidence can be found to strengthen the results. We used these two methods together, and this helped us to find inconsistent annotations in the GENIA corpus.

5. REFERENCES

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.

[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, New York, NY, 2000.

[3] S. Dingare, M. Nissim, J. Finkel, C. Manning, and C. Grover. A system for identifying named entities in biomedical text: How results from two evaluations reflect on both the system and the evaluations. Comparative and Functional Genomics, 2005.

[4] Y. Krymolowski. Distinguishing easy and hard instances. In International Conference on Computational Linguistics, 2002.

[5] G. Zhou. Recognizing names in biomedical texts using hidden Markov model and SVM plus sigmoid. In International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), 2004.