=Paper=
{{Paper
|id=Vol-289/paper-9
|storemode=property
|title=Towards a methodology for entity error analysis in annotated corpora
|pdfUrl=https://ceur-ws.org/Vol-289/po01.pdf
|volume=Vol-289
|dblpUrl=https://dblp.org/rec/conf/kcap/WeiKC07
}}
==Towards a methodology for entity error analysis in annotated corpora==
Qi Wei (National Institute of Informatics, 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan; qiwei@nii.ac.jp)
Yuval Krymolowski (Department of Computer Science, University of Haifa, Haifa 31905, Israel; yuval@cl.haifa.ac.il)
Nigel Collier (National Institute of Informatics, 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan; collier@nii.ac.jp)
ABSTRACT

We present a methodology for error analysis in entity annotation. To increase the accuracy of corpora, there is a need for an analysis method that detects human annotation and schema errors. We use easiness statistics and information gain to gain insight into possible causes of error in the GENIA corpus of MEDLINE abstracts.

Categories and Subject Descriptors

I.2.7 [Artificial Intelligence]: Natural language processing

General Terms

Error analysis algorithms

1. INTRODUCTION

With the rapid expansion of biomedical research, an overwhelming number of research publications are being produced which require searching. To help with this task, text mining has been applied in areas ranging from the extraction of signal transduction pathways to the analysis of infectious disease outbreaks. Within text mining, named entity recognition (NER), which seeks to identify and classify terms into predefined target classes, is regarded as the first key stage in mapping to a computable semantic representation.

NER originated from the Message Understanding Conferences (MUC) in the 1990s. The task in MUC is to identify terms such as person names, organization names, etc., in the newswire domain. During the last few years, NER in the biological domain has improved rapidly. The task in biological named entity recognition (BioNER) is to identify and label DNA and other products. The accuracy for BioNER (about 70%) is much lower than the average 90% accuracy for the MUC task. Compared with the newswire domain, entities in the biomedical domain tend to be more complex due to factors such as long and descriptive naming conventions and conjunctive and disjunctive structures.

In most current error analyses [3, 5], one selects a fixed number of errors and classifies them manually. In such cases, there is a critical need for analysis tools and methods for detecting human annotation errors and schema inconsistencies.

In this paper, we present a general method for error analysis on annotated corpora. By applying this method, we can access every error in our test data and obtain more detailed information on the errors.

2. METHOD

After obtaining the test results from 400 models, we applied easiness and hardness statistics [4] to each instance. Then we constructed a confusion matrix from the hard instances. In addition, we used the information gain derived from the easiness and hardness statistic to calculate the contribution of each feature used in the NER system.

2.1 Easiness and hardness statistics

Easiness and hardness statistics were first introduced by Krymolowski [4]. Consider a collection of models with similar recall and precision; the words each model classifies correctly may nevertheless differ. If a word is classified correctly by all models, it is treated as easy, and if it is classified wrongly by all models, it is treated as hard. The definition of easiness and hardness comes from this idea. Let L denote a set of supervised learning models and T the set of test data. Each instance t ∈ T can be characterized by a bit-vector

    v(t) = (v_1(t), ..., v_n(t)),

where

    v_i(t) = { 1 if t was labeled correctly by model i,
             { 0 if t was labeled wrongly by model i.

Easiness is defined from the vector v(t) as

    easiness(t) = (1/n) Σ_{i=1}^{n} v_i(t),

which is the probability of correctly labeling t by one of the classification models. The value of easiness(t) lies between 0 and 1. Here, we define an instance whose easiness is between 0 and 0.1 as hard, and an instance whose easiness is between 0.9 and 1 as easy.
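As an illustrative sketch (not the authors' code), the easiness statistic and the hard/easy categorization can be computed from a per-instance correctness bit-vector; the model outputs below are hypothetical:

```python
# Sketch of the easiness/hardness statistic from Section 2.1.
# Each instance is represented by a bit-vector v(t): v_i(t) = 1 if model i
# labeled the instance correctly, 0 otherwise. All data here is illustrative.

def easiness(bit_vector):
    """Fraction of models that labeled the instance correctly."""
    return sum(bit_vector) / len(bit_vector)

def categorize(bit_vector, hard_max=0.1, easy_min=0.9):
    """Label an instance 'hard', 'easy', or 'medium' by its easiness."""
    e = easiness(bit_vector)
    if e <= hard_max:
        return "hard"
    if e >= easy_min:
        return "easy"
    return "medium"

# Example with 10 hypothetical models:
print(categorize([1] * 10))        # every model correct -> easy
print(categorize([0] * 9 + [1]))   # easiness = 0.1 -> hard
print(categorize([1, 0] * 5))      # easiness = 0.5 -> medium
```

With 400 models, as in the experiment, the bit-vectors would simply be 400 elements long; the thresholds 0.1 and 0.9 are those given in the text.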
Hard and easy instances can be further divided. We focus on hard instances, which most models cannot recognize correctly.

2.2 Information Gain

Information gain [1] is used to calculate the contribution of each feature used in the NER system. The entropy of the NE classes, H(C), is defined by

    H(C) = - Σ_{c ∈ C} p(c) log2 p(c),

where p(c) = n(c)/N, n(c) stands for the number of words in class c, and N stands for the total number of words in the data pool.

When a feature F is given, the conditional entropy of the NE classes, H(C|F), is defined by

    H(C|F) = - Σ_{c ∈ C} Σ_{f ∈ F} p(c, f) log2 p(c|f),

where p(c, f) = n(c, f)/N and p(c|f) = n(c, f)/n(f); n(c, f) stands for the number of words in class c with feature value f, and n(f) stands for the number of words with feature value f.

The information gain for the NE classes and a feature F, I(C;F), can be calculated as

    I(C;F) = H(C) - H(C|F).

The information gain shows how the feature F contributes to the classification. I(C;F) equals 0 if feature F is completely independent of C, and equals 1 if F gives sufficient information to label named entities.

To compare different features, the information gain has to be normalized as the information ratio:

    GR(C;F) = I(C;F) / H(C).

GR(C;F) ratios lie between 0 and 1 and can be compared even if the class entropies are different.

3. EXPERIMENT

3.1 Data set and models

GENIA corpus version 3.02 was used in this experiment. 36 classes were used to annotate the corpus. SVM [2] was selected as the supervised model in the test, and 400 different models were used. 40% of the corpus, taken from the beginning, was used for testing. 24% of the corpus (randomly sampled) was used to train each of the 400 different models. No cascaded entities existed in this experiment; only the longest entity was annotated.

3.2 Results

Using the method described above, errors were successfully classified into three types: 1. boundary errors without classification errors; 2. boundary errors with classification errors; 3. classification errors without boundary errors.

Most of the errors were caused by inconsistent annotations. For example:

1. .. in normal T cells in which IL-2R alpha expression has been induced.

2. .. are activated in normal T cells in response to IL-2.

In the first sentence, "T cells" without "normal" was annotated as a cell type, while in the second sentence, "normal T cells" was annotated as a cell type in the original corpus.

In the results, a kind of error was found which we call incomplete forms. For example:

1. protein kinase C-alpha, -epsilon, and -zeta

2. <protein_molecule> LMP1 and 2 </protein_molecule>

Forms like '-epsilon' and '-zeta' are incomplete, and they need to be recovered to their full terms, 'C-epsilon' and 'C-zeta'.

4. CONCLUSIONS

Corpus error analysis is an important step in improving the accuracy of BioNER. The easiness and hardness statistics used here are effective in measuring the degree of hardness that a model has in recognizing an entity. We focused on the hard entities, and this made it easy to collect all the errors in the experimental results. It also allowed us to select error categories for drill-down analysis. The importance of a feature can be learned by using the information gain, and from the important features, evidence can be found to strengthen the results. We used these two methods together, and this helped us to find inconsistent annotations in the GENIA corpus.

5. REFERENCES

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.

[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, New York, NY, 2000.

[3] S. Dingare, M. Nissim, J. Finkel, C. Manning, and C. Grover. A system for identifying named entities in biomedical text: How results from two evaluations reflect on both the system and the evaluations. Comparative and Functional Genomics, 2005.

[4] Y. Krymolowski. Distinguishing easy and hard instances. In International Conference on Computational Linguistics, 2002.

[5] G. Zhou. Recognizing names in biomedical texts using hidden Markov model and SVM plus sigmoid. In International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), 2004.