<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards a methodology for entity error analysis in annotated corpora</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Qi</forename><surname>Wei</surname></persName>
							<email>qiwei@nii.ac.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Informatics</orgName>
								<address>
									<addrLine>2-1-2, Chiyoda-ku</addrLine>
									<postCode>101-8430</postCode>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yuval</forename><surname>Krymolowski</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Haifa</orgName>
								<address>
									<postCode>31905</postCode>
									<country key="IL">Israel</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nigel</forename><surname>Collier</surname></persName>
							<email>collier@nii.ac.jp</email>
							<affiliation key="aff2">
								<orgName type="institution">National Institute of Informatics</orgName>
								<address>
									<addrLine>2-1-2, Chiyoda-ku</addrLine>
									<postCode>101-8430</postCode>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards a methodology for entity error analysis in annotated corpora</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">EA2DB596DDA4FC85D4E83ECEF01B5E09</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>I.2.7 [Artificial Intelligence]: Natural language processing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present a methodology for error analysis in entity annotation. To increase the accuracy of annotated corpora, analysis methods are needed for detecting human annotation errors and schema inconsistencies. We use easiness statistics and information gain to investigate possible causes of error in the GENIA corpus of MEDLINE abstracts.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>With the rapid expansion of biomedical research, an overwhelming number of publications are being produced that must be searched. To help with this task, text mining has been applied in areas ranging from the extraction of signal transduction pathways to the analysis of infectious disease outbreaks. Within text mining, named entity recognition (NER), which seeks to identify and classify terms into predefined target classes, is regarded as the first key stage in mapping text to a computable semantic representation.</p><p>NER originated from the Message Understanding Conferences (MUC) in the 1990s. The MUC task was to identify terms such as person names, organization names, etc., in the newswire domain. During the last few years, NER in the biological domain has improved rapidly. The task in biological named entity recognition (BioNER) is to identify and label DNA and other biological products. The accuracy for BioNER (about 70%) is much lower than the average 90% accuracy for the MUC task. Compared with the newswire domain, entities in the biomedical domain tend to be more complex due to factors such as long, descriptive naming conventions and conjunctive and disjunctive structures.</p><p>In most current error analyses <ref type="bibr" target="#b3">[3,</ref><ref type="bibr">5]</ref>, one selects a fixed number of errors and classifies them manually. In such cases, there is a critical need for analysis tools and methods for detecting human annotation errors and schema inconsistencies.</p><p>In this paper, we present a general method for error analysis on annotated corpora. By applying this method, we can examine every error in the test data and obtain more detailed information about the errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">METHOD</head><p>After obtaining the test results from 400 models, we applied the easiness and hardness statistics <ref type="bibr" target="#b4">[4]</ref> to each instance. We then constructed a confusion matrix from the hard instances. In addition, we used the information gain derived from the easiness and hardness statistics to calculate the contribution of each feature used in the NER system.</p></div>
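The confusion-matrix step above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name and the toy (gold, predicted) label pairs are assumptions.

```python
# Hypothetical sketch: tally a confusion matrix over the hard instances
# only, counting (gold label, predicted label) pairs. A Counter keyed by
# the pair serves as a sparse confusion matrix.
from collections import Counter

def hard_confusion_matrix(hard_instances):
    """hard_instances: iterable of (gold_label, predicted_label) pairs."""
    return Counter(hard_instances)

# Toy example with invented labels.
matrix = hard_confusion_matrix([
    ("cell_type", "protein"),
    ("cell_type", "protein"),
    ("DNA", "RNA"),
])
print(matrix[("cell_type", "protein")])  # 2
```

The frequent cells of this matrix point at systematic confusions between classes, which is what drives the drill-down analysis in later sections.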
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Easiness and hardness statistics</head><p>Easiness and hardness statistics were first introduced by Krymolowski <ref type="bibr" target="#b4">[4]</ref>. Consider a collection of models with similar recall and precision; the sets of words each model classifies correctly may still differ. If a word is classified correctly by all models, it is treated as easy, and if it is classified wrongly by all models, it is treated as hard. The definitions of easiness and hardness come from this idea. Let L denote a set of supervised learning models and T the set of test data. Each instance t ∈ T can be characterized by a bit-vector:</p><formula xml:id="formula_0">v(t) = (v_1(t), …, v_n(t)), where v_i(t) = 1 if t was labeled correctly by model i, and v_i(t) = 0 otherwise.</formula><p>Easiness is defined from the vector v(t):</p><formula xml:id="formula_1">easiness(t) = (1/n) Σ_{i=1}^{n} v_i(t)</formula><p>which is the fraction of models that label t correctly, an estimate of the probability that one of the classification models labels t correctly. The value of easiness(t) lies between 0 and 1. We call an instance hard if its easiness is between 0 and 0.1, and easy if its easiness is between 0.9 and 1.</p><p>Hard and easy instances can be further subdivided. We focus on the hard instances, which most models cannot recognize correctly.</p></div>
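The definitions above translate directly into code. The following is a small sketch under the paper's thresholds (hard ≤ 0.1, easy ≥ 0.9); the function names and the ten-model toy vectors are illustrative assumptions, not part of the original method description.

```python
# Sketch of the easiness statistic: given a bit-vector of per-model
# outcomes for one instance (1 = labeled correctly, 0 = labeled wrongly),
# easiness is the fraction of models that got the instance right.

def easiness(outcomes):
    """outcomes: list of 0/1 flags, one entry per model."""
    return sum(outcomes) / len(outcomes)

def classify_instance(outcomes, hard_max=0.1, easy_min=0.9):
    """Bucket an instance as 'hard', 'easy', or 'intermediate'
    using the thresholds from the paper."""
    e = easiness(outcomes)
    if e <= hard_max:
        return "hard"
    if e >= easy_min:
        return "easy"
    return "intermediate"

# Toy example with 10 models instead of the paper's 400.
print(classify_instance([0] * 10))        # hard (easiness 0.0)
print(classify_instance([1] * 9 + [0]))   # easy (easiness 0.9)
print(classify_instance([1] * 5 + [0] * 5))  # intermediate (easiness 0.5)
```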
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Information Gain</head><p>Information gain <ref type="bibr" target="#b1">[1]</ref> is used to calculate the contribution of each feature used in the NER system. The entropy of the NE classes, H(C), is defined by</p><formula xml:id="formula_2">H(C) = − Σ_{c∈C} p(c) log_2 p(c)</formula><p>where p(c) = n(c)/N, n(c) stands for the number of words in class c, and N stands for the total number of words in the data pool. When a feature F is given, the conditional entropy of the NE classes, H(C|F), is defined by</p><formula xml:id="formula_3">H(C|F) = − Σ_{c∈C} Σ_{f∈F} p(c, f) log_2 p(c|f), where p(c, f) = n(c, f)/N and p(c|f) = n(c, f)/n(f)</formula><p>Here n(c, f) stands for the number of words in class c with feature value f, and n(f) stands for the number of words with feature value f.</p><p>The information gain between the NE classes and a feature, I(C;F), can be calculated as:</p><formula xml:id="formula_4">I(C; F) = H(C) − H(C|F)</formula><p>The information gain shows how much the feature F contributes to the classification. I(C;F) equals 0 if the feature F is completely independent of C, and it equals H(C) if F gives sufficient information to label the named entities.</p><p>To compare different features, the information gain has to be normalized, giving the gain ratio:</p><formula xml:id="formula_5">GR(C; F) = I(C; F) / H(C)</formula><p>GR(C;F) takes values between 0 and 1 and can be compared across features even when the class entropies are different.</p></div>
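The quantities H(C), H(C|F), I(C;F), and GR(C;F) defined above can be computed with a few lines of standard code. This sketch assumes the data pool is given as two parallel lists (class labels and feature values per word); the function names are illustrative.

```python
# Entropy, conditional entropy, and gain ratio as defined in the text:
#   H(C)    = -sum_c p(c) log2 p(c)
#   H(C|F)  = -sum_{c,f} p(c,f) log2 p(c|f)
#   GR(C;F) = (H(C) - H(C|F)) / H(C)
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def conditional_entropy(labels, feature_values):
    n = len(labels)
    groups = {}
    for c, f in zip(labels, feature_values):
        groups.setdefault(f, []).append(c)
    # H(C|F) is the p(f)-weighted average of the per-group entropies.
    return sum((len(cs) / n) * entropy(cs) for cs in groups.values())

def gain_ratio(labels, feature_values):
    hc = entropy(labels)
    return (hc - conditional_entropy(labels, feature_values)) / hc

# A perfectly predictive feature gives GR = 1; an independent one gives 0.
print(gain_ratio(["A", "A", "B", "B"], ["x", "x", "y", "y"]))  # 1.0
print(gain_ratio(["A", "B", "A", "B"], ["x", "x", "y", "y"]))  # 0.0
```

Note that `gain_ratio` is undefined when H(C) = 0 (a single class), which cannot occur in a multi-class NER data pool.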
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENT</head><head n="3.1">Data set and models</head><p>GENIA corpus version 3.02, annotated with 36 classes, was used in this experiment. An SVM <ref type="bibr" target="#b2">[2]</ref> was selected as the supervised model, and 400 different models were used. The first 40% of the corpus was used for testing, and 24% of the corpus (randomly sampled) was used to train the 400 different models. No cascaded entities were used in this experiment; only the longest entity was annotated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Results</head><p>Using the method described above, the errors were successfully classified into three types. One example of inconsistent annotation: in one sentence, "T cells" without "normal" was annotated as a cell type, while in another sentence "normal T cells" was annotated as a cell type in the original corpus.</p><p>Among the results, we also found a kind of error that we call incomplete forms. For example: 1. &lt;protein_molecule&gt; protein kinase C-alpha , -epsilon , and -zeta &lt;/protein_molecule&gt; 2. &lt;protein_molecule&gt; LMP1 and 2 &lt;/protein_molecule&gt; Forms like "-epsilon" and "-zeta" are incomplete, and they need to be recovered to their full terms "C-epsilon" and "C-zeta".</p></div>
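Recovering incomplete forms like "-epsilon" to "C-epsilon" can be approximated with a simple heuristic: find the head token of the first hyphenated form and re-attach it to each bare "-suffix". This is a hypothetical sketch, not the paper's procedure; the function name and regular expressions are assumptions.

```python
# Heuristic recovery of incomplete coordinated forms, e.g.
# "protein kinase C-alpha , -epsilon , and -zeta"
#   -> "protein kinase C-alpha , C-epsilon , and C-zeta"
import re

def expand_incomplete_forms(term):
    # Locate the first full hyphenated form (e.g. "C-alpha") and take
    # its head token ("C") as the part to re-attach.
    m = re.search(r'(\w+)-(\w+)', term)
    if not m:
        return term  # nothing to expand (e.g. "LMP1 and 2")
    head = m.group(1)
    # Prefix bare suffixes like " -epsilon" with the head; the lookbehind
    # skips hyphens that already follow a word character ("C-alpha").
    return re.sub(r'(?<![\w-])-(\w+)', head + r'-\1', term)

print(expand_incomplete_forms("protein kinase C-alpha , -epsilon , and -zeta"))
# protein kinase C-alpha , C-epsilon , and C-zeta
```

Numeric coordination such as "LMP1 and 2" has no hyphen cue and would need separate handling.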
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS</head><p>Corpus error analysis is an important step in improving the accuracy of BioNER. The easiness and hardness statistics used here are effective in measuring how hard an entity is for a model to recognize. We focused on the hard entities, which made it easy to cover all errors in the experimental results and allowed us to select error categories for drill-down analysis. The importance of a feature can be learned using the information gain, and from the important features, evidence can be found to strengthen the results. Using these two methods together helped us find inconsistent annotations in the GENIA corpus.</p></div>		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Classification and Regression Trees</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Olshen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stone</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1984">1984</date>
			<publisher>Wadsworth International Group</publisher>
			<pubPlace>Belmont CA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">An introduction to support vector machines: and other kernel based learning methods</title>
		<author>
			<persName><forename type="first">N</forename><surname>Cristianini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shawe-Taylor</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>New York, NY</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and evaluations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dingare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Finkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Grover</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comparative and Functional Genomics</title>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Distinguishing easy and hard instances</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Krymolowski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Computational Linguistics (COLING)</title>
				<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Recognizing names in biomedical texts using hidden Markov model and SVM plus sigmoid</title>
		<author>
			<persName><forename type="first">G</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA)</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
