<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Concept Extraction Challenge: University of Twente at #MSM2013</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mena</forename><forename type="middle">B</forename><surname>Habib</surname></persName>
							<email>m.b.habib@ewi.utwente.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of EEMCS</orgName>
								<orgName type="institution">University of Twente</orgName>
								<address>
									<settlement>Enschede</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maurice</forename><surname>Van Keulen</surname></persName>
							<email>m.vankeulen@ewi.utwente.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of EEMCS</orgName>
								<orgName type="institution">University of Twente</orgName>
								<address>
									<settlement>Enschede</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Concept Extraction Challenge: University of Twente at #MSM2013</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A9F5DAB95D8710CDB101FFFDEB82E479</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Twitter messages are a potentially rich source of continuously and instantly updated information. The shortness and informality of such messages pose challenges for Natural Language Processing tasks. In this paper we present a hybrid approach to Named Entity Extraction (NEE) and Classification (NEC) for tweets. The system combines Conditional Random Fields (CRF) and Support Vector Machines (SVM) to achieve better results. For named entity type classification we use the AIDA disambiguation system [8] to disambiguate the extracted named entities and hence find their types.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Our Approach</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Named Entity Extraction</head><p>For this task, we made use of two well-known state-of-the-art approaches to NER: CRF and SVM. We trained each of them in a different way, as described below. The training targets only entity extraction rather than recognition (extraction and classification). The results obtained from both are merged by union to give the final extraction results.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Twitter is an important source of continuously and instantly updated information. The huge number of tweets contains a large amount of unstructured information about users, locations, events, etc. Information Extraction (IE) is the research field that enables the use of such a vast amount of unstructured, distributed information in a structured way. Named Entity Recognition (NER) is a subtask of IE that seeks to locate and classify atomic elements (mentions) in text into predefined categories such as the names of persons, locations, etc. In this paper we split the NER task into two separate tasks: Named Entity Extraction (NEE), which aims only to detect entity mention boundaries in text; and Named Entity Classification (NEC), which assigns the extracted mention to its correct entity type. For NEE, we used a hybrid approach combining CRF and SVM to achieve better results. For NEC, we first apply the AIDA disambiguation system <ref type="bibr" target="#b7">[8]</ref> to disambiguate the extracted named entities, and then use the Wikipedia categories of the disambiguated entities to find the type of the extracted mention.</p><p>Conditional Random Fields CRF is a probabilistic model that is widely used for NER <ref type="bibr" target="#b4">[5]</ref>. Despite the successes of CRF, its standard training can be very expensive <ref type="bibr" target="#b5">[6]</ref> due to the global normalization. In this task, we used an alternative method called empirical training <ref type="bibr" target="#b8">[9]</ref> to train a CRF model. The maximum likelihood estimate (MLE) of the empirical training has a closed-form solution, so it needs neither iterative optimization nor global normalization. Empirical training can therefore be radically faster than the standard training. Furthermore, the MLE of the empirical training is also an MLE of the standard training, so it can achieve precision competitive with the standard training. Tweet text is tokenized using a special tweet tokenizer <ref type="bibr" target="#b0">[1]</ref>. For each token, the following features are extracted and used to train the CRF: (a) the Part of Speech (POS) tag of the word, provided by a POS tagger designed specifically for tweets <ref type="bibr" target="#b0">[1]</ref>; (b) whether the word's initial character is capitalized; (c) whether all of the word's characters are capitalized.</p><p>Support Vector Machines SVM is a machine learning approach used for classification and regression problems. For our task, we used an SVM to classify whether a tweet segment is a named entity or not. The training process takes the following steps:</p><p>1. Tweet text is segmented using the segmentation approach described in <ref type="bibr" target="#b3">[4]</ref>. Each segment is considered a candidate named entity. We enriched the segments by looking up entity mentions in a Knowledge Base (KB) (here we use YAGO <ref type="bibr" target="#b2">[3]</ref>), as described in <ref type="bibr" target="#b1">[2]</ref>. The purpose of this step is to achieve high recall. To improve precision, we applied filtering hypotheses (such as removing segments that consist of stop words or that have a verb POS tag). 2. For each tweet segment, we extract the following set of features in addition to those mentioned in section 2.1: (a) the joint and conditional probabilities of the segment, obtained from the Microsoft Web N-Gram service <ref type="bibr" target="#b6">[7]</ref>; (b) the stickiness of the segment, as described in <ref type="bibr" target="#b3">[4]</ref>. The selection of the SVM features is based on the claim that disambiguation clues can help in deciding whether a segment is a mention of an entity or not <ref type="bibr" target="#b1">[2]</ref>. 3. An SVM with an RBF kernel is trained to classify whether the candidate segment represents a mention of a named entity or not.</p><p>We take the union of the CRF and SVM results, after removing duplicate extractions, to get the final set of annotations. For overlapping extractions, we prefer the one that appears in YAGO, and then the one with the longer length.</p></div>
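The union-with-overlap-resolution step can be sketched as follows. This is an illustrative sketch, not the authors' code: mentions are modeled as (start, end, text) spans, and in_kb is a hypothetical predicate standing in for a YAGO lookup.

```python
def span_overlaps(a, b):
    # Two (start, end, text) spans overlap if neither ends before the other starts.
    return a[1] > b[0] and b[1] > a[0]

def prefer(a, b, in_kb):
    # Prefer the mention found in the KB; break ties by the longer span.
    key = lambda m: (in_kb(m[2]), m[1] - m[0])
    return max(a, b, key=key)

def merge_extractions(crf, svm, in_kb):
    """Union of CRF and SVM extractions with duplicates removed and
    overlaps resolved (for simplicity, only the first clash is considered)."""
    merged = []
    for cand in sorted(set(crf) | set(svm)):  # union removes exact duplicates
        clash = next((m for m in merged if span_overlaps(m, cand)), None)
        if clash is None:
            merged.append(cand)
        else:
            merged[merged.index(clash)] = prefer(clash, cand, in_kb)
    return merged
```

For example, given an overlapping pair where only the shorter span is a KB mention, the KB match wins over the longer span, mirroring the preference order stated above.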
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Named Entity Classification</head><p>The purpose of NEC is to assign the extracted mention to its correct entity type. For this task, we first use the prior type probability of the given mention in the training data. If the extracted mention is out of vocabulary (i.e., it does not appear in the training set), we apply the AIDA disambiguation system to the extracted mention. AIDA provides the most probable entity for the mention. We get the Wikipedia categories of that entity from the KB to form an entity profile. Similarly, we use the training data to build a profile of Wikipedia categories for each of the entity types (PER, ORG, LOC and MISC).</p><p>To find the type of the extracted mention, we measure the document similarity between the entity profile and the profiles of the four entity types. We assign the mention to the type with the most similar profile.</p><p>If the extracted mention is out of vocabulary and is not assigned to an entity by AIDA, we try to disambiguate its first token. If all those methods fail to find an entity type for the mention, we simply assign the "PER" type.</p></div>
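The profile-matching step can be illustrated with a small sketch. The paper says only "document similarity"; cosine similarity over bags of Wikipedia categories is one plausible instantiation, and names such as classify and type_profiles are hypothetical.

```python
import math
from collections import Counter

def cosine(p, q):
    # Cosine similarity between two category bags (Counters).
    shared = set(p).intersection(q)
    dot = sum(p[c] * q[c] for c in shared)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def classify(entity_categories, type_profiles, default="PER"):
    # type_profiles maps a type name (e.g. "LOC") to a Counter of Wikipedia
    # categories seen for training mentions of that type. The default
    # mirrors the paper's fallback to "PER" when no evidence is available.
    entity = Counter(entity_categories)
    scores = {t: cosine(entity, prof) for t, prof in type_profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

An empty entity profile scores zero against every type, so the function falls back to the default "PER" type, as described above.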
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experimental Results</head><p>In this section we show the experimental results of the proposed approaches on the training data. All experiments are done through 4-fold cross validation for training and testing. We use Precision, Recall and F1 as evaluation measures. Table <ref type="table" target="#tab_0">1</ref> shows the NEE results along the phases of the extraction process. Twiner Seg. represents the results of the tweet segmentation algorithm described in <ref type="bibr" target="#b3">[4]</ref>. Yago represents the results of the surface-matching extraction described in <ref type="bibr" target="#b1">[2]</ref>. Twiner∪Yago represents the results of merging the output of the two aforementioned methods. Filter(Twiner∪Yago) represents the results after applying the filtering hypotheses. The purpose of those steps is to achieve as much recall as possible with reasonable precision. The SVM is trained as described in section 2.1 to find which segments represent true named entities. The CRF is trained and tested on tokenized tweets to extract any named entity regardless of its type. CRF∪SVM is the union of the results of CRF and SVM. Table <ref type="table" target="#tab_1">2</ref> shows the final results of both extraction with CRF∪SVM and entity classification using the method presented in section 2.2 (AIDA Disambiguation + Entity Categorization). It also shows the results of a CRF trained to recognize (extract and classify) named entities, which we consider our baseline. Our method of separating extraction and classification outperforms the baseline.</p></div>
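The evaluation measures follow the standard definitions. The helper below is a generic sketch rather than the evaluation code used in the paper; as a sanity check, it reproduces the Table 1 F1 value for CRF∪SVM from the reported precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    # Standard definitions from true-positive, false-positive and
    # false-negative counts, guarding against division by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# e.g. the CRF∪SVM row of Table 1: Pre. 0.7166, Rec. 0.7988
# f1_score(0.7166, 0.7988) is approximately 0.7555
```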
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>In this paper, we presented our approach for the Concept Extraction challenge. We split the NER task into two separate tasks: NEE, which aims only to detect entity mention boundaries in text; and NEC, which assigns the extracted mention to its correct entity type. For NEE we used a hybrid approach of CRF and SVM to achieve better results. For NEC we used the AIDA disambiguation system to disambiguate the extracted named entities and hence find their types.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(c) The segment frequency over a collection of around 5 million tweets 1 . (d) Whether the segment appears in WordNet. (e) Whether the segment appears as a mention in the Yago KB. (f) The AIDA disambiguation score for the disambiguated entity of that segment (if any).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Extraction Results</figDesc><table><row><cell></cell><cell>Pre.</cell><cell>Rec.</cell><cell>F1</cell></row><row><cell>Twiner Seg.</cell><cell>0.0997</cell><cell>0.8095</cell><cell>0.1775</cell></row><row><cell>Yago</cell><cell>0.1489</cell><cell>0.7612</cell><cell>0.2490</cell></row><row><cell>Twiner∪Yago</cell><cell>0.0993</cell><cell>0.8139</cell><cell>0.1771</cell></row><row><cell>Filter(Twiner∪Yago)</cell><cell>0.2007</cell><cell>0.8066</cell><cell>0.3214</cell></row><row><cell>SVM</cell><cell>0.7959</cell><cell>0.5512</cell><cell>0.6514</cell></row><row><cell>CRF</cell><cell>0.7157</cell><cell>0.7634</cell><cell>0.7387</cell></row><row><cell>CRF∪SVM</cell><cell>0.7166</cell><cell>0.7988</cell><cell>0.7555</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Extraction and Classification Results</figDesc><table><row><cell></cell><cell>Pre.</cell><cell>Rec.</cell><cell>F1</cell></row><row><cell>CRF</cell><cell>0.6440</cell><cell>0.6324</cell><cell>0.6381</cell></row><row><cell>AIDA Disambiguation + Entity Categorization</cell><cell></cell><cell>0.6545</cell><cell>0.6900</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://wis.ewi.tudelft.nl/umap2011/ + TREC 2011 Microblog track collection.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Part-of-speech tagging for Twitter: annotation, features, and experiments</title>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mills</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Eisenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heilman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yogatama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Flanigan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 49th ACL conference, HLT &apos;11</title>
				<meeting>of the 49th ACL conference, HLT &apos;11</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="42" to="47" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Unsupervised improvement of named entity extraction in short informal context using disambiguation clues</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Habib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Keulen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Workshop on Semantic Web and Information Extraction (SWAIE)</title>
				<meeting>of the Workshop on Semantic Web and Information Extraction (SWAIE)</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">YAGO2: Exploring and querying world knowledge in time, space, context, and many languages</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Suchanek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Berberich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lewis-Kelham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Melo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of WWW 2011</title>
				<meeting>of WWW 2011</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">TwiNER: named entity recognition in targeted Twitter stream</title>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Datta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-S</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 35th ACM SIGIR conference, SIGIR &apos;12</title>
				<meeting>of the 35th ACM SIGIR conference, SIGIR &apos;12</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="721" to="730" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons</title>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 7th HLT-NAACL conference, CONLL &apos;03</title>
				<meeting>of the 7th HLT-NAACL conference, CONLL &apos;03</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="188" to="191" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Piecewise training of undirected models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of UAI</title>
				<meeting>of UAI</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="568" to="575" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An overview of Microsoft Web N-gram corpus and applications</title>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Thrasher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Viegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-J</forename><forename type="middle">P</forename><surname>Hsu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the NAACL HLT 2010</title>
				<meeting>of the NAACL HLT 2010</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="45" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">AIDA: An online tool for accurate disambiguation of named entities in text and tables</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Yosef</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bordino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Spaniol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1450" to="1453" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Closed form maximum likelihood estimator of conditional random fields</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M G</forename><surname>Apers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wombacher</surname></persName>
		</author>
		<idno>TR-CTIT-13-03</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
		<respStmt>
			<orgName>University of Twente</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
