<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">NER from Tweets: SRI-JU System @MSM 2013</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Amitava</forename><surname>Das</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Utsab</forename><surname>Burman</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Sivaji</forename><surname>Bandyopadhyay</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Samsung Research India</orgName>
								<address>
									<settlement>Bangalore</settlement>
									<country>India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science &amp; Engineering</orgName>
								<orgName type="institution">Jadavpur University</orgName>
								<address>
									<settlement>Kolkata</settlement>
									<country>India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">NER from Tweets: SRI-JU System @MSM 2013</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A4D9ADBA44F8EDB2D8150DCC886F7A3C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Twitter has become an interesting source for NLP experiments such as entity extraction and user opinion analysis. Because user-generated content is noisy, standard NLP tools rarely perform well on it, and named entity extraction from tweets is no exception. Tweets are usually informal and short (up to 140 characters), and they often contain grammatical errors, misspellings, and unreliable capitalization. These unreliable linguistic features cause traditional NER methods to perform poorly on tweets. This article reports the authors' participation in the Concept Extraction Challenge, Making Sense of Microposts (#MSM2013). Three different system runs were submitted. The first run is the baseline, the second adds capitalization and syntactic features, and the last adds dictionary features. The last run outperformed the other two, reaching 79.57 precision, 71.00 recall and 74.79 F-measure.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Micro posts are the new form of communication on the web. Posts from social networking sites and micro blogs reflect current social, political and other events through the users' own text. Because of the limit on message length (140 characters) and the noise of user-generated content, it is difficult to extract concepts from them.</p><p>The different forms of user-generated noise make Twitter text extremely noisy for standard NLP tasks. They include:</p><p>a. Abbreviations and short forms based on phonetic spelling. Examples: nite ("night"), sayin ("saying"), and letter/number combinations such as gr8 ("great").</p><p>b. Acronyms. Examples: lol ("laugh out loud"), iirc ("if I remember correctly").</p><p>c. Typing errors and misspellings. Examples: wouls ("would"), rediculous ("ridiculous").</p><p>d. Punctuation omission or error. Examples: im ("I'm"), dont ("don't").</p><p>e. Non-dictionary slang. This category includes word sense disambiguation (WSD) problems caused by slang uses of standard words, e.g. that was well mint ("that was very good"). It also includes specific cultural references and group memes.</p><p>f. Wordplay. This includes phonetic spelling and intentional misspelling for verbal effect, e.g. that was soooooo great ("that was so great").</p><p>g. Censor avoidance. This includes the use of numbers or punctuation to disguise vulgarities, e.g. sh1t, f***.</p><p>h. Emoticons. Although a human reader usually recognizes them, emoticons are not usually understood in NLP tasks such as Machine Translation and Information Retrieval. Examples: :) (smiling face), &lt;3 (heart).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Data</head><p>The work was carried out on the MSM-2013 dataset, which was released as two subsets: training and test. No development set was provided, so the training data was further divided into two subsets in a 70%-30% ratio. Named entities are treated as two kinds: single-word NEs and multiword NEs. The training data was divided according to the presence of the 4 named entity types, each in its single-word and multiword form. The resulting statistics are given in Table <ref type="table">1</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 1 .</head><label>1</label><figDesc>NE Distribution of Training and Development Set</figDesc><table /></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiment</head><p>Three different runs were submitted. The system is CRF-based, and its features are described below. The YamCha toolkit was used for the CRF implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Baseline</head><p>Our baseline system uses part-of-speech tags and stemmed tokens to train the baseline classifier. To obtain the POS tags of a micro post, we used the CMU POS tagger <ref type="bibr" target="#b0">1</ref>, which is specialized for tweets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Capitalization</head><p>Capitalization of tokens is one of the key features for recognizing named entities in micro posts. It is used as a binary feature in the classifier.</p></div>
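<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of such a binary capitalization feature is given here for illustration. The function name and the choice to test only the first character are assumptions; the exact definition used in the submitted system is not specified.</p>
<p>
```python
def capitalization_feature(token):
    """Return 1 if the token starts with an uppercase letter, else 0.

    Assumption: only the first character is inspected; the original
    system may have used a different definition.
    """
    return 1 if token[:1].isupper() else 0


# Example tweet tokens (hypothetical).
tokens = ["visiting", "Kolkata", "with", "@amit", "lol"]
features = [capitalization_feature(t) for t in tokens]
print(list(zip(tokens, features)))
```
</p><p>Because capitalization in tweets is unreliable, a feature like this is best read as one clue among several and not as a decision rule.</p></div>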
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Predicate Rules</head><p>A named entity in a sentence generally appears close to a function word, for example in, of or near. N-gram rules based on such words were developed and used to train the classifier.</p></div>
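<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to express a function-word predicate is sketched below. The word list contains only the three examples named above, and the window size and function name are assumptions made for illustration; the actual n-gram rules of the system are not listed in this paper.</p>
<p>
```python
# Only the three function words mentioned in the text; the real list is unknown.
FUNCTION_WORDS = {"in", "of", "near"}


def function_word_predicates(tokens, window=1):
    """Mark each token with 1 if a function word occurs within the
    preceding `window` tokens, else 0 (hypothetical rule)."""
    feats = []
    for i in range(len(tokens)):
        start = max(0, i - window)
        context = tokens[start:i]
        hit = any(w.lower() in FUNCTION_WORDS for w in context)
        feats.append(1 if hit else 0)
    return feats


tokens = ["She", "lives", "in", "Kolkata", "near", "Howrah"]
print(function_word_predicates(tokens))
```
</p><p>In this example the tokens that follow "in" and "near" receive the value 1, which is the kind of positional clue the predicate rules are meant to capture.</p></div>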
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Out of Vocabulary Words</head><p>Most named entities are not dictionary words. We used the Samsad<ref type="foot" target="#foot_1">2</ref> and NICTA<ref type="foot" target="#foot_2">3</ref> dictionaries in the experiment.</p></div>
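<div xmlns="http://www.tei-c.org/ns/1.0"><p>The out-of-vocabulary clue can be sketched as a simple dictionary lookup. The tiny word set below is a hypothetical stand-in for the dictionaries named above, and the case-insensitive matching is an assumption.</p>
<p>
```python
# Hypothetical stand-in for a real dictionary word list.
DICTIONARY = {"i", "am", "going", "to", "the", "concert", "with"}


def oov_feature(token, dictionary=DICTIONARY):
    """Return 1 if the token is not found in the dictionary, else 0.

    Assumption: lookup is case-insensitive.
    """
    return 0 if token.lower() in dictionary else 1


tokens = ["Going", "to", "the", "concert", "with", "Sivaji"]
print([oov_feature(t) for t in tokens])
```
</p><p>Tokens flagged as out of vocabulary are only candidates: misspellings and slang are also absent from dictionaries, so the flag is combined with the other features.</p></div>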
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Gazetteers</head><p>Two separate lists were added for the Location and MISC types. The LOC list consists of 220 country names and 100 popular city names. The MISC list has 110 NEs of different types, mostly taken from error cases in the Dev set.</p><p>We experimented with a series of features. Tweets are extremely noisy, so a concise set of named entity clues is very hard to finalize. The person and organization categories are relatively simple, but the location and miscellaneous categories are very hard for a classifier.</p></div>
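<div xmlns="http://www.tei-c.org/ns/1.0"><p>A gazetteer feature of this kind can be sketched as a phrase lookup over token n-grams. The entries, the maximum phrase length and the function name below are hypothetical; the actual LOC and MISC lists are not reproduced in this paper.</p>
<p>
```python
# Hypothetical gazetteer entries, stored in lowercase.
LOC_GAZETTEER = {"india", "kolkata", "new york"}
MISC_GAZETTEER = {"world cup"}


def gazetteer_tags(tokens, gazetteer, max_len=3):
    """Return a 0/1 flag per token: 1 if the token is part of any
    n-gram (up to max_len tokens) found in the gazetteer."""
    tags = [0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            if " ".join(lowered[i:i + n]) in gazetteer:
                for j in range(i, i + n):
                    tags[j] = 1
    return tags


tokens = ["Flying", "to", "New", "York", "for", "the", "World", "Cup"]
print(gazetteer_tags(tokens, LOC_GAZETTEER))
print(gazetteer_tags(tokens, MISC_GAZETTEER))
```
</p><p>Matching multiword phrases as well as single tokens matters here because many location and miscellaneous entities span more than one word.</p></div>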
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Performance</head><p>The performance results on the Dev set are reported in Table <ref type="table" target="#tab_0">2</ref>. Note that the results on the test set are yet to be evaluated by the MSM organizers.</p><p>We ran multiple iterations to reach the final accuracy. Broadly, they fall into 5 configurations, listed below. Of these, the 3 best runs (1, 3 and 5) were submitted. The features used in each run are listed below, and the scores are given in Table <ref type="table" target="#tab_0">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>1) Baseline: POS + Stem</p><p>2) 1 + Capitalization: Capitalization feature</p><p>3) 2 + N-Grams FW Predicates: in, of, or features</p><p>4) 3 + OOV</p><p>5) 4 + Gazetteers: LOC Dict + MISC Dict</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper we present a novel feature-based method for the identification and classification of named entities. Classifying named entities in Twitter data is hard because of its noise and non-grammatical nature.</p><p>In this article we report our scores on the dev set. We will incorporate the evaluation scores of #MSM2013 to support our evaluation framework.</p><p>Among the features that took part in our experiments, the gazetteer list is small. We will try to extend it in future.</p><p>We have observed that some structural information, for example URLs, mentions and hashtags, can help to improve the results. Our ongoing exploration is to find more viable features that help to capture the semantics of a micro post. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2 .</head><label>2</label><figDesc>Experiment Results on Development Set</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>of-speech tagging for twitter: Annotation, features, and experiments. CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE, 2010. 2. Ritter, Alan, Sam Clark, and Oren Etzioni. "Named entity recognition in tweets: an experimental study." In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1524-1534. Association for Computational Linguistics, 2011. 3. Finin, Tim, et al. "Annotating named entities in Twitter data with crowdsourcing." Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010. 4. Han, Bo, and Timothy Baldwin. "Lexical normalisation of short text messages: Makn sens a# twitter." In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 368-378. 2011.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.ark.cs.cmu.edu/TweetNLP/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://dsal.uchicago.edu/dictionaries/biswas-bengali/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.csse.unimelb.edu.au/~tim/etc/emnlp2012-lexnorm.tgz</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot">#MSM2013 • Concept Extraction Challenge • Making Sense of Microposts III</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Part-of-speech tagging for Twitter: Annotation, features, and experiments</title>
		<author>
			<persName><forename type="first">Kevin</forename><surname>Gimpel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nathan</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Brendan</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dipanjan</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Mills</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Eisenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Heilman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dani</forename><surname>Yogatama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Flanigan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noah</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
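		<imprint>
			<publisher>Carnegie Mellon University, School of Computer Science</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>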
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
