<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Filter-Stream Named Entity Recognition: A Case Study at the MSM2013 Concept Extraction Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Diego</forename><surname>Marinho De Oliveira</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Ciência da Computação</orgName>
								<orgName type="institution">Universidade Federal de Minas Gerais</orgName>
								<address>
									<settlement>Belo Horizonte</settlement>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alberto</forename><forename type="middle">H F</forename><surname>Laender</surname></persName>
							<email>laender@dcc.ufmg.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Ciência da Computação</orgName>
								<orgName type="institution">Universidade Federal de Minas Gerais</orgName>
								<address>
									<settlement>Belo Horizonte</settlement>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Adriano</forename><surname>Veloso</surname></persName>
							<email>adrianov@dcc.ufmg.br</email>
							<affiliation key="aff0">
								<orgName type="department">Departamento de Ciência da Computação</orgName>
								<orgName type="institution">Universidade Federal de Minas Gerais</orgName>
								<address>
									<settlement>Belo Horizonte</settlement>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Altigran</forename><forename type="middle">S</forename><surname>Da Silva</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">Universidade Federal do Amazonas</orgName>
								<orgName type="institution" key="instit2">Instituto de Computação</orgName>
								<address>
									<settlement>Manaus</settlement>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Filter-Stream Named Entity Recognition: A Case Study at the MSM2013 Concept Extraction Challenge</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C054BD5D855B046B6DC5F96981FA8F22</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Twitter</term>
					<term>Named Entity Recognition</term>
					<term>FS-NER</term>
					<term>CRF</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messages are terse, poorly worded and posted in many different languages. In this paper, we briefly describe a novel NER approach, called FS-NER (Filter Stream Named Entity Recognition), to deal with Twitter data, and present the results of a preliminary performance evaluation conducted to assess it in the context of the Concept Extraction Challenge proposed by the 2013 Workshop on Making Sense of Microposts (MSM2013). FS-NER is characterized by the use of filters that process unlabeled Twitter messages, being much more practical than existing supervised CRF-based approaches. Such filters can be combined either in sequence or in parallel in a flexible way. Our results show that, despite the simplicity of the filters used, our approach outperformed the baseline with improvements of 4.9% on average, while being much faster.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In this paper, we briefly describe a novel NER approach, called FS-NER (Filter Stream Named Entity Recognition), and present the results of a preliminary performance evaluation conducted to assess it in the context of the Concept Extraction Challenge proposed by the 2013 Workshop on Making Sense of Microposts (MSM2013). Traditional approaches for Named Entity Recognition (NER) have proven successful when applied to data obtained from typical Web documents, but they are ill suited to Twitter data <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, since Twitter messages are composed of few words and usually written in an informal, sometimes cryptic style. FS-NER is an alternative NER approach better suited to deal with Twitter data <ref type="bibr" target="#b0">[1]</ref>. In this approach, the NER process is viewed as a coarse-grain flow of Twitter messages (i.e., a Twitter stream) controlled by a series of components referred to as filters. A filter receives a Twitter message coming on the stream, performs specific processing on it and returns information about possible entities in the message (i.e., each filter is responsible for recognizing entities according to some specific criterion). Specifically, FS-NER employs five lightweight filters, exploiting nouns, terms, affixes, context and dictionaries. These filters are extremely fast and independent of grammar rules, and may be combined in sequence (emphasizing precision) or in parallel (emphasizing recall).</p><p>In our performance evaluation, we ran a set of experiments using micropost data made available by the challenge organizers. Our aim in this challenge was, given a short message (i.e., a micropost), to recognize concepts generally defined as "abstract notions of things". 
Thus, for the purpose of the challenge our task was constrained to the extraction of entity concepts found in micropost data, characterised by a type and a value, and considering four entity types: Person, Organization, Location and Miscellaneous. We also employed a state-of-the-art CRF-based baseline. Our results show that, despite the simplicity of the filters used, our approach outperformed the baseline with improvements of 4.9% on average, while being much faster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Proposed Approach</head><p>FS-NER adopts filters that allow the NER task to be executed by dividing it into several recognition processes that run in a distributed way. Furthermore, FS-NER adopts a simple yet effective probabilistic analysis to choose the most suitable label for the terms in the message being processed. Because of this lightweight structure, FS-NER is able to process large amounts of data in real time. In what follows, we briefly describe the main aspects of FS-NER. More details can be found in <ref type="bibr" target="#b0">[1]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Structure and Design</head><p>Let S = &lt; m 1 , m 2 , . . . &gt; be a stream of messages (i.e., tweets), where each m j in S is expressed by a pair (X, Y ), with X being a list of terms [x 1 , x 2 , . . . , x n ] that compose m j and Y a list of labels [y 1 , y 2 , . . . , y n ], such that each label y i is associated with the corresponding term x i and assumes one of the values in the set {Beginning, Inside, Last, Outside, UnitToken}. While X is known in advance for all messages in S, the values of the labels in Y are unknown and must be predicted. For example, the tweet "RT: I love Mary" could be represented by</p><formula xml:id="formula_0">([x 1 = RT:, x 2 = I, x 3 = love, x 4 = M ary], [y 1 = Outside, y 2 = Outside, y 3 = Outside, y 4 = U nitT oken]).</formula><p>To properly predict the labels in Y , we need representative data to generate a recognition model. In FS-NER, a filter is a processing component that estimates the probability of the labels associated with the terms of a message. A set of features supports the training of the filters (such features include, for instance, the term itself or whether its first letter is uppercase). If a term in X satisfies one of these features, we say that the corresponding filter is activated by the term. Using the training set, we may count the number of times a filter is activated by a given term and, by inspecting the corresponding label, calculate the likelihood of each pair {x i , y i } for each filter, as expressed by the equation</p><formula xml:id="formula_1">P (y i = l|X ∧ F = k) = θ l (1)</formula><p>where F is a random variable indicating that filter k is being used and θ l is the probability of associating the label l with the term x i . 
The probability θ l is given by Equation <ref type="formula" target="#formula_2">2</ref>, where T P is the number of true positive cases and F N is the number of false negative cases for the term x i .</p><formula xml:id="formula_2">θ l = T P / (T P + F N )<label>(2)</label></formula><p>Thus, once trained, a filter is able to recognize entities present in upcoming messages. It is worth noting that each filter employs a different recognition strategy (i.e., a different feature), and thus different predictions are possible for different filters.</p><p>In sum, filters are simple abstract models that receive as input a list of terms X and a term x i ∈ X, and provide as output a set of labels with the associated likelihoods, denoted by {l, θ l }. Thus, a filter can be defined by</p><formula xml:id="formula_3">(X, x i ) input −−−→ F output − −−− → {l, θ l }.</formula><p>During the recognition step, the set {l, θ l } is used to choose the most likely label for the term x i . However, if used in isolation, filters may not capture specific patterns that can be used for recognition. Fortunately, we may exploit filter combinations to boost recognition performance. Specifically, we may combine filters either in sequence (if we want to prioritize recognition precision) or in parallel (if we want to prioritize recognition recall). If combined in sequence, all filters must be activated by the input term, and the corresponding set {l, θ l } is obtained by treating the combined filters as a single atomic one using Equation 1. In this case, it is expected that sequentially combined filters are able to capture more specific patterns. In contrast, if combined in parallel, the combined filters are not treated as an atomic one. 
Instead, the combined likelihood is simply the average of the corresponding individual likelihoods, as expressed by the equation</p><formula xml:id="formula_4">(1/Z(F)) Σ k=1..K P (y i = l|X ∧ F = k)<label>(3)</label></formula><p>where Z(F) is a normalization function that receives as input a list of filters F and produces as output the number of filters activated by the term x i . Once trained, the recognition models are used to select the most likely label for each term in the upcoming messages.</p></div>
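As an illustration of Equations 1-3, the following Python sketch estimates per-filter label likelihoods from training counts and averages them across activated filters. The `Filter` class, the feature callables and the data layout are hypothetical, chosen only to make the probabilistic machinery concrete; they are not the authors' implementation.

```python
# Sketch of FS-NER filter training and parallel combination (Eqs. 1-3).
# All names and data structures here are illustrative assumptions.
from collections import defaultdict


class Filter:
    """A filter counts, per feature value, how often each label occurs."""

    def __init__(self, feature):
        self.feature = feature  # maps (terms, i) -> feature value, or None
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, terms, labels):
        for i, label in enumerate(labels):
            key = self.feature(terms, i)
            if key is not None:  # the filter is activated by this term
                self.counts[key][label] += 1

    def likelihoods(self, terms, i):
        """Return {label: theta_l} if activated (Eq. 2), else None."""
        key = self.feature(terms, i)
        if key is None or key not in self.counts:
            return None
        total = sum(self.counts[key].values())  # TP + FN per label
        return {l: c / total for l, c in self.counts[key].items()}


def combine_parallel(filters, terms, i):
    """Average the likelihoods of the activated filters (Eq. 3)."""
    activated = [d for d in (f.likelihoods(terms, i) for f in filters) if d]
    if not activated:
        return {}
    z = len(activated)  # Z(F): number of filters activated by term i
    merged = defaultdict(float)
    for dist in activated:
        for label, p in dist.items():
            merged[label] += p / z
    return dict(merged)
```

A sequential combination would instead be modeled as a single `Filter` whose feature is the conjunction of the individual features, so that Equation 1 applies to the combined filter as an atomic unit.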
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Filter Engineering</head><p>In FS-NER, features are encapsulated by five basic filters: the term, context, affix, dictionary and noun filters.</p><p>The term filter estimates the probability of a certain term being an entity. This estimation is based on the number of times a specific term has been assigned as an entity during the training step. The context filter is especially important since it is able to capture unknown entities. This filter analyzes only the terms around an observed term x i , considering a window of size n, and infers whether x i is an entity or not. The affix filter uses fragments of an observation x i to infer whether it is an entity. Advantageously, this filter can recognize entities whose affixes are similar to those of entities analyzed before. Thus, this filter makes use of the prefix, infix or suffix of the observation to infer its label y i . The dictionary filter uses lists of names of correlated entities to infer whether the observed term is an entity, and is important for inferring entities that do not appear in the training data. The noun filter only considers terms that have just the first letter capitalized to infer whether the observed term is an entity.</p></div>
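The five filters can be pictured as simple feature functions that either fire on a term or stay silent. The sketch below is a hypothetical illustration: the function names, the stand-in `PERSON_NAMES` list and the window/affix handling are assumptions, not the paper's code.

```python
# Hypothetical feature functions for the five FS-NER filters. Each returns
# a feature value when the filter is activated by term i, or None otherwise.
PERSON_NAMES = {"mary", "john"}  # stand-in for a real dictionary list


def term_feature(terms, i):
    # Term filter: the term itself is the feature.
    return terms[i]


def context_feature(terms, i, window=3):
    # Context filter: surrounding terms within the window, term i masked out.
    left = terms[max(0, i - window):i]
    right = terms[i + 1:i + 1 + window]
    return tuple(left) + ("<w>",) + tuple(right)


def affix_feature(terms, i, size=3):
    # Affix filter: prefix and suffix fragments of the observed term.
    w = terms[i]
    return (w[:size], w[-size:]) if len(w) >= size else None


def dictionary_feature(terms, i):
    # Dictionary filter: fires only if the term appears in a name list.
    w = terms[i].lower()
    return w if w in PERSON_NAMES else None


def noun_feature(terms, i):
    # Noun filter: fires only when just the first letter is capitalized.
    w = terms[i]
    return w if w[:1].isupper() and w[1:].islower() else None
```

Each function would be wrapped in a filter that accumulates label counts during training, as in the combination sketch of Section 2.1.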
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Evaluation</head><p>We performed a preliminary evaluation of our approach with the training data made available for the MSM2013 Concept Extraction Challenge. This data includes microposts that refer to entities of types Person (PER), Organization (ORG), Location (LOC) and Miscellaneous (MISC). For this, we performed a 5-fold cross-validation. To reduce noise, we applied simple preprocessing techniques such as removing repeated letters and repeated adjacent terms within a micropost. We also used additional labeled Twitter data from <ref type="bibr" target="#b2">[3]</ref> to improve recognition results for entities of types PER and LOC. The standard filter combination adopted for FS-NER was the generalized term filter combination, which includes all five proposed filters and presented the best performance in <ref type="bibr" target="#b0">[1]</ref>. In the term filter, the terms are case sensitive. The context filter uses prefix and suffix contexts with a window of size three, which presented the best result for F 1 in all collections analyzed. The affix filter uses prefix, infix and suffix sizes of 1 to 3. The dictionary filter uses the same lists of names of correlated entities considered in <ref type="bibr" target="#b2">[3]</ref>, plus others created from Wikipedia pages. The CRF-based framework used as baseline was the one available at http://crf.sourceforge.net, with features functionally similar to the FS-NER filters.</p><p>Table <ref type="table" target="#tab_0">1</ref> presents the obtained results. The line AVG-Diff shows the average difference between the FS-NER and CRF-based framework results for all entity types. These results show that, on average, FS-NER outperformed the CRF-based framework by 4.9% for the F 1 metric.</p><p>Regarding the test dataset labeling, we followed the same procedure adopted in the preliminary experiment discussed above. 
In addition, we trained our approach for each entity type separately and then submitted all results together. In case of any intersection between distinct entity types, we chose the entity type that presented the most precise result among them (i.e., PER &gt; LOC &gt; ORG &gt; MISC). </p></div>
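The type-precedence tie-break described above can be sketched as follows. The span representation `(start, end, type)` and the function name are assumptions made only for illustration; the paper does not specify its internal data structures.

```python
# Sketch of resolving overlapping predictions from per-type models by
# keeping the type whose model was most precise: PER > LOC > ORG > MISC.
PRECEDENCE = {"PER": 0, "LOC": 1, "ORG": 2, "MISC": 3}


def resolve_overlaps(spans):
    """spans: list of (start, end, type) token spans, end exclusive.

    Returns a non-overlapping subset, preferring higher-precedence types.
    """
    kept = []
    for span in sorted(spans, key=lambda s: PRECEDENCE[s[2]]):
        start, end, _ = span
        # Keep the span only if it does not intersect an already-kept one.
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append(span)
    return sorted(kept)
```

For example, an ORG prediction intersecting a PER prediction would be dropped, since the PER model was the most precise of the four.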
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>In this paper, we have briefly described a novel NER approach, called FS-NER (Filter Stream Named Entity Recognition), and presented the results of a performance evaluation conducted to assess it in the context of the Concept Extraction Challenge proposed by the 2013 Workshop on Making Sense of Microposts (MSM2013). In this challenge, our task was constrained to the extraction of entity concepts found in micropost data, characterised by a type and a value, and considering four entity types: Person, Organization, Location and Miscellaneous. We also employed a state-of-the-art CRF-based baseline. Following previous results <ref type="bibr" target="#b0">[1]</ref>, our approach outperformed the baseline with improvements of 4.9% on average, while being much faster.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Results for FS-NER and the CRF-based framework on the challenge training dataset.</figDesc><table><row><cell>Entity Type</cell><cell>Approach</cell><cell>Precision</cell><cell>Recall</cell><cell>F1</cell></row><row><cell>PER</cell><cell>FS-NER</cell><cell>0.7508</cell><cell>0.7546</cell><cell>0.7520</cell></row><row><cell></cell><cell>CRF</cell><cell>0.7688</cell><cell>0.5350</cell><cell>0.6309</cell></row><row><cell>ORG</cell><cell>FS-NER</cell><cell>0.6924</cell><cell>0.4741</cell><cell>0.5612</cell></row><row><cell></cell><cell>CRF</cell><cell>0.7188</cell><cell>0.4702</cell><cell>0.5685</cell></row><row><cell>LOC</cell><cell>FS-NER</cell><cell>0.6961</cell><cell>0.5400</cell><cell>0.6069</cell></row><row><cell></cell><cell>CRF</cell><cell>0.7160</cell><cell>0.4656</cell><cell>0.5643</cell></row><row><cell>MISC</cell><cell>FS-NER</cell><cell>0.5734</cell><cell>0.3322</cell><cell>0.4185</cell></row><row><cell></cell><cell>CRF</cell><cell>0.5610</cell><cell>0.2847</cell><cell>0.3777</cell></row><row><cell>AVG-Diff</cell><cell></cell><cell>-0.0130</cell><cell>0.0864</cell><cell>0.0493</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">• #MSM2013 • Concept Extraction Challenge • Making Sense of Microposts III •</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was partially funded by InWeb -The Brazilian National Institute of Science and Technology for the Web (grant MCT/CNPq 573871/2008-6), and by the authors' individual grants from CNPq, FAPEMIG and FAPEAM.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">FS-NER: A Lightweight Filter-Stream Approach to Named Entity Recognition on Twitter Data</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>De Oliveira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H F</forename><surname>Laender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Veloso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Da Silva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd International World Wide Web Conference (Companion Volume)</title>
				<meeting>the 22nd International World Wide Web Conference (Companion Volume)</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="597" to="604" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments</title>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mills</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Eisenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heilman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yogatama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Flanigan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Association for Computational Linguistics (Short Papers)</title>
				<meeting>the Association for Computational Linguistics (Short Papers)</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="42" to="47" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Named Entity Recognition in Tweets: An Experimental Study</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ritter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><surname>Mausam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1524" to="1534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m">#MSM2013 • Concept Extraction Challenge • Making Sense of Microposts III</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
