<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Memory-based Named Entity Recognition in Tweets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Antal</forename><surname>Van Den Bosch</surname></persName>
							<email>a.vandenbosch@let.ru.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Centre for Language Studies</orgName>
								<orgName type="institution">Radboud University Nijmegen</orgName>
								<address>
									<postCode>NL-6200 HD</postCode>
									<settlement>Nijmegen</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Toine</forename><surname>Bogers</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Royal School of Library and Information Science</orgName>
								<address>
									<addrLine>Birketinget 6</addrLine>
									<postCode>DK-2300</postCode>
									<settlement>Copenhagen</settlement>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Memory-based Named Entity Recognition in Tweets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">04F9CE84D75662184C9A115AED7B9966</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present a memory-based named entity recognition system that participated in the MSM-2013 Concept Extraction Challenge. The system expands the training set of annotated tweets with part-of-speech tags and seedlist information, and then generates a sequential memory-based tagger composed of separate modules for known and unknown words. Two taggers are trained: one on the original capitalized data, and one on a lowercased version of the training data. The intersection of named entities in the predictions of the two taggers is kept as the final output.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Background</head><p>Named-entity recognition can be seen as a labeled chunking task, in which the beginning and ending words of all names belonging to predefined entity categories must be correctly identified, and the category of each entity must be established. A well-known solution is to cast the problem as a token-level tagging task using the IOB or BIO coding scheme <ref type="bibr" target="#b0">[1]</ref>. Preferably, a structured learning approach is used that combines accurate token-level decisions with a more global notion of likely and syntactically correct output sequences.</p><p>Memory-based tagging <ref type="bibr" target="#b1">[2]</ref> is a generic machine-learning-based solution to structured sequence processing that is applicable to IOB-coded chunking. The algorithm has been implemented in MBT, an open source software package.<ref type="foot" target="#foot_0">3</ref> MBT generates a sequential tagger that tags from left to right, taking its own previous tagging decisions into account when generating the next tag. MBT operates with two classifiers. First, the 'known words' tagger handles words in the test data that it has already seen in the training data, and for which it knows the potential tags. Second, the 'unknown words' tagger is invoked to tag words not seen during training. Instead of the word itself, it takes into account character-based features of the word, such as the last three letters and whether or not it is capitalized <ref type="bibr" target="#b1">[2]</ref>.</p><p>Named entity recognition in social media microtexts such as Twitter messages (tweets) is generally approached with standard methods, although it is widely acknowledged that language use in tweets deviates from average written language use in several respects: it features more spelling and capitalization variants than usual, and it may mention a larger variety of people, places and organizations than, for instance, news. Most studies report relatively low scores because of these factors <ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref>.</p></div>
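<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the BIO coding mentioned above, the following Python sketch converts entity spans into one tag per token. It assumes the variant in which every chunk starts with a B tag; the function name, the example tweet, and the category labels PER and LOC are hypothetical and are not taken from the challenge data or from MBT.</p>

```python
def bio_encode(tokens, entities):
    """Turn (start, end, category) spans into one BIO tag per token.

    `end` is exclusive; spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, category in entities:
        tags[start] = "B-" + category          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + category          # remaining tokens of the chunk
    return tags


# Hypothetical example tweet (an assumption for illustration only).
tokens = ["Antal", "van", "den", "Bosch", "visits", "Copenhagen", "today"]
entities = [(0, 4, "PER"), (5, 6, "LOC")]
tags = bio_encode(tokens, entities)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

<p>Once entities are coded this way, chunking reduces to assigning one tag per token, which is the form of problem a sequential tagger such as MBT handles.</p></div>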
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">System Architecture</head><p>Figure <ref type="figure" target="#fig_0">1</ref> displays a schematic overview of the architecture of our system. A new incoming tweet is first enriched with seed list information: for each token in the tweet, the system checks whether it occurs as a geographical name, or as part of a person or organization name, in gazetteer lists for these three types of entities. This produces a token-level code that is either empty (-) or any combination of letters representing occurrence in a person name list (P), a geographical name list (G), or an organizational name list (O). We provide details on the resources we used in our system in Section 3. The tweet is also part-of-speech tagged by a memory-based tagger trained on the Wall Street Journal part of the Penn Treebank <ref type="bibr" target="#b6">[7]</ref>, producing Penn Treebank part-of-speech tags for all tokens at an estimated accuracy of 95.9%. The enriched tweet is then processed by two MBT taggers. The first tagger is trained on the original training data with all capitalization intact; the second tagger is trained on a lowercased version of the training set. Both taggers assign BIO tags to the tokens constituting named-entity chunks <ref type="bibr" target="#b0">[1]</ref>.</p><p>The two MBT modules generate partly overlapping predictions. Only the named entity chunks that are fully identical in the output of the two modules, i.e. their intersection, are kept. The result is a tweet annotated with named entity chunks.</p></div>
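<div xmlns="http://www.tei-c.org/ns/1.0"><p>The token-level seed list code can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the three tiny gazetteers, the function name, and the case-insensitive matching are all hypothetical, whereas the real lists come from the resources described in Section 3.</p>

```python
# Tiny hypothetical gazetteers (assumptions for illustration only).
PERSON_TOKENS = {"antal", "toine", "paris"}      # tokens that occur in person names
GEO_NAMES = {"paris", "nijmegen", "copenhagen"}  # geographical names
ORG_TOKENS = {"radboud", "university"}           # tokens that occur in organization names


def seedlist_code(token):
    """Return '-' or a combination of the letters P, G and O for one token."""
    key = token.lower()  # assumption: matching ignores capitalization
    code = ""
    if key in PERSON_TOKENS:
        code += "P"
    if key in GEO_NAMES:
        code += "G"
    if key in ORG_TOKENS:
        code += "O"
    return code or "-"


tweet = ["Toine", "flew", "from", "Paris", "to", "Radboud", "University"]
codes = [seedlist_code(token) for token in tweet]
for token, code in zip(tweet, codes):
    print(token, code)
```

<p>In this sketch the token "Paris" receives the combined code PG because it appears in both the person list and the geographical list. The lookup is context-insensitive, so a token is coded the same way wherever it appears.</p></div>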
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Resources</head><p>The MBT modules are trained on the official (version 1.5) training data provided for the MSM-2013 Concept Extraction Challenge,<ref type="foot" target="#foot_2">4</ref> complemented with the training and testing data of the CoNLL-2003 Shared Task <ref type="bibr" target="#b7">[8]</ref> and the named-entity annotations in the ACE-2004 and ACE-2005 tasks.<ref type="foot" target="#foot_3">5</ref> The list of geographical names for the seedlist feature is taken from geonames.org;<ref type="foot" target="#foot_4">6</ref> lists of person names and organization names are taken from the JRC Names corpus <ref type="bibr" target="#b8">[9]</ref>.<ref type="foot" target="#foot_5">7</ref></p><p>Table <ref type="table" target="#tab_0">1</ref> displays the overall scores of the final system, that is, the intersection of the two MBT systems, together with the scores of the two systems separately. The test was run on a development set of 22,358 tokens containing 1,131 named entities, extracted from the MSM-2013 training set. The capitalized MBT system attains the best recall, while the lowercased MBT system attains the higher precision of the two. The intersection of the two predictably boosts precision at the cost of a lower recall, and attains the highest F-score of 61.21. If the gazetteer features are disabled, overall precision increases slightly from 65.8 to 66.1, but recall decreases from 57.2 to 54.9, leading to a lower F-score of 60.0. This is a predictable effect of gazetteers: they allow the recognition of more entities, but they import noise due to the context-insensitive matching of names in incorrect entity categories.</p></div>
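<div xmlns="http://www.tei-c.org/ns/1.0"><p>The intersection step and the chunk-level scores can be sketched in Python as follows. The sketch assumes BIO-tagged output with B at the start of every chunk and exact-match scoring of (start, end, category) chunks. The function names and the toy tag sequences are hypothetical, and the challenge's official scorer may differ in detail.</p>

```python
def chunks(tags):
    """Extract a set of (start, end, category) chunks from a BIO tag sequence."""
    found = set()
    start, category = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last chunk
        inside = tag.startswith("I-") and category == tag[2:]
        if start is not None and not inside:
            found.add((start, i, category))        # the open chunk ends here
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]           # a new chunk begins
    return found


def prf(predicted, gold):
    """Exact-match precision, recall and F-score over two chunk sets."""
    correct = len(predicted.intersection(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy sequences (assumptions), one tag per token of a six-token tweet.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
cap_tags = ["B-PER", "I-PER", "B-LOC", "B-LOC", "O", "B-ORG"]   # capitalized tagger
low_tags = ["B-PER", "I-PER", "O", "B-LOC", "B-MISC", "O"]      # lowercased tagger

gold = chunks(gold_tags)
final = chunks(cap_tags).intersection(chunks(low_tags))  # keep identical chunks only

for name, predicted in [("capitalized", chunks(cap_tags)),
                        ("lowercased", chunks(low_tags)),
                        ("intersection", final)]:
    print(name, [round(score, 2) for score in prf(predicted, gold)])
```

<p>In this toy case each tagger makes one false prediction that the other does not share, so the intersection removes both and precision rises, while the organization chunk found only by the capitalized tagger is lost and recall drops. The same F-score formula applied to the intersection row of Table 1, 2 x 65.82 x 57.21 / (65.82 + 57.21), gives the reported 61.21.</p></div>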
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>Table <ref type="table" target="#tab_1">2</ref> lists the precision, recall, and F-scores on the four named entity types distinguished in the challenge. Person names are recognized more accurately than location and organization names; the miscellaneous category is hard to recognize. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. The architecture of our system.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Overall named entity recognition scores by the system and its components</figDesc><table><row><cell>Component</cell><cell>Precision</cell><cell>Recall</cell><cell>F-score</cell></row><row><cell>Capitalized</cell><cell>54.62</cell><cell>63.75</cell><cell>58.83</cell></row><row><cell>Lowercased</cell><cell>57.38</cell><cell>62.86</cell><cell>60.00</cell></row><row><cell>Intersection</cell><cell>65.82</cell><cell>57.21</cell><cell>61.21</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Overall named entity recognition scores on the four entity types</figDesc><table><row><cell>Named entity type</cell><cell>Precision</cell><cell>Recall</cell><cell>F-score</cell></row><row><cell>Person</cell><cell>75.90</cell><cell>69.52</cell><cell>72.57</cell></row><row><cell>Location</cell><cell>54.95</cell><cell>44.25</cell><cell>49.02</cell></row><row><cell>Organization</cell><cell>47.46</cell><cell>39.25</cell><cell>42.97</cell></row><row><cell>Miscellaneous</cell><cell>17.54</cell><cell>11.39</cell><cell>13.85</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">MBT is available in Debian Science: Linguistics, http://blends.alioth.debian.org/science/tasks/linguistics and at http://ilk.uvt.nl/mbt. The software is documented in <ref type="bibr" target="#b2">[3]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1">• #MSM2013 • Concept Extraction Challenge • Making Sense of Microposts III •</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">http://oak.dcs.shef.ac.uk/msm2013/challenge.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">http://projects.ldc.upenn.edu/ace/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">http://download.geonames.org/export/dump/allCountries.zip</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">http://optima.jrc.it/data/entities.gzip</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Representing text chunks</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tjong Kim Sang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Veenstra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EACL&apos;99</title>
				<meeting>EACL&apos;99<address><addrLine>Bergen, Norway</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="173" to="179" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">MBT: A memory-based part of speech tagger generator</title>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zavrel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Berck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gillis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth Workshop on Very Large Corpora, ACL SIGDAT</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Ejerhed</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Dagan</surname></persName>
		</editor>
		<meeting>the Fourth Workshop on Very Large Corpora, ACL SIGDAT</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="14" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zavrel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Van Den Bosch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Van Der Sloot</surname></persName>
		</author>
		<idno>ILK 07-04</idno>
		<title level="m">MBT: Memory based tagger, version 3.0, reference guide</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
		<respStmt>
			<orgName>ILK Research Group, Tilburg University</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Named entity recognition in tweets: an experimental study</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ritter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1524" to="1534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">TwiNER: Named entity recognition in targeted Twitter stream</title>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Datta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 35th international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="721" to="730" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Named entity recognition for tweets</title>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology (TIST)</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">3</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Building a Large Annotated Corpus of English: the Penn Treebank</title>
		<author>
			<persName><forename type="first">M</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Santorini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marcinkiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="313" to="330" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tjong Kim Sang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>De Meulder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CoNLL-2003</title>
				<editor>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Osborne</surname></persName>
		</editor>
		<meeting>CoNLL-2003<address><addrLine>Edmonton, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="142" to="147" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">JRC-Names: A freely available, highly multilingual named entity resource</title>
		<author>
			<persName><forename type="first">R</forename><surname>Steinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pouliquen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kabadjov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Belyaeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Van Der Goot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th International Conference &apos;Recent Advances in Natural Language Processing&apos;</title>
				<meeting>the 8th International Conference &apos;Recent Advances in Natural Language Processing&apos;</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="104" to="110" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
