<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Entity Recognition and Language Identification with FELTS</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Pierre</forename><surname>Jourlin</surname></persName>
							<email>pierre.jourlin@univ-avignon.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Laboratoire d&apos;Informatique</orgName>
								<orgName type="institution">Université d&apos;Avignon</orgName>
								<address>
									<postCode>84911</postCode>
									<settlement>Avignon</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Entity Recognition and Language Identification with FELTS</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">BFAA9FD4D19BF644B4F7A37BC13C6478</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This working notes describe the experiments we conducted in the Microblog Cultural Contextualization Lab <ref type="bibr" target="#b1">[2]</ref> of CLEF 2017 <ref type="bibr" target="#b2">[3]</ref>. The microblog data is composed of very short texts, with very heterogeneous styles. Some of them are written in more than one language. We decided to takle the entity recognition problem by using a non-statistical, dictionary-based, multiword term extractor. On the other hand, our participation in the language identification task is based on word and character uni-gram probabilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Task 1.5: Entity Recognition</head><p>In order to address the entity recognition problem, we make use of a free software that we developed in 2012 : FELTS (for Fast Extractor for Large Term Sets) <ref type="foot" target="#foot_0">1</ref> . It was designed to support very large multi-word term dictionaries such as the list of Wikipedia page titles. Using the Wikipedia's database dumps of march 1st 2017, we were able to provide FELTS with a corpus of :</p><p>-14,971,916 distinct terms, containing 4,811,345 distinct words for English. -3,384,979 distinct terms, containing 1,390,569 distinct words for French. -2,910,899 distinct terms, containing 978,297 distinct words for Spanish. -1,787,280 distinct terms, containing 800,612 distinct words for Portuguese.</p><p>In order to obtain a good level of efficiency, our approach is based on Minimal Perfect Hash Function, more specifically, the Compress, Hash and Displace algorithm <ref type="bibr" target="#b0">[1]</ref>, as it was implemented in the C Minimal Perfect Hashing Library (CMPH) V2.0<ref type="foot" target="#foot_1">2</ref> in 2012.</p><p>We processed the 63,192,980 micro-blog messages of task 1 with a 64 bits personal computer equipped with a Intel Core i7-2600 (an octo-core CPU running at 3.40GHz) and 7,8 Gb of RAM. The English term corpus and the associated hash function needed 3.6Gb of RAM. It took less that half a second to extract the 20,665 English terms contained in the 1095 task 1 "topics" and less than 8 hours to extract the 1,2 billion English terms contained in the 63 millions of micro-blog messages.</p><p>With such an approach, we found it difficult to choose a relevance score for each entity-language pair. Our system simply finds or does not find a Wikipedia entity in a text. However, we believe longer entities are more likely to indicate narrower senses and more relevant topics than shorter entities. We thus decided to simply score the multi-word terms with their character length. For each text of the test data, and each of the 4 languages, we provided the assessors with the 10 longuest extracted entities, ranked by decreasing character length. At the time this paper was written, we were not provided with relevance scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Task 1.2: Language identification</head><p>As an exploratory approach, we used a proven technique for language identification on long text : probabilistic decision based on word uni-grams. The probabilities of a language given a word were computed on two distinct corpora : The Wikipedia full text articles in all the 281 available languages (1st run) and the 63 millions of micro-blogs messages for task 1 (2nd run). Both corpora are very large but they both carry specific issues : high size disparities, distant language levels, multi-byte character encoding, lack of word boundaries, erroneous language identification, untranslated terms, multi-language texts, etc. As we realised that a word-based approach was bound to fail on languages such as Japanese or Korean were word boundaries are not explicit, we submitted a 3rd run based on character uni-gram probabilities.</p><p>We were provided with a partial manual evaluation of our first run. For only 121 out of 1095 microblog messages, our firt run identified a different language than the locale configuration of its author. For 90 of these 121 messages, the language identified by our 1st run was evaluated as correct. 11 of the remaining 31 erroneous identifications occurred on Japanese or Korean texts mixed with english multiword names. The last 20 erroneous indentifications are rather difficult to analyse and various causes such as the co-occurence of original and tanslated named entities can be suspected. Our third run (character uni-grams) seems to be slighlty better for Japanese or Korean but it still mostly fails on multi-language messages and is very weak at classifying languages that shares a same root. Our second run (word uni-grams according the author's locale configuration) found the correct language for 6 of 31 messages for which our first run failed. However, it is overall weaker than our first run.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>The language identification results look very promising. However, we believe that there is still room for improvement and that a combination of several methods, and a specific processing of named entities could help.</p></div>			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/jourlin/FELTS</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://cmph.sourceforge.net/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Compress, hash and displace</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C</forename><surname>Botelho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Belazzougui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dietzfelbinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th European Symposium on Algorithms (ESA2009) Springer LNCS</title>
				<meeting>the 17th European Symposium on Algorithms (ESA2009) Springer LNCS</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">5757</biblScope>
			<biblScope unit="page" from="682" to="693" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Microblog Cultural Contextualization Lab Overview International Conference of the Cross-Language Evaluation Forum for European Languages Proceed</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mothe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sanjuan</surname></persName>
		</author>
		<author>
			<persName><surname>Clef</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ings Springer LNCS volume</title>
				<meeting><address><addrLine>CLEF; Dublin</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J F</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF 2017</title>
		<title level="s">Proceedings. Springer LNCS</title>
		<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">September 11-14, 2017</date>
			<biblScope unit="volume">10456</biblScope>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
