<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Entity Recognition and Language Identification with FELTS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre Jourlin</string-name>
          <email>Pierre.Jourlin@univ-avignon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire d'Informatique, Université d'Avignon</institution>
          ,
          <addr-line>84911 Avignon</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <volume>3</volume>
      <issue>384</issue>
      <abstract>
        <p>These working notes describe the experiments we conducted in the Microblog Cultural Contextualization Lab [2] of CLEF 2017 [3]. The microblog data is composed of very short texts with very heterogeneous styles; some of them are written in more than one language. We decided to tackle the entity recognition problem with a non-statistical, dictionary-based, multi-word term extractor. Our participation in the language identification task, on the other hand, is based on word and character uni-gram probabilities. In order to address the entity recognition problem, we made use of a free software tool that we developed in 2012: FELTS (for Fast Extractor for Large Term Sets)1. It was designed to support very large multi-word term dictionaries, such as the list of Wikipedia page titles. Using the Wikipedia database dumps of March 1st, 2017, we were able to provide FELTS with a corpus of: In order to obtain a good level of efficiency, our approach is based on a Minimal Perfect Hash Function, more specifically the Compress, Hash and Displace algorithm [1], as it was implemented in the C Minimal Perfect Hashing Library (CMPH) V2.02 in 2012. We processed the 63,192,980 micro-blog messages of task 1 with a 64-bit personal computer equipped with an Intel Core i7-2600 (a quad-core, 8-thread CPU running at 3.40GHz) and 7.8 GB of RAM. The English term corpus and the associated hash function needed 3.6 GB of RAM. It took less than half a second to extract the 20,665 English terms contained in the 1095 task 1 "topics" and less than</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>8 hours to extract the 1.2 billion English terms contained in the 63 million micro-blog messages.</p>
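      <p>Under the hood, an extractor of this kind only needs fast membership tests on a very large term dictionary. The following Python sketch illustrates the greedy longest-match idea behind dictionary-based multi-word term extraction; the function name and example terms are illustrative, and a plain set stands in for the minimal perfect hash function that FELTS actually relies on via CMPH.</p>

```python
# Minimal sketch of dictionary-based multi-word term extraction,
# in the spirit of FELTS. A Python set stands in for the minimal
# perfect hash function; terms and texts are illustrative.

def extract_terms(text, term_set, max_len=6):
    """Greedy longest-match extraction of known multi-word terms."""
    tokens = text.lower().split()
    found = []
    i = 0
    while i != len(tokens):
        match = None
        # try the longest candidate window first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in term_set:
                match = candidate
                break
        if match:
            found.append(match)
            i += len(match.split())
        else:
            i += 1
    return found

terms = {"new york", "new york city", "language identification"}
print(extract_terms("Visiting New York City for language identification work", terms))
# prints ['new york city', 'language identification']
```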
      <p>With such an approach, we found it difficult to choose a relevance score for each entity-language pair: our system simply finds or does not find a Wikipedia entity in a text. However, we believe longer entities are more likely to indicate narrower senses and more relevant topics than shorter ones. We thus decided to simply score the multi-word terms by their character length. For each text of the test data, and for each of the 4 languages, we provided the assessors with the 10 longest extracted entities, ranked by decreasing character length. At the time this paper was written, we had not been provided with relevance scores.</p>
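      <p>The length-based scoring above amounts to a one-line ranking. The sketch below makes it concrete; the entity strings are illustrative, not taken from the lab data.</p>

```python
# Sketch of the length-based relevance scoring described above:
# each extracted entity is scored by its character length and the
# k longest are returned in decreasing order (k=10 in our runs).

def top_entities(entities, k=10):
    """Rank extracted entities by character length, longest first."""
    return sorted(entities, key=len, reverse=True)[:k]

found = ["paris", "museum of modern art", "art", "festival d'avignon"]
print(top_entities(found, k=3))
# prints ['museum of modern art', "festival d'avignon", 'paris']
```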
    </sec>
    <sec id="sec-2">
      <title>Task 1.2: Language identification</title>
      <p>As an exploratory approach, we used a proven technique for language identification on long texts: probabilistic decision based on word uni-grams. The probabilities of a language given a word were computed on two distinct corpora: the Wikipedia full-text articles in all 281 available languages (1st run) and the 63 million micro-blog messages of task 1 (2nd run). Both corpora are very large, but each carries specific issues: high size disparities, distant language registers, multi-byte character encodings, lack of word boundaries, erroneous language identification, untranslated terms, multi-language texts, etc. As we realised that a word-based approach was bound to fail on languages such as Japanese or Korean, where word boundaries are not explicit, we submitted a 3rd run based on character uni-gram probabilities.</p>
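      <p>The decision rule in all three runs is the same: pick the language maximizing the product of uni-gram probabilities. A minimal character uni-gram sketch, with add-one smoothing and log probabilities for numerical stability, is shown below; the toy training strings are illustrative stand-ins for the Wikipedia and microblog corpora.</p>

```python
# Minimal sketch of uni-gram language identification: per-language
# character uni-gram probabilities with add-one smoothing, choosing
# the language that maximizes the log-likelihood of the text.
import math
from collections import Counter

def train(corpora):
    """corpora: dict mapping language to training text."""
    models = {}
    for lang, text in corpora.items():
        counts = Counter(text)
        models[lang] = (counts, sum(counts.values()), len(counts) + 1)
    return models

def identify(text, models):
    best_lang, best_score = None, -math.inf
    for lang, (counts, total, vocab) in models.items():
        # add-one smoothed log-likelihood of the character sequence
        score = sum(math.log((counts[ch] + 1) / (total + vocab)) for ch in text)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

models = train({"en": "the quick brown fox jumps over the lazy dog",
                "fr": "portez ce vieux whisky au juge blond qui fume"})
print(identify("the dog", models))
# prints en
```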
      <p>We were provided with a partial manual evaluation of our 1st run. For only 121 out of 1095 microblog messages, our 1st run identified a different language than the locale configuration of its author. For 90 of these 121 messages, the language identified by our 1st run was evaluated as correct. 11 of the remaining 31 erroneous identifications occurred on Japanese or Korean texts mixed with English multi-word names. The last 20 erroneous identifications are rather difficult to analyse, and various causes, such as the co-occurrence of original and translated named entities, can be suspected. Our 3rd run (character uni-grams) seems to be slightly better for Japanese or Korean, but it still mostly fails on multi-language messages and is very weak at distinguishing languages that share a common root. Our 2nd run (word uni-grams according to the author's locale configuration) found the correct language for 6 of the 31 messages on which our 1st run failed. However, it is overall weaker than our 1st run.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The language identification results look very promising. However, we believe that there is still room for improvement, and that a combination of several methods, together with specific processing of named entities, could help.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Botelho</surname>
            ,
            <given-names>F. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belazzougui</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dietzfelbinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Compress, hash and displace</article-title>
          .
          <source>In Proceedings of the 17th European Symposium on Algorithms (ESA 2009), Springer LNCS 5757</source>
          ,
          <fpage>682</fpage>
          -
          <lpage>693</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ermakova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.-Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>SanJuan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>CLEF 2017 Microblog Cultural Contextualization Lab Overview</article-title>
          .
          <source>International Conference of the Cross-Language Evaluation Forum for European Languages, Proceedings, Springer LNCS volume</source>
          , CLEF
          <year>2017</year>
          , Dublin.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G. J. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings</source>
          . Springer LNCS 10456 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>