<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Memory-based Named Entity Recognition in Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antal van den Bosch</string-name>
          <email>a.vandenbosch@let.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toine Bogers</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Language Studies, Radboud University Nijmegen</institution>
          ,
          <addr-line>NL-6200 HD Nijmegen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Royal School of Library and Information Science</institution>
          ,
          <addr-line>Birketinget 6, DK-2300 Copenhagen</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1019</volume>
      <fpage>40</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>We present a memory-based named entity recognition system that participated in the MSM-2013 Concept Extraction Challenge. The system expands the training set of annotated tweets with part-of-speech tags and seedlist information, and then generates a sequential memory-based tagger composed of separate modules for known and unknown words. Two taggers are trained: one on the original capitalized data, and one on a lowercased version of the training data. The intersection of the named entities predicted by the two taggers is kept as the final output.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Named-entity recognition can be seen as a labeled chunking task, where all
beginning and ending words of names of predefined entity categories should be
correctly identified, and the category of the entity needs to be established. A
well-known solution to this task is to cast it as a token-level tagging task using
the IOB or BIO coding scheme [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Preferably, a structured learning approach
is used which combines accurate token-level decisions with a more global notion
of likely and syntactically correct output sequences.
      </p>
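      <p>As an illustration of the BIO coding scheme mentioned above, the following minimal sketch converts labeled chunk spans into one tag per token. The function name, the example sentence, and the labels PER and LOC are assumptions made for this example only; they are not taken from the challenge data.</p>
      <preformat>
```python
def encode_bio(tokens, chunks):
    """Return one BIO tag per token.

    chunks is a list of (start, end, label) spans, with end exclusive.
    """
    tags = ["O"] * len(tokens)            # default: outside any entity
    for start, end, label in chunks:
        tags[start] = "B-" + label        # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label        # remaining tokens of the chunk
    return tags


# Illustrative sentence and spans (assumed, not from the training set).
tokens = ["Antal", "van", "den", "Bosch", "works", "in", "Nijmegen"]
chunks = [(0, 4, "PER"), (6, 7, "LOC")]

for token, tag in zip(tokens, encode_bio(tokens, chunks)):
    print(token, tag)
```
      </preformat>
      <p>Casting the task this way means that a chunk is correct only if every token in it, including its first and last, receives the right tag.</p>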
      <p>
        Memory-based tagging [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a generic machine-learning-based solution to
structured sequence processing that is applicable to IOB-coded chunking. The
algorithm has been implemented in MBT, an open source software package.
MBT generates a sequential tagger that tags from left to right, taking its own
previous tagging decisions into account when generating the next tag. MBT
relies on two classifiers. First, the ‘known words’ tagger handles words in the test
data that it has already seen in the training data, and for which it knows the
potential tags. Second, the ‘unknown words’ tagger is invoked to tag words not seen
during training. Instead of the word itself, it takes into account character-based
features of the word, such as its last three letters and whether it is capitalized
or not [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
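      <p>The division of labor between the two classifiers can be sketched roughly as follows. This is not MBT's actual API or feature set: the function names, the feature dictionary, and the toy classifier are all hypothetical, and only the two character-based features named in the text (last three letters, capitalization) are modeled.</p>
      <preformat>
```python
def unknown_word_features(word):
    """Character-based features for a word not seen during training."""
    return {
        "suffix3": word[-3:],                # last three letters
        "capitalized": word[:1].isupper(),   # initial capital or not
    }


def select_module(word, lexicon):
    """Choose the 'known' or 'unknown' words module for a word."""
    return "known" if word in lexicon else "unknown"


def tag_left_to_right(words, lexicon, classify):
    """Tag a sequence left to right, feeding back the previous decision."""
    tags = []
    for word in words:
        module = select_module(word, lexicon)
        if module == "known":
            features = {"word": word}
        else:
            features = unknown_word_features(word)
        features["previous_tag"] = tags[-1] if tags else "START"
        tags.append(classify(module, features))
    return tags


def toy_classify(module, features):
    """Stand-in for a trained classifier (purely illustrative rules)."""
    if module == "known" or not features["capitalized"]:
        return "O"
    previous = features["previous_tag"]
    if previous[:2] in ("B-", "I-"):
        return "I-" + previous[2:]           # continue the open chunk
    return "B-ENT"                           # open a new chunk


print(tag_left_to_right(["the", "New", "Zork", "rocks"], {"the", "rocks"}, toy_classify))
```
      </preformat>
      <p>In a real tagger the stand-in classifier would be replaced by trained memory-based classifiers; the sketch only shows how the module choice and the previous tag enter each decision.</p>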
      <p>
        Named entity recognition in social media microtexts such as Twitter
messages (tweets) is generally approached with standard methods, but it is also
widely acknowledged that language use in tweets deviates from average written
language use in several respects: it features more spelling and capitalization
variants than usual, and it may mention a larger variety of people, places, and
organizations than, for instance, news text. Most studies report relatively low scores
because of these factors [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4–6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>System Architecture</title>
      <p>
        Each tweet is first enriched with part-of-speech tags and seedlist information; the enriched tweet is then processed by two MBT taggers. The first tagger is
trained on the original training data with all capitalization information intact;
the second tagger is trained on a lowercased version of the training set. The
taggers both assign BIO-tags to the tokens constituting named-entity chunks
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The two MBT modules generate partly overlapping predictions. Only the
named entity chunks that are fully identical in the output of the two modules,
i.e. their intersection, are kept. The result is a tweet annotated with named entity
chunks.</p>
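      <p>The intersection step can be sketched as follows, assuming both taggers emit one BIO tag per token for the same tokenization. The function names and the example tag sequences are assumptions for illustration; the original system's implementation is not shown in the paper.</p>
      <preformat>
```python
def extract_chunks(tags):
    """Return the set of (start, end, label) chunks encoded by BIO tags."""
    chunks = set()
    start = label = None
    # A trailing "O" acts as a sentinel that closes a chunk at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            chunks.add((start, i, label))    # close the open chunk
            start = label = None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]        # open a new chunk
    return chunks


def intersect_predictions(tags_a, tags_b):
    """Keep only chunks with identical boundaries and label in both outputs."""
    return extract_chunks(tags_a).intersection(extract_chunks(tags_b))


# Hypothetical outputs of the capitalized and lowercased taggers.
capitalized = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
lowercased = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]

print(sorted(intersect_predictions(capitalized, lowercased)))
```
      </preformat>
      <p>In this example only the person chunk survives: the two location chunks overlap but have different end points, so neither is kept, which is why the intersection trades recall for precision.</p>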
    </sec>
    <sec id="sec-3">
      <title>Resources</title>
      <p>
        The MBT modules are trained on the official (version 1.5) training data
provided for the MSM-2013 Concept Extraction Challenge,<sup>4</sup> complemented with
the training and testing data of the CoNLL-2003 Shared Task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and the
named-entity annotations in the ACE-2004 and ACE-2005 tasks.<sup>5</sup> The list of
geographical names for the seedlist feature is taken from geonames.org;<sup>6</sup> lists of person
names and organization names are taken from the JRC Names corpus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].<sup>7</sup>
      </p>
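      <p>How the seedlist feature is computed is not spelled out in this section. One plausible reading, sketched below purely as an assumption, is a context-insensitive lookup that marks each token with every category whose name list contains a name covering that token. The function name, the maximum name length, and the tiny example lists are hypothetical.</p>
      <preformat>
```python
def seedlist_features(tokens, seedlists, max_len=3):
    """Return one feature string per token naming the matching seedlists.

    seedlists maps a category to a set of names (space-separated tokens).
    max_len bounds the length of candidate names (an assumed parameter).
    """
    features = [set() for _ in tokens]
    for category, names in seedlists.items():
        for start in range(len(tokens)):
            last = min(start + max_len, len(tokens))
            for end in range(start + 1, last + 1):
                if " ".join(tokens[start:end]) in names:
                    for i in range(start, end):
                        features[i].add(category)
    return ["+".join(sorted(f)) if f else "-" for f in features]


# Tiny illustrative lists, not the geonames.org or JRC Names data.
seedlists = {
    "LOC": {"Paris", "New York"},
    "PER": {"Paris Hilton"},
}
tokens = ["Paris", "Hilton", "visits", "New", "York"]
print(seedlist_features(tokens, seedlists))
```
      </preformat>
      <p>Because the lookup ignores context, the first token here matches both a location and a person name; resolving such ambiguity is left to the tagger.</p>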
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Table 1 displays the overall scores of the final system, the intersection of the
two MBT systems, together with the scores of the two systems separately. A test
was run on a development set of 22,358 tokens containing 1,131 named entities
extracted from the MSM-2013 training set. The capitalized MBT system attains
the best recall, while the lowercased MBT attains the higher precision score. The
intersection of the two predictably boosts precision at the cost of a lower recall,
and attains the highest F-score of 61.21. If the gazetteer features are disabled,
overall precision increases slightly from 65.8 to 66.1, but recall decreases from
57.2 to 54.9, leading to a lower F-score of 60.0. This is a predictable effect of
gazetteers: they allow the recognition of more entities, but they also introduce noise
through the context-insensitive matching of names in incorrect entity categories.</p>
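      <p>The F-scores quoted above can be checked against the reported precision and recall, assuming the usual balanced F-score (the harmonic mean of the two). The function and variable names are illustrative.</p>
      <preformat>
```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the text, in percent.
with_gazetteers = f_score(65.8, 57.2)
without_gazetteers = f_score(66.1, 54.9)

print(round(with_gazetteers, 1), round(without_gazetteers, 1))
```
      </preformat>
      <p>From the rounded inputs this gives about 61.2 with gazetteers and 60.0 without, matching the reported 61.21 and 60.0 up to rounding.</p>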
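      <p>Table 2 lists the precision, recall, and F-scores on the four named entity types
distinguished in the challenge. Person names are recognized more accurately than
location and organization names; the miscellaneous category is hard to recognize.</p>
      <p><sup>4</sup> <ext-link ext-link-type="uri" xlink:href="http://oak.dcs.shef.ac.uk/msm2013/challenge.html">http://oak.dcs.shef.ac.uk/msm2013/challenge.html</ext-link></p>
      <p><sup>5</sup> <ext-link ext-link-type="uri" xlink:href="http://projects.ldc.upenn.edu/ace/">http://projects.ldc.upenn.edu/ace/</ext-link></p>
      <p><sup>6</sup> <ext-link ext-link-type="uri" xlink:href="http://download.geonames.org/export/dump/allCountries.zip">http://download.geonames.org/export/dump/allCountries.zip</ext-link></p>
      <p><sup>7</sup> <ext-link ext-link-type="uri" xlink:href="http://optima.jrc.it/data/entities.gzip">http://optima.jrc.it/data/entities.gzip</ext-link></p>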
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tjong Kim Sang</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veenstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Representing text chunks</article-title>
          .
          <source>In: Proceedings of EACL'99</source>
          , Bergen, Norway (
          <year>1999</year>
          )
          <fpage>173</fpage>
          -
          <lpage>179</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zavrel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berck</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gillis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>MBT: A memory-based part of speech tagger generator</article-title>
          . In: Ejerhed, E., Dagan, I. (eds.):
          <source>Proceedings of the Fourth Workshop on Very Large Corpora</source>
          , ACL SIGDAT (
          <year>1996</year>
          )
          <fpage>14</fpage>
          -
          <lpage>27</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
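          <string-name>
            <surname>Zavrel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van den Bosch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van der Sloot</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :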
          <article-title>MBT: Memory based tagger, version 3.0, reference guide</article-title>
          .
          <source>Technical Report ILK 07-04</source>
          , ILK Research Group, Tilburg University (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , et al.:
          <article-title>Named entity recognition in tweets: an experimental study</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics (
          <year>2011</year>
          )
          <fpage>1524</fpage>
          -
          <lpage>1534</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          :
          <article-title>TwiNER: Named entity recognition in targeted Twitter stream</article-title>
          .
          <source>In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM (
          <year>2012</year>
          )
          <fpage>721</fpage>
          -
          <lpage>730</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition for tweets</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology (TIST)</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ) (
          <year>2013</year>
          )
          <fpage>3</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santorini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcinkiewicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Building a Large Annotated Corpus of English: the Penn Treebank</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>19</volume>
          (
          <issue>2</issue>
          ) (
          <year>1993</year>
          )
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tjong Kim Sang</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meulder</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</article-title>
          . In: Daelemans, W., Osborne, M. (eds.):
          <source>Proceedings of CoNLL-2003</source>
          , Edmonton, Canada (
          <year>2003</year>
          )
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Steinberger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouliquen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kabadjov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belyaeva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Goot</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>JRC-Names: A freely available, highly multilingual named entity resource</article-title>
          .
          <source>In: Proceedings of the 8th International Conference 'Recent Advances in Natural Language Processing'</source>
          (
          <year>2011</year>
          )
          <fpage>104</fpage>
          -
          <lpage>110</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>