<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Entity Extraction and Linking in #Microposts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Priyanka Sinha</string-name>
          <email>priyanka27.s@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Biswanath Barik</string-name>
          <email>biswanath.barik@tcs.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TCS Innovation Lab Kolkata, Indian Institute of Technology Kharagpur</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TCS Innovation Lab Kolkata, Tata Consultancy Services Limited</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Named Entity Extraction and Linking (NEEL) challenge 2015 [5] is treated as two successive tasks: Named Entity Extraction (NEE) from tweets and Named Entity Linking (NEL) to DBpedia. For the NEE task we use CRF++ [1] to train a sequence labeling model on the given training data. For entity linking, we use DBpedia Spotlight.</p>
      </abstract>
      <kwd-group>
        <kwd>Twitter</kwd>
        <kwd>Entity</kwd>
        <kwd>Linking</kwd>
        <kwd>Social Media</kwd>
        <kwd>DBpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The entity linking problem is well explored on normal text.
However, existing entity linking techniques do not
work well on short messages, because microblogs do not provide
sufficient context to classify (or disambiguate) the mentions.
In this work we identify mentions by building an
entity recognition model on the given training data and link
them to DBpedia using DBpedia Spotlight.</p>
      <p>The rest of the paper is organized as follows: Section 2
describes our proposed approach, which includes data
preparation and feature selection for named entity recognition model
creation, and the entity linking method. Section 3 describes the
setup for web access. The result of our work is discussed in
Section 4. Section 5 illustrates the future scope of our work,
followed by the references.</p>
      <p>Copyright © 2015 held by author(s)/owner(s); copying permitted
only for private and academic purposes.
Published as part of the #Microposts2015 Workshop proceedings,
available online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395).</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>
        In our approach we have divided the Named Entity
Extraction and Linking (NEEL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] task into two consecutive
subtasks, namely, Named Entity Extraction and Named Entity
Linking.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Named Entity Extraction</title>
      <p>The NER task is viewed here as a sequence labeling
problem. Given an input tweet, this step aims to identify the
word sequences that constitute a Named Entity and classify
each such entity into one of the predefined classes. For the
entity recognition and classification task, we have developed a
model on the given training data using Conditional Random
Fields (CRFs), an undirected graphical model used
mainly for sequence labeling.</p>
      <p>
        As discussed in the previous section, tweets provide little
context and are sometimes noisy and informal;
thus, their syntactic structures are not always comparable to
those of normal text. Ritter et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] showed that Part-of-Speech (POS)
features of surface tokens, shallow parsing (chunking)
information, capitalization indicators, etc. are useful for
improving NE recognition from tweets, provided these modules
are trained on Twitter data. In this experiment, we
have added POS tag information to the training data
using the CMU Twitter NLP package [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and trained the NE recognition model on word features
together with binary features that flag punctuation, digits, dots,
hashtags, @, capitalization, the presence of URLs, underscores,
hyphens, etc. We were motivated to use [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as it
tokenizes tweets well and distinguishes nouns from
punctuation and tweet-related artefacts. We used [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
as it was relatively simple to adapt to our task.
      </p>
      <sec id="sec-3-1">
        <title>2.1.1 Data Preparation</title>
        <p>
          In the data preparation step, we have identified the word
sequences referring to a Named Entity (NE) in the training
data using the gold standard. The training data is
tokenized, part-of-speech (POS) tagged using the CMU Twitter NLP package [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
and converted into “BIO” format. For example, the NEs
in tweet ID 100678378755067904, “RT
@HadleyFreeman: NOTHING on US news networks about
London riots. Can you imagine the BBC ignoring, say, riots
in NYC? #americanewsfail”, are identified and tagged in this way.
        </p>
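        <p>The conversion to BIO format described above can be sketched as follows. This is a minimal Python illustration, not the Perl scripts used in this work: the whitespace tokenizer is a stand-in for the tweet tokenizer, and the function names, character offsets and entity types in the example are hypothetical.</p>
        <preformat><![CDATA[
```python
import re


def tokenize(text):
    """Split on whitespace, keeping (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def to_bio(text, mentions):
    """Tag each token with B-<type>, I-<type> or O.

    mentions: list of (start, end, entity_type) character spans, end exclusive.
    """
    rows = []
    for token, start, end in tokenize(text):
        tag = "O"
        for m_start, m_end, m_type in mentions:
            if start < m_end and end > m_start:  # token overlaps the mention
                prefix = "B-" if start <= m_start else "I-"
                tag = prefix + m_type
                break
        rows.append((token, tag))
    return rows


# Hypothetical gold spans and types, for illustration only.
text = "NOTHING on US news networks about London riots"
mentions = [(11, 13, "Location"), (34, 46, "Event")]
for token, tag in to_bio(text, mentions):
    print(f"{token}\t{tag}")
```
]]></preformat>
        <p>In a CRF++ training file, each token is written on its own line with its feature columns, and the BIO label goes in the last column.</p>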
      </sec>
      <sec id="sec-3-2">
        <title>2.1.2 Feature Selection</title>
        <p>We have experimented with various feature types, context
window lengths and their combinations, and arrived at
the following feature set, which gave us good results. Among
the context window lengths we tried, a window of size five
worked well.</p>
        <p>• Contextual (Word) Features: a context window of size
five: W<sub>i−2</sub> W<sub>i−1</sub> W<sub>i</sub> W<sub>i+1</sub> W<sub>i+2</sub>
• Part-of-Speech (POS) Features: a context window of size five:
P<sub>i−2</sub> P<sub>i−1</sub> P<sub>i</sub> P<sub>i+1</sub> P<sub>i+2</sub>
• Word having Capitalization: binary feature
• Word having Punctuation: binary feature
• Is a Digit: binary feature
• Word having a Dot: binary feature
• Word having hashtag: binary feature</p>
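        <p>As an illustration of the feature set listed above, the following Python sketch computes the features for a single token. It is not how CRF++ is configured (CRF++ reads features through a template file); the function name, feature keys, padding symbol and the POS tags in the example are assumptions made for this sketch.</p>
        <preformat><![CDATA[
```python
import string

PAD = "<PAD>"  # assumed placeholder for positions outside the tweet


def token_features(words, pos_tags, i):
    """Return the window and binary features for the token at index i."""

    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else PAD

    word = words[i]
    feats = {}
    # Context window of size five over words and POS tags: i-2 .. i+2.
    for off in (-2, -1, 0, 1, 2):
        feats[f"W[{off:+d}]"] = at(words, i + off)
        feats[f"P[{off:+d}]"] = at(pos_tags, i + off)
    # Binary indicator features.
    feats["has_cap"] = any(c.isupper() for c in word)
    feats["has_punct"] = any(c in string.punctuation for c in word)
    feats["is_digit"] = word.isdigit()
    feats["has_dot"] = "." in word
    feats["has_hashtag"] = "#" in word
    return feats


# Illustrative tokens and POS tags, not taken from the training data.
words = ["riots", "in", "NYC", "?", "#americanewsfail"]
pos_tags = ["N", "P", "^", ",", "#"]
print(token_features(words, pos_tags, 2))
```
]]></preformat>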
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Named Entity Linking</title>
      <p>For linking, we use the annotations returned by DBpedia
Spotlight REST API as the candidates and look for the
longest matching surface forms.</p>
      <p>We take the output of the NEE task and collect the
extracted named entities and their categories. To identify
the correct start position of a mention, we check for # and @. For each tweet,
we use the B/I tags to find the longest run of consecutive tagged tokens
that makes up a single entity. For example, in the tweet
above, “London riots” would be treated as a single entity. For
each tweet, the DBpedia Spotlight REST API is accessed with
confidence and support set to 0 and XML as the accepted
response format. We use DBpedia Spotlight’s annotate endpoint
to obtain all the links at once. For each entity returned
by DBpedia Spotlight, if its surface form is
a substring of any of the extracted named entities,
the corresponding URI is returned. For a named entity
for which no match is found, if it is an existing nil entity then
its nil id is returned; otherwise the nil counter is incremented and
the new nil id is returned.</p>
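      <p>The linking procedure of this section can be sketched in Python as follows. This is a simplified reading of the steps, not the Perl implementation used in this work: the names group_mentions and Linker, the NIL identifier format and the candidate URIs in the usage example are hypothetical, and ties between candidates are broken by taking the longest matching surface form.</p>
      <preformat><![CDATA[
```python
def group_mentions(tagged_tokens):
    """Merge each B- token and the I- tokens that follow it into one mention."""
    mentions, current, current_type = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-") or (tag.startswith("I-") and not current):
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # an O tag closes any open mention
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        mentions.append((" ".join(current), current_type))
    return mentions


class Linker:
    """Link mentions to candidate URIs, falling back to NIL identifiers."""

    def __init__(self):
        self.nil_ids = {}  # mention surface -> NIL identifier (assumed format)

    def link(self, mention, candidates):
        """candidates: list of (surface_form, uri) pairs for the tweet."""
        matches = [(surface, uri) for surface, uri in candidates
                   if surface and surface in mention]
        if matches:
            # Prefer the longest surface form contained in the mention.
            return max(matches, key=lambda pair: len(pair[0]))[1]
        if mention not in self.nil_ids:
            self.nil_ids[mention] = f"NIL{len(self.nil_ids) + 1}"
        return self.nil_ids[mention]


tagged = [("about", "O"), ("London", "B-Event"), ("riots", "I-Event"), (".", "O")]
candidates = [("London", "http://dbpedia.org/resource/London")]  # illustrative
linker = Linker()
for surface, entity_type in group_mentions(tagged):
    print(surface, entity_type, linker.link(surface, candidates))
```
]]></preformat>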
    </sec>
    <sec id="sec-5">
      <title>3. SETUP</title>
      <p>
        We used Perl for transforming the data, the CMU
Twitter NLP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] package for generating POS tags, the CRF++ [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
package for training the NE recognition model and the DBpedia Spotlight [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] REST API for linking.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Web access</title>
      <p>
        Our REST API is implemented in JSP. It calls Perl scripts, which
in turn use curl to connect to the DBpedia Spotlight [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] REST
endpoints.
      </p>
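      <p>For illustration, a request to the annotate endpoint with the settings given in Section 2.2 (confidence and support set to 0, XML response) might be built and parsed as in the Python sketch below, which uses the standard library instead of curl. The endpoint URL is an assumption, since the paper does not state which URL was used. The response layout (Resource elements with surfaceForm and URI attributes) and the function names are likewise assumptions of this sketch. No network call is made here.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URL; not taken from the paper.
SPOTLIGHT_ANNOTATE_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_annotate_request(tweet_text, confidence=0.0, support=0):
    """Build (but do not send) a GET request that asks for an XML response."""
    query = urlencode({"text": tweet_text,
                       "confidence": confidence,
                       "support": support})
    return Request(f"{SPOTLIGHT_ANNOTATE_URL}?{query}",
                   headers={"Accept": "text/xml"})


def parse_candidates(xml_text):
    """Extract (surface_form, uri) pairs from an annotate response."""
    root = ET.fromstring(xml_text)
    return [(res.get("surfaceForm"), res.get("URI"))
            for res in root.iter("Resource")]


# Hand-written sample in the assumed response layout, for illustration only.
SAMPLE_RESPONSE = (
    '<Annotation text="London riots">'
    '<Resources>'
    '<Resource URI="http://dbpedia.org/resource/London" '
    'surfaceForm="London" offset="0"/>'
    '</Resources>'
    '</Annotation>'
)

request = build_annotate_request("London riots #news")
print(request.full_url)
print(parse_candidates(SAMPLE_RESPONSE))
```
]]></preformat>
      <p>Sending the request, for example with urllib.request.urlopen, would return the XML that parse_candidates consumes.</p>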
    </sec>
    <sec id="sec-7">
      <title>4. EVALUATION</title>
      <p>On the training set itself, the precision for strong link match
is 30.49%, recall is 30.29% and F1 is 30.39%. For
tagging of the correct entity type, the precision on the training
set itself is 82.89%, recall is 82.35% and F1 is 82.62%.</p>
      <p>On the development set, the precision for strong link match
is 14.82%, recall is 7.97% and F1 is 10.37%. For tagging
of the correct entity type, the precision on the development set
is 41.65%, recall is 22.41% and F1 is 29.14%.</p>
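      <p>The reported F1 values are consistent with the reported precision and recall under the usual definition F1 = 2PR / (P + R). The short Python check below recomputes them; the function name and the variable names are only for this sketch, and the tuples follow the order in which the figures are reported above.</p>
      <preformat><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, in the order reported above.
reported = [
    (30.49, 30.29, 30.39),
    (82.89, 82.35, 82.62),
    (14.82, 7.97, 10.37),
    (41.65, 22.41, 29.14),
]

for precision, recall, f1 in reported:
    recomputed = f1_score(precision, recall)
    print(f"P={precision:.2f} R={recall:.2f} "
          f"F1 recomputed={recomputed:.2f} reported={f1:.2f}")
```
]]></preformat>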
    </sec>
    <sec id="sec-8">
      <title>5. FUTURE WORK</title>
      <p>
        As we can see, using the CMU POS tagger [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and CRF++ [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
discovers the entities well, but the way we do linking needs
more work.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>CRF++: Yet another CRF toolkit</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Daiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hokamp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          .
          <article-title>Improving efficiency and accuracy in multilingual entity extraction</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Owoputi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Improved part-of-speech tagging for online conversational text with word clusters</article-title>
          .
          <source>In Proceedings of NAACL</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          , Mausam, and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Named entity recognition in tweets: An experimental study</article-title>
          .
          <source>In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1524</fpage>
          -
          <lpage>1534</lpage>
          , Edinburgh, Scotland, UK, July
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E. Cano</given-names>
            <surname>Basave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Varga</surname>
          </string-name>
          .
          <article-title>Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge</article-title>
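          . In
          <string-name>
            <given-names>M.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stankovic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-S.</given-names>
            <surname>Dadzie</surname>
          </string-name>
          , editors,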
          <source>5th Workshop on Making Sense of Microposts (#Microposts2015)</source>
          , pages
          <fpage>44</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>