<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AMRITA - CEN@NEEL : Identification and Linking of Twitter Entities</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Barathi Ganesh H B, Abinaya N, Anand Kumar M, Vinayakumar R, Soman K P Centre for Excellence in Computational Engineering and Networking Amrita Vishwa Vidyapeetham</institution>
          ,
          <addr-line>Coimbatore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Short texts such as micro posts are updated every moment. With the global upswing of such micro posts, the need to retrieve information from them has become incumbent. This work focuses on knowledge extraction from micro posts by treating entities as evidence. The extracted entities are linked to their relevant DBpedia sources through featurization, Part Of Speech (POS) tagging, Named Entity Recognition (NER) and Word Sense Disambiguation (WSD). This short paper describes our contribution to the #Microposts2015 NEEL task, experimenting with existing Machine Learning (ML) algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Micro posts are a pool of knowledge with applications in
business analytics, public consensus, opinion mining,
sentiment analysis and author profiling, and are thus indispensable for
Natural Language Processing (NLP) researchers. Owing to the
limited size of micro posts, people use short forms and special
symbols to convey their message, which complicates
processing with traditional NLP tools [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Though a number of tools exist, most rely on
ML algorithms that are more effective for long texts than for
short texts. By providing sufficient features to these
algorithms, the objective can still be achieved. We experimented
on the NEEL task with the available NLP tools to evaluate
their effect on entity recognition by providing the special
features available in tweets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>SELECTION OF ALGORITHMS</title>
    </sec>
    <sec id="sec-3">
      <title>Tokenization</title>
      <p>
        Tokenizing becomes highly challenging in micro posts due
to the absence of lexical richness. Micro posts contain special
symbols (:-), #, @user), abbreviations, short words (lol, omg),
misspelled words, repeated punctuation and unstructured
words (goooood nightttt, helloooo). Hence these micro posts
were fed to a dedicated twitter tokenizer, which accounts for
language identification, a lookup dictionary of names,
spelling correction and special symbols [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for effective
tokenization.
      </p>
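      <p>As a minimal sketch (hypothetical rules, not the actual TwitIE tokenizer), a twitter-aware tokenizer can be built from a prioritized regular expression that keeps URLs, @user mentions, hashtags and emoticons intact instead of splitting them on punctuation:</p>
      <preformat>
```python
import re

# Alternation order matters: multi-character patterns such as URLs,
# mentions, hashtags and emoticons are tried before plain words.
TOKEN_PATTERN = re.compile(r"""
      https?://\S+          # URLs kept as one token
    | @\w+                  # @user mentions
    | \#\w+                 # hashtags
    | [:;=][-o]?[)(DPp]     # common emoticons such as :-) or ;P
    | \w+(?:'\w+)?          # words, optionally with an apostrophe
    | [^\w\s]               # leftover punctuation, one char at a time
""", re.VERBOSE)

def tokenize(tweet):
    """Return the twitter-aware tokens of a tweet, left to right."""
    return TOKEN_PATTERN.findall(tweet)
```
      </preformat>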
    </sec>
    <sec id="sec-4">
      <title>POS Tagger</title>
      <p>
        Due to the conversational nature of micro blogs and their
non-syntactic structure, it becomes difficult to utilize general
algorithms with the traditional POS tags of the Penn Treebank and
Wall Street Journal corpus [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Gimpel et al. used a 25-tag
POS tagset which includes dedicated tags (@user, hashtag,
G, URL, etc.) for twitter and report 90% accuracy on
POS tagging [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The ability to relax independence
assumptions and to overcome the label bias problem makes CRF a
promising supervised algorithm for sequence labeling
applications [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The TwitIE tagger, which utilizes CRF to build its
POS tagging model, was therefore used.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Named Entity Recognizer</title>
      <p>
        CRF and SVM have produced promising outcomes for sequence
labeling tasks, which prompted us to use them for our
experiment. The long-range dependencies captured by CRF can also
address the Word Sense Disambiguation (WSD) problem better
than other graphical models by avoiding label and causal
biasing during the learning phase. Both CRF and SVM allow us to
utilize complicated features without modeling any dependency
between them. SVM is also well suited for sequence labeling,
since learning can be enhanced by incorporating cost models
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These advantages provide flexibility in building
expressive models with the CRFsuite and MALLET tools [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTS AND OBSERVATION</title>
      <p>
        The experiment was conducted on an i7 processor with 8 GB
RAM; the flow of the experiment is shown in Figure 1.
The training dataset consists of 3498 tweets, each with a unique
tweet id. These tweets contain 4016 entities with 7 unique tags,
namely Character, Event, Location, Organization, Person,
Product and Thing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The POS tags for NER are obtained
from the TwitIE tagger after tokenization, which accounts for
the nature of micro posts and provides the input expected
by the POS tagger model. The tags are mapped to BIO
tagging of named entities: considering the entity as a phrase,
the token at the beginning of the phrase is tagged ‘B-(original
tag)’ and tokens inside the phrase are tagged ‘I-(original
tag)’. The feature vector is constructed from the POS tag and
34 additional features such as the root word, word shapes, prefixes
and suffixes of length 1 to 4, length of the token, start and end
of the sentence, and binary features - whether the word contains
uppercase, lowercase, special symbols or punctuation, first-letter
capitalization, combinations of letters with digits,
punctuation and symbols, tokens of length 2 and 4, etc.
After constructing the feature vector for each token in
the training set, a bi-directional window of size 5 is kept,
so that nearby tokens' feature statistics are also observed to
help the WSD. The final windowed training sets are passed
to the CRF and SVM algorithms to produce the NER model.
The development data has 500 tweets along with their ids and
790 entities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The development data was also tokenized,
tagged and feature-extracted in the same way as the training data,
for testing and tuning the model. Model performance
was evaluated by 10-fold cross validation on the training set and
validated against the development data. Accuracy was
computed as the ratio of the number of correctly identified
entities to the total number of entities, and is tabulated in Table
1.
      </p>
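      <p>Two of the steps above can be sketched in a few lines (hypothetical helpers, not the paper's exact code): the mapping of entity phrases to BIO tags, and a handful of the token-level features (prefixes and suffixes of length 1 to 4, length, capitalization and digit checks) that go into the CRF/SVM feature vector:</p>
      <preformat>
```python
def bio_tags(tokens, entities):
    """Map entity phrases to BIO tags. entities is a list of
    (start, end, tag) token spans with end exclusive: the first token
    of a phrase gets 'B-tag', the rest get 'I-tag', all others 'O'."""
    tags = ["O"] * len(tokens)
    for start, end, tag in entities:
        tags[start] = "B-" + tag
        for i in range(start + 1, end):
            tags[i] = "I-" + tag
    return tags

def token_features(token):
    """A small subset of the 34 token-level features."""
    feats = {"word": token.lower(), "length": len(token)}
    for n in range(1, 5):                 # prefixes/suffixes of length 1-4
        feats["prefix%d" % n] = token[:n]
        feats["suffix%d" % n] = token[-n:]
    feats["is_title"] = token.istitle()   # first-letter capitalization
    feats["has_digit"] = any(c.isdigit() for c in token)
    feats["has_upper"] = any(c.isupper() for c in token)
    return feats
```
      </preformat>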
      <p>Accuracy = (number of correctly identified entities) / (total number of entities)</p>
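      <p>In code, the accuracy measure above amounts to the following (the pair representation of an entity is a hypothetical choice):</p>
      <preformat>
```python
def accuracy(predicted_entities, gold_entities):
    """Ratio of correctly identified entities to total gold entities."""
    correct = sum(1 for e in predicted_entities if e in gold_entities)
    return correct / len(gold_entities)
```
      </preformat>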
      <p>
        MALLET incorporates O-LBFGS, which is well suited for
log-linear models but showed reduced performance when
compared to CRFsuite, which employs LBFGS for optimization
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. SVM's low performance can be improved by
increasing the number of features, which will not introduce
overfitting or sparse-matrix problems [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The final entity linking is done by utilizing a lookup
dictionary (DBpedia 2014) and sentence similarity. The
entity's tokens are given to the lookup dictionary, which returns
a few related links. The final link assigned to the entity is
the one with the maximum similarity score between the related links
and the proper nouns in the test tweet. The similarity score is
computed as the dot product between the unigram vectors
of the proper nouns in the test tweet and the unigram vectors of
the related links from the lookup dictionary. An entity without
related links is assigned NIL.</p>
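      <p>The linking step above can be sketched as follows (a simplified model with hypothetical helper names; the candidate label texts stand in for the DBpedia lookup results):</p>
      <preformat>
```python
from collections import Counter

def unigram_vector(text):
    """Bag-of-words count vector over lowercase unigrams."""
    return Counter(text.lower().split())

def link_entity(candidate_links, tweet_proper_nouns):
    """candidate_links maps a DBpedia link to its label text; returns
    the candidate with the highest dot-product similarity against the
    proper nouns of the tweet, or NIL when there are no candidates."""
    if not candidate_links:
        return "NIL"
    tweet_vec = unigram_vector(" ".join(tweet_proper_nouns))
    def score(link):
        link_vec = unigram_vector(candidate_links[link])
        # dot product between the two unigram count vectors
        return sum(tweet_vec[w] * link_vec[w] for w in link_vec)
    return max(candidate_links, key=score)
```
      </preformat>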
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION</title>
      <p>This experimentation concerns sequence labeling for entity
identification from micro posts, extended with DBpedia
resource linking. From Table 1 it is clear that CRF
shows the best performance and paves the way for building a smart
NER model for streaming-data applications. Even though
CRF seems reliable, it is dependent on the features,
which have a direct relation with NER accuracy. The
TwitIE tagger shows promising performance in both the
tokenization and POS tagging phases. The 34 special features
extracted from the tweets improve efficacy by nearly 13%
over the model without the special features. In the
linking part, this work is limited to dot-product similarity,
which could be improved by including semantic similarity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Giuseppe</given-names> <surname>Rizzo</surname></string-name>,
          <string-name><given-names>Amparo Elizabeth</given-names> <surname>Cano Basave</surname></string-name>,
          <string-name><given-names>Bianca</given-names> <surname>Pereira</surname></string-name>
          and
          <string-name><given-names>Andrea</given-names> <surname>Varga</surname></string-name>,
          <article-title>Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge</article-title>,
          <source>In 5th Workshop on Making Sense of Microposts (#Microposts2015)</source>
          , pp.
          <fpage>44</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Rowe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Milan</given-names>
            <surname>Stankovic</surname>
          </string-name>
          and
          <string-name><given-names>Aba-Sah</given-names> <surname>Dadzie</surname></string-name>
          ,
          <source>Proceedings, 5th Workshop on Making Sense of Microposts (#Microposts2015)</source>
          :
          <article-title>Big things come in small packages</article-title>
          ,
          <source>Florence, Italy, 18th of May</source>
          <year>2015</year>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Dlugolinsky</surname>
            <given-names>S</given-names>
          </string-name>
          , Marek Ciglan and
          <string-name>
            <given-names>M</given-names>
            <surname>Laclavik</surname>
          </string-name>
          ,
          <article-title>Evaluation of named entity recognition tools on microposts</article-title>
          ,
          <source>INES</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>202</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bontcheva</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Derczynski</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funk</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greenwood</surname>
            <given-names>M A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            <given-names>D</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Aswani</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <article-title>TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text</article-title>
          ,
          <source>RANLP</source>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>90</lpage>
          ,
          <year>2013</year>
          , September.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Brendan</given-names> <surname>O'Connor</surname></string-name>,
          <string-name><given-names>Michel</given-names> <surname>Krieger</surname></string-name>
          and David Ahn,
          <article-title>TweetMotif: Exploratory Search and Topic Summarization for Twitter</article-title>
          , ICWSM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Finin</surname>
          </string-name>
          , Will Murnane, Anand Karandikar, Nicholas Keller and Justin Martineau,
          <article-title>Annotating named entities in Twitter data with crowdsourcing</article-title>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Gimpel</surname>
          </string-name>
          , et al,
          <article-title>Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments</article-title>
          ,
          <source>HLT'11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>John</given-names> <surname>Lafferty</surname></string-name>,
          <string-name><given-names>Andrew</given-names> <surname>McCallum</surname></string-name>
          and
          <string-name><given-names>Fernando</given-names> <surname>Pereira</surname></string-name>,
          <article-title>Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</article-title>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Chun-Nam John</given-names> <surname>Yu</surname></string-name>
          , Thorsten Joachims, Ron Elber and Jaroslaw Pillardy,
          <article-title>Support vector training of protein alignment models</article-title>
          ,
          <source>in Research in Computational Molecular Biology</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Naoaki</given-names> <surname>Okazaki</surname></string-name>,
          <article-title>CRFsuite: a fast implementation of Conditional Random Fields (CRFs)</article-title>,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Andrew Kachites</given-names> <surname>McCallum</surname></string-name>,
          <article-title>MALLET: A Machine Learning for Language Toolkit</article-title>
          , http://mallet.cs.umass.edu,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Galen</given-names> <surname>Andrew</surname></string-name>
          and
          <string-name><given-names>Jianfeng</given-names> <surname>Gao</surname></string-name>,
          <article-title>Scalable Training of L1-Regularized Log-Linear Models</article-title>,
          <source>ICML</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Jorge</given-names> <surname>Nocedal</surname></string-name>
          ,
          <article-title>Updating Quasi-Newton Matrices with Limited Storage</article-title>
          ,
          <source>Mathematics of Computation</source>
          , Volume
          <volume>35</volume>
          ,
          <issue>151</issue>
          , pp:
          <fpage>773</fpage>
          -
          <lpage>782</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>