<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Microposts</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Feature Based Approach to Named Entity Recognition and Linking for Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Souvick Ghosh</string-name>
          <email>souvick.gh@gmail.co</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Promita Maitra</string-name>
          <email>promita.maitra@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dipankar Das</string-name>
          <email>dipankar.dipnil2005@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jadavpur University</institution>
          ,
          <addr-line>Kolkata, West Bengal 700032</addr-line>
          ,
          <country>India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jadavpur University</institution>
          ,
          <addr-line>Kolkata, West Bengal 700032</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>6</volume>
      <fpage>74</fpage>
      <lpage>76</lpage>
      <abstract>
        <p>In this paper, we describe our approach for the Named Entity rEcognition and Linking (NEEL) Challenge at #Microposts2016. The task is to automatically recognize entities and their types in English microposts and link them to the corresponding DBpedia 2015 entries. Mentions for which no resource exists are assigned NIL identifiers instead. The task is challenging because Twitter data is informal in nature, with non-standard spellings, random contractions, and various other kinds of noise. For this task, we developed our system using a hybrid model: we used several existing named entity recognition (NER) systems and combined them with our own classifier to improve the results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In the present-day world, the relevance and importance of
social media platforms are hard to overstate. Microposts
such as tweets are limited in the number of characters; however,
the conciseness of the text says little about its
usefulness. From opinion mining during political campaigns to live
feeds during sports events, from product reviews to vacation
posts, Twitter is almost ubiquitous. Twitter promotes
instant communication. Most celebrities use it to build their
own digital presence, and it also serves as a common forum
where people can rise from obscurity to
prominence by sharing their opinions. Compared with
standard long documents such as blog posts or news
articles, microposts differ in a number of ways. Long articles
are usually well written: they follow a definite structure,
include headings, and obey the rules of English grammar.
Microposts, on the other hand, are short, noisy, and show
little adherence to formal grammar. The presence of
extraneous characters such as hashtags and abbreviations, together
with the lack of structure and context, makes it difficult to
extract relevant information.
(Copyright 2016 held by the author(s)/owner(s); copying permitted
only for private and academic purposes. Published as part of the
#Microposts2016 Workshop proceedings, available online as CEUR
Vol-1691, http://ceur-ws.org/Vol-1691.)
Due to this complexity, existing named entity
recognition (NER) systems do not perform well on
tweet data. In the NEEL challenge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] of #Microposts2016 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
we were required to automatically identify named
entities and their types in Twitter data and link them to
the corresponding URIs in the DBpedia 2015-04 dataset
(http://wiki.dbpedia.org/dbpedia-data-set-2015-04).
Identifying named entities and linking them to an
existing knowledge base enriches the text with contextual
and semantic information. Mentions that could not
be linked to any existing DBpedia resource page were
recognized as NIL mentions. These mentions were clustered
to ensure that the same entity, when it has no
corresponding entry in DBpedia, is referenced by the same
NIL identifier. We developed three systems for the
NEEL challenge, the major difference between them
being the features used in each run. Our system follows a
hybrid approach in which the Stanford Named Entity
Recognizer is used to identify entity mentions. In the next
step, we run the ARK Twitter Part-of-Speech Tagger to
identify the mentions missed in the first pass. We use our own
classifier to detect the type of each mention. Named
entity linking to DBpedia resources is done with Babelfy
(http://babelfy.org/). It
should be noted that we followed a feature-based approach for
the NEEL challenge, combining existing tools
for Named Entity Recognition and Linking. Each of these
tools, the Stanford NER, the ARK Part-of-Speech
Tagger, and Babelfy, is state-of-the-art, and we explored their
strengths and weaknesses in our work.
      </p>
      <p>Our system follows four steps in a pipeline, as shown in
Figure 1: mention detection in two stages, followed by mention
type classification, mention linking, and NIL clustering.</p>
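      <p>The four-step pipeline described above can be sketched as follows. This is a minimal illustration in Python with placeholder logic; the function names and the simplified stage behaviours are assumptions, standing in for the components named in the text (Stanford NER plus ARK tagger, the Random Forest classifier, Babelfy, and NIL clustering).</p>

```python
def detect_mentions(tweet):
    # Stages 1-2 placeholder: the real system runs Stanford NER first,
    # then the ARK tagger for mentions the NER missed. Here we simply
    # treat capitalised tokens as candidate mentions.
    return [tok for tok in tweet.split() if tok[:1].isupper()]

def classify_type(mention):
    # Stage 3 placeholder for the 7-way entity-type classifier.
    return "Person"

def link_mention(mention):
    # Stage 4 placeholder for Babelfy linking; None means unlinked.
    return None

def cluster_nil(mentions):
    # Final-stage placeholder: give each unlinked mention its own NIL id
    # (the real system merges spelling variants of the same entity).
    return {m: "NIL%d" % (i + 1) for i, m in enumerate(mentions)}

def pipeline(tweet):
    mentions = detect_mentions(tweet)
    typed = {m: classify_type(m) for m in mentions}
    links = {m: link_mention(m) for m in mentions}
    nil = cluster_nil([m for m, uri in links.items() if uri is None])
    return typed, links, nil
```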
    </sec>
    <sec id="sec-2">
      <title>2.1 Preprocessing</title>
      <p>From the training data, the mentions referring to the seven
entity types were extracted to form seven bags of words.
Using these initial words as seeds, Wikipedia dumps were
crawled to expand the word sets. These lists represent
potential candidates for named entity mentions.</p>
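      <p>A minimal sketch of this step, assuming the training annotations can be read as (mention, type) pairs; the seed-based Wikipedia expansion is omitted, and only the bag-of-words construction is shown:</p>

```python
from collections import defaultdict

ENTITY_TYPES = {"Thing", "Event", "Character", "Location",
                "Organization", "Person", "Product"}

def build_bags(annotated_mentions):
    """Build one bag of words per entity type from (mention, type) pairs."""
    bags = defaultdict(set)
    for mention, etype in annotated_mentions:
        if etype in ENTITY_TYPES:
            for word in mention.lower().split():
                bags[etype].add(word)
    return bags
```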
    </sec>
    <sec id="sec-3">
      <title>2.2 Detection of Entity Mentions</title>
      <p>In this step, the named entity mentions in the given tweets
are identified using two different approaches.</p>
      <sec id="sec-3-1">
        <title>2.2.1 Using Stanford Named Entity Recognizer</title>
        <p>The Stanford Named Entity Recognizer
(http://nlp.stanford.edu/software/CRF-NER.shtml) was used to
extract named entities. It is a classifier implementing a
linear-chain Conditional Random Field (CRF). We used the 3-class
model to extract named entities belonging to the classes
Location, Person, and Organization. While its recall was very
low, the precision of the Stanford NER was quite good.</p>
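        <p>The 3-class model labels each token PERSON, LOCATION, ORGANIZATION, or O. A small sketch, assuming the tagger's per-token output is available as (token, label) pairs, of grouping adjacent same-label tokens into mention spans:</p>

```python
def tokens_to_mentions(tagged):
    """tagged: list of (token, label) pairs -> list of (mention, label)."""
    mentions, current, label = [], [], "O"
    for tok, lab in list(tagged) + [("", "O")]:  # sentinel flushes last run
        if lab == label and lab != "O":
            current.append(tok)  # extend the current mention
        else:
            if current:
                mentions.append((" ".join(current), label))
            current = [tok] if lab != "O" else []
            label = lab
    return mentions
```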
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2 Using ARK Twitter Part-of-Speech Tagger</title>
        <p>
          The tweets were tokenized and assigned Part of Speech
tags using the ARK Twitter Part-of-Speech Tagger [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We
used the Twitter POS model with its 25-tag tagset. Proper
nouns (NNP and NNPS, tagged ^) and possessive proper
nouns (tagged Z), along with hashtags (tagged #) and at-mentions
(tagged @), were extracted as probable candidates for named entity
mentions. The mentions already identified by the
Stanford NER are not passed to the classification step, as
the Stanford NER has already assigned them a class. The rest of
the mentions are classified by our classifier in the next
step.
        </p>
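        <p>The candidate-extraction rule above can be sketched as follows. The tag symbols ('^' for proper nouns, 'Z' for proper noun plus possessive, '#' for hashtags, '@' for at-mentions) follow the ARK tagset, and the leading # or @ character is stripped from each candidate:</p>

```python
CANDIDATE_TAGS = {"^", "Z", "#", "@"}

def extract_candidates(tagged_tokens):
    """tagged_tokens: (token, ark_tag) pairs -> candidate mention strings."""
    return [tok.lstrip("#@") for tok, tag in tagged_tokens
            if tag in CANDIDATE_TAGS]
```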
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.3 Classification of Entity Types</title>
      <p>
        In the machine learning software WEKA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we used the
following features to form a feature set and used the
Random Forest classifier to generate pruned C4.5 decision
trees for a 7-way classification of the named entity mentions
into Thing, Event, Character, Location, Organization,
Person, and Product, providing the noun entities identified
in the previous steps as input. We checked the accuracy of
various classifiers such as Naive Bayes, k-Nearest
Neighbour, and Support Vector Machine on the training data with
10-fold cross-validation; Random Forest gave the best
results.
      </p>
      <sec id="sec-4-1">
        <title>2.3.1 Features for Run 1</title>
        <p>The features used for Run 1 were as follows:
- Length of the mention string
- If the mention is all capitalized
- If the mention contains mixed case
- If the mention contains digits
- If internal period is present in mention string
- If present in list of Persons
- If present in list of Things
- If present in list of Events
- If present in list of Characters
- If present in list of Locations
- If present in list of Organizations
- If present in list of Products</p>
        <p>The above-mentioned lists are the bags of words
produced from the training data in the pre-processing step.</p>
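        <p>A sketch of the twelve Run 1 features for a single mention string. The bags of words are passed in as a dict mapping type name to a set of lower-cased words; the exact feature encoding (booleans rather than 0/1 counts) is an assumption:</p>

```python
TYPE_NAMES = ["Person", "Thing", "Event", "Character",
              "Location", "Organization", "Product"]

def run1_features(mention, bags):
    words = set(mention.lower().split())
    feats = [
        len(mention),                        # length of the mention string
        mention.isupper(),                   # all capitalised
        not mention.isupper() and not mention.islower(),  # mixed case
        any(c.isdigit() for c in mention),   # contains digits
        "." in mention.strip("."),           # internal period present
    ]
    # membership in each of the seven type lists
    feats += [bool(words & bags.get(t, set())) for t in TYPE_NAMES]
    return feats
```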
      </sec>
      <sec id="sec-4-2">
        <title>2.3.2 Features for Run 2</title>
        <p>We made use of various text-based features and bags of
words in Run 1. In Run 2, we explored various contextual
features in addition to the features of Run 1, combining
ten new features with the previous twelve. The ten
additional features used in Run 2 were as follows:
- Context score for Person entity
- Context score for Location entity
- Context score for Character entity
- Context score for Organization entity
- Context score for Event entity
- Context score for Thing entity
- Context score for Product entity
- Frequency of the Part-of-speech of the mention
- Frequency of the previous Part-of-speech
- Frequency of the next Part-of-speech</p>
        <p>The context score of a particular mention is calculated over a
three-word window around the mention. For each class, we store
the number of occurrences of each word observed in such a
window in the training data. The feature value assigned is the
sum of the frequencies of the words forming that fixed-size
window, which serves as the context score of the mention.</p>
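        <p>A sketch of the context-score computation for one class, interpreted here as a window of three words on either side of the mention (the exact window shape is not specified above). The class_window_counts dict maps a word to its training-time frequency for that class:</p>

```python
def context_score(tokens, mention_idx, class_window_counts, window=3):
    """Sum the class-specific frequencies of words near the mention."""
    lo = max(0, mention_idx - window)
    hi = min(len(tokens), mention_idx + window + 1)
    return sum(class_window_counts.get(tokens[i].lower(), 0)
               for i in range(lo, hi) if i != mention_idx)
```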
        <p>2.3.3 Run 3</p>
        <p>We wanted to apply a feed-forward neural network (also
called a back-propagation network or multilayer
perceptron) to our feature set and see how it performs, as this
kind of artificial neural network is useful for approximating
a function when the complexity of the feature values makes
building such a function by hand almost
impossible. We took the same features as in Run 2 and employed
a feed-forward neural network based regression model with
5 hidden layers.</p>
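        <p>To make the architecture concrete, here is a toy forward pass through a network with 5 hidden layers in plain Python. The layer width, sigmoid activation, and random (untrained) weights are illustrative assumptions; the text above does not report these details:</p>

```python
import math
import random

random.seed(0)  # deterministic illustrative weights

def layer(inputs, n_out):
    # One fully connected layer with random weights and sigmoid activation.
    out = []
    for _ in range(n_out):
        s = sum(random.uniform(-1, 1) * x for x in inputs)
        out.append(1.0 / (1.0 + math.exp(-s)))
    return out

def forward(features, hidden_layers=5, width=8):
    h = features
    for _ in range(hidden_layers):   # the 5 hidden layers
        h = layer(h, width)
    return layer(h, 7)               # one score per entity type
```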
        <p>For the previous two runs, i.e. Run 1 and Run 2, the tags
from the Stanford NER were treated as the primary influence
over our classifier tags, as its accuracy was quite good. For
Run 3, however, we omitted the Stanford NER influence and let
the neural network model alone do the tagging, in order to check
the efficiency of our classifier.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.4 Linking Mentions to DBpedia</title>
      <p>
        We used the Babelfy java API service [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to address the
task of entity linking to DBpedia 2015-04 resources. Babelfy is a
unified, multilingual, graph-based approach to Entity
Linking and Word Sense Disambiguation, based on a loose
identification of candidate meanings coupled with a densest-subgraph
heuristic that selects high-coherence semantic
interpretations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The Babelfy parameters that we tuned
according to our preferences were:
      </p>
      <p>- setAnnotationType was set to identify both concepts and
named entities;
- setMatchingType was set to exact matching;
- setMultiTokenExpression was enabled to identify multi-word
tokens;
- setScoredCandidates was set so that only the top-scored
candidate from the disambiguation list is obtained.</p>
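      <p>The effect of restricting setScoredCandidates to the top candidate can be sketched as follows; the annotation layout (a dict with character offsets and a score) is an assumption for illustration, not the actual Babelfy API types:</p>

```python
def top_candidates(annotations):
    """Keep only the highest-scoring candidate for each character span."""
    best = {}
    for ann in annotations:
        span = (ann["start"], ann["end"])
        if span not in best or ann["score"] > best[span]["score"]:
            best[span] = ann
    return list(best.values())
```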
      <p>The rest of the parameters were kept at their default
values. The named entities identified by both Babelfy and the
ARK tagger were passed to the linking stage. Initially, we
provided the original tweet texts as input to Babelfy. We
observed that the number of named entities and concepts
recognized and linked solely by the Babelfy service was quite
low: the named entity recognition suffered because of the
noisy nature of tweet text. However, the accuracy of the
linked resources was satisfactory. So we modified our
system by altering the tweets slightly: we removed the # and @
symbols and kept only the letters of an already
recognized named entity (tagged by the ARK tagger). After
successfully linking such named entities, we searched for more
entities that were syntactically similar to the previously
known entities. We linked these new entities to the
corresponding DBpedia resources and also obtained their
disambiguation scores.</p>
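      <p>A minimal sketch of the normalisation step described above, applied to a recognised mention before re-submitting it to the linker (only letters are kept):</p>

```python
def normalise_mention(token):
    """Strip #, @ and anything non-alphabetic from a recognised mention."""
    return "".join(c for c in token if c.isalpha())
```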
    </sec>
    <sec id="sec-6">
      <title>2.5 Clustering of NIL Mentions</title>
      <p>The entities that could not be linked to any existing
DBpedia resource are assigned NIL identifiers, so that
each NIL identifier can be reused when multiple mentions in
the text represent the same
entity. We considered only a spelling-based approach
to calculate the similarity between entities: two unlinked
entities are taken to be identical if one of them contains the
other (letters only). In that case, the new entity is assigned
the same NIL identifier as the previous one.</p>
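      <p>A minimal sketch of this spelling-based clustering: an unlinked mention joins an existing NIL cluster when its letters contain, or are contained in, a form already in that cluster; otherwise it starts a new cluster with a fresh identifier:</p>

```python
def letters(s):
    return "".join(c for c in s.lower() if c.isalpha())

def assign_nil_ids(unlinked_mentions):
    clusters = []   # list of (nil_id, list of letter-only forms)
    ids = {}
    for m in unlinked_mentions:
        lm = letters(m)
        for nil_id, forms in clusters:
            if any(lm in f or f in lm for f in forms):
                ids[m] = nil_id          # containment: reuse the NIL id
                forms.append(lm)
                break
        else:
            nil_id = "NIL%d" % (len(clusters) + 1)
            clusters.append((nil_id, [lm]))
            ids[m] = nil_id              # new entity: fresh NIL id
    return ids
```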
    </sec>
    <sec id="sec-7">
      <title>3. RESULTS</title>
      <p>We evaluated our approach on the development set of
100 tweets made available by the organizers. In
Table 1 we report the official metrics for entity
detection, tagging, clustering, and linking. The precision,
recall, and F-scores for the above-mentioned three runs show
that Run 3 produces the best results for the task, with
F-scores of 0.674, 0.380, 0.252, and 0.646 in the categories Strong
Mention Match, Strong Typed Mention Match, Strong Link
Match, and Mention CEAF, respectively.</p>
      <p>While all the runs yield the same scores in the other categories,
in Strong Typed Mention Match we observe a better result
for our feed-forward neural network model. Our systems for
the three runs differ only in the entity-type
classification module, while all other subtasks are identical
across the three runs. This explains the identical scores in the last
two categories, which are mainly the evaluation metrics for
linking and NIL clustering.</p>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSION</title>
      <p>In this paper, we have described our approach for the
#Microposts2016 Named Entity rEcognition and Linking
(NEEL) challenge. We developed a hybrid system
using existing named entity recognizers and a
Twitter-specific part-of-speech tagger in conjunction with
a classifier developed by us. Named entity linking
was done mainly with Babelfy, which serves as a
multilingual encyclopedic dictionary and semantic network.
The performance of our system suffered because of
time constraints: the classification module was slightly
biased, and the accuracy of classification suffered as a result.
Identifying and selecting better features would have
improved the results. A disambiguation module to handle
overlapping classes would also have been useful. The accuracy of
the linking would likewise improve with a semantic-similarity
approach using synonym sets for the mentions, or context-word
overlap with those sets during NIL clustering.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Flanigan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith.</surname>
          </string-name>
          <article-title>Part-of-speech tagging for twitter: Annotation, features, and experiments</article-title>
          .
          <source>In ACL 2011</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The weka data mining software: An update</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>11</volume>
          :
          <fpage>231</fpage>
          -
          <lpage>244</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cecconi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <article-title>Multilingual word sense disambiguation and entity linking for everybody</article-title>
          .
          <source>In 13th International Semantic Web Conference, Posters and Demonstrations (ISWC 2014)</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>28</lpage>
          , Riva del Garda, Italy,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <article-title>Entity linking meets word sense disambiguation: a unified approach</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics (TACL)</source>
          , pages
          <fpage>231</fpage>
          -
          <lpage>244</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nadeau</surname>
          </string-name>
          .
          <article-title>Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision</article-title>
          .
          <source>PhD thesis</source>
          , Ottawa-Carleton Institute for Computer Science, School of Information Technology and Engineering,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Preoţiuc-Pietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Radovanović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Cano-Basave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Weller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-S.</given-names>
            <surname>Dadzie</surname>
          </string-name>
          , editors.
          <source>Proceedings, 6th Workshop on Making Sense of Microposts (#Microposts2016): Big things come in small packages</source>
          , Montréal, Canada, 11th of April
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          , Mausam, and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Named entity recognition in tweets: An experimental study</article-title>
          .
          <source>In EMNLP 2011</source>
          , pages
          <fpage>1524</fpage>
          -
          <lpage>1534</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <article-title>Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge</article-title>
          . In Preo¸tiuc-Pietro et al. [
          <volume>6</volume>
          ], pages
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Yosef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>HYENA: Hierarchical type classification for entity names</article-title>
          .
          <source>In Proceedings of COLING 2012: Posters</source>
          , pages
          <fpage>1361</fpage>
          -
          <lpage>1370</lpage>
          , Mumbai,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>