<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Extraction from Social Media using Machine Learning Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sombuddha Choudhury</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Somnath Banerjee</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sudip Kumar Naskar</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sivaji Bandyopadhyay</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Jadavpur University</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>2</label>
          <institution>Universitat Politècnica de València (UPV)</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>103</fpage>
      <lpage>106</lpage>
      <abstract>
        <p>In this work, we describe an automatic entity extraction system for social media content in English, developed for our participation in the shared task on Entity Extraction from Social Media Text in Indian Languages (ESM-IL) organized by the Forum for Information Retrieval Evaluation (FIRE) in 2015. Our method uses simple features such as a window of words, capitalization, dictionary presence, part-of-speech tags, hashtags, etc. The performance of the system has been evaluated against the test set released in the FIRE 2015 shared task on ESM-IL. Experimental results show encouraging performance in terms of precision, recall and F-measure.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity extraction</kwd>
        <kwd>named entity</kwd>
        <kwd>social media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Named entities (NEs) refer to specific concepts that are not listed in grammars or lexicons. Automatic identification and classification of NEs benefit text processing because of their significant presence in text documents. Named entity recognition is the task of locating NEs in a text and classifying them into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, etc. NE recognition is important in many NLP applications such as machine translation, question answering, automatic summarization and information extraction. At the same time, with the advent of smartphones, more people are using social media such as Twitter and Facebook to comment on people, products, services, organizations, governments, etc. NE recognition on various kinds of social media data, such as websites, blogs, tweets, emails, chats and social media posts, has therefore gained significance recently [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>3. DATA</title>
      <p>In this section, we describe the dataset provided to the shared task participants. We were provided with two sets of data, a training set and a test set. The training set comprised two different files: one contained a collection of tweets along with their tweet IDs and user IDs; the other was an annotation file that contained the named entities and their tags for the tweets in the raw file. The annotation file consisted of 6 tab-separated columns: &lt;Tweet ID, User ID, NETAG, NE, Index, Length&gt;.</p>
      <p>For example: Tweet ID: 123456789012345678, User ID: 1234567890, NETAG: ORGANIZATION, NE: SonyTV, Index: 43, Length: 6.</p>
      <p>The training corpus consists of 5941 tweets and 23483 unique tokens. The different NEs provided in the training annotation file and their corresponding counts are shown in Table 1. The test set contains 9595 tweets and 39464 unique tokens.</p>
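      <p>The annotation format above can be read with a short routine like the following (a minimal sketch; the function and field names are ours, not part of the task definition):</p>

```python
# Sketch: parse one tab-separated annotation line with the six columns
# described in the task data: Tweet ID, User ID, NETAG, NE, Index, Length.
def parse_annotation(line):
    tweet_id, user_id, ne_tag, ne, index, length = line.rstrip("\n").split("\t")
    return {
        "tweet_id": tweet_id,
        "user_id": user_id,
        "tag": ne_tag,          # e.g. ORGANIZATION
        "ne": ne,               # surface form, e.g. SonyTV
        "index": int(index),    # character offset of the NE in the tweet
        "length": int(length),  # character length of the NE
    }

row = parse_annotation("123456789012345678\t1234567890\tORGANIZATION\tSonyTV\t43\t6")
print(row["tag"], row["index"], row["length"])  # ORGANIZATION 43 6
```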
    </sec>
    <sec id="sec-4">
      <title>4. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-5">
      <title>4.1 Pre-processing</title>
      <p>
        From the raw training file, we first separated the tweet text from the user IDs, as the latter were redundant. The tweet IDs were preserved, however, because each tweet has a unique tweet ID and the IDs serve as keys to the tweet text. We likewise removed all URLs and hyperlinks from the tweet text. We then applied a POS tagger, ark-tweet-nlp-0.3.2 (http://www.ark.cs.cmu.edu/TweetNLP/) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], to the resulting file to generate POS tags in CoNLL format. From the annotation file, for each tweet ID we obtain the list of words that are named entities together with their associated NE tags. We scan every word of every tweet and assign that word its corresponding named entity tag, using BIO-style chunking: if a sequence of words belongs to an NE with a particular NE tag, we mark the first word of the entity with the tag B (beginning) and the subsequent words with the tag I (intermediate); words that are not NEs are tagged O (other). For example, given the tweet "Chief Minister Arvind Kejriwal Wishes Luck to Special Olympics Participants" and an annotation entry "NETAG:PERSON NE:Chief Minister Arvind Kejriwal", the tweet is tagged as: "Chief PERSON B, Minister PERSON I, Arvind PERSON I, Kejriwal PERSON I, Wishes O, Luck O, to O, Special O, Olympics O, Participants O". The annotation file has a total of 22 classes/tags; under our encoding the total number of tagged classes becomes 2 × 22 + 1 = 45 (a B and an I tag for each class, plus one for O). The same tagging format was applied to the test file.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4.2 Classification Features</title>
      <p>We have used simple features for classification, which are described in the following subsections.</p>
      <sec id="sec-6-1">
        <title>4.2.1 Window of Words</title>
        <p>
          The unique words in the corpus are mapped to integers, i.e., each unique word is assigned an integer value. Various works [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ][
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] on NER employed the preceding or following words of the target word to determine its category. Therefore, we also employed a window-of-words approach with a window of size 3: the previous word and the next word, along with the target word, are used to build the window.
        </p>
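        <p>The size-3 window can be sketched as follows (a minimal illustration; padding at the tweet boundaries is our assumption, as the paper does not specify boundary handling):</p>

```python
# Sketch: for each token position, emit (previous word, target word, next word).
def word_windows(tokens, pad="<PAD>"):
    padded = [pad] + tokens + [pad]
    return [tuple(padded[i:i + 3]) for i in range(len(tokens))]

print(word_windows(["I", "love", "SonyTV"]))
# [('<PAD>', 'I', 'love'), ('I', 'love', 'SonyTV'), ('love', 'SonyTV', '<PAD>')]
```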
      </sec>
      <sec id="sec-6-2">
        <title>4.2.2 Part of Speech (POS) Tag</title>
        <p>
          The POS of the target word and of the surrounding words can be a useful feature for NER. In particular, the noun tag is very informative because NEs are generally noun phrases. We used a POS tagger [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] specially developed for social media text.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>4.2.3 Capitalization</title>
        <p>Although this feature is not as effective for tweets or other user-generated content in social media, a fairly large number of capitalized words still turn out to be named entities. We therefore included capitalization as a binary feature: Capitalization(word) = 1 if the word starts with a capital letter, and 0 otherwise.</p>
      </sec>
      <sec id="sec-6-3a">
        <title>4.2.5 Hashtag</title>
        <p>Hashtags in tweets frequently mark NEs. Therefore, we included this feature as a binary feature: starts_with_Hashtag(word) = 1 if the word starts with the # character, and 0 otherwise.</p>
      </sec>
      <sec id="sec-6-4">
        <title>4.2.6 At the Rate</title>
        <p>This feature is similar to the previous one and checks for the @ character: starts_with_attherate(word) = 1 if the word starts with @, and 0 otherwise.</p>
      </sec>
      <sec id="sec-6-5">
        <title>4.2.7 Dictionary Word</title>
        <p>This feature checks whether a given word is present in the dictionary. We used the English dictionary provided by PyEnchant (https://pythonhosted.org/pyenchant/), an open-source spell-checking library for Python. The main motivation behind this feature is that words that appear in the dictionary have a fairly low probability of being a named entity. This is again a binary feature: is_in_Dictionary(word) = 1 if the word is in the dictionary, and 0 otherwise.</p>
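      <p>The binary features above can be combined into a single extractor, sketched below (illustrative only; a plain set stands in for the PyEnchant dictionary so that the example is self-contained):</p>

```python
# Sketch of the binary features: capitalization, hashtag, @-mention,
# and dictionary membership. DICTIONARY is a stand-in for PyEnchant.
DICTIONARY = {"i", "really", "want", "the", "luck"}

def binary_features(word):
    return {
        "capitalized": int(word[:1].isupper()),
        "hashtag": int(word.startswith("#")),
        "at_mention": int(word.startswith("@")),
        "in_dictionary": int(word.lower() in DICTIONARY),
    }

print(binary_features("#SonyTV"))
# {'capitalized': 0, 'hashtag': 1, 'at_mention': 0, 'in_dictionary': 0}
```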
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4.3 Classifiers</title>
      <p>In this work, we employed four different classifiers: Naïve Bayes (NB), Conditional Random Fields (CRF), the Margin Infused Relaxed Algorithm (MIRA) and Decision Trees (DT). For Naïve Bayes and Decision Trees we used the WEKA toolkit (http://www.cs.waikato.ac.nz/ml/weka/); for CRF and MIRA we used the open-source CRF++ toolkit (https://taku910.github.io/crfpp/) and miralium (https://code.google.com/p/miralium/), respectively.</p>
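      <p>A sequence labeller such as CRF++ consumes per-token feature rows in a CoNLL-style tab-separated format; a sketch of assembling such a row is shown below (the exact column order is our assumption, and the last column is the gold BIO tag used at training time):</p>

```python
# Sketch: build one tab-separated feature row for a token, of the kind
# consumed by CoNLL-style sequence labellers such as CRF++.
def feature_row(word, pos, tag):
    cols = [
        word,
        pos,
        str(int(word[:1].isupper())),    # capitalization feature
        str(int(word.startswith("#"))),  # hashtag feature
        str(int(word.startswith("@"))),  # @-mention feature
        tag,                             # gold BIO tag (training only)
    ]
    return "\t".join(cols)

print(feature_row("Kejriwal", "NNP", "PERSON_I"))
```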
    </sec>
    <sec id="sec-8">
      <title>4.4 Output Generation</title>
      <p>After the classifiers generated the corresponding NE tags, post-processing was performed to convert the predicted tagged file into the same format as the training annotation file; this is simply the reverse of the pre-processing applied to the training file. For every word tagged with one of the 45 named entity tags, we entered that word and the corresponding tweet ID, user ID, starting index and entity length into the output file. When we encountered a chunk of words (a B tag followed by one or more I tags), we combined those words into a single named entity until we reached a word with an O tag or a B tag, or the end of the tweet. For multi-word NEs, we used the starting index of the first word (the one with the B tag) as the Index entry, and the total length of all the words in the NE (including blank spaces) as its Length. Another important part of the post-processing phase was the proper identification of the tagged tweets, since the tagged file obtained from the classifiers gives no way to identify which tweet a tag belongs to. For this we maintained a line-tweet correspondence, in which the starting word of a tweet has a one-to-one correspondence with the line number at which it appears in the file obtained from the classifier.
</p>
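      <p>The decoding step above can be sketched as follows (an illustrative reimplementation, not the authors' code; computing character offsets from whitespace-joined tokens is our assumption):</p>

```python
# Sketch: collapse B/I-tagged tokens back into
# (entity text, tag, start index, length) records.
def decode_bio(tagged):
    entities, i = [], 0
    full_text = " ".join(w for w, _ in tagged)
    while i < len(tagged):
        word, tag = tagged[i]
        if tag.endswith("_B"):
            label, words = tag[:-2], [word]
            i += 1
            # absorb the following I-tagged words of the same class
            while i < len(tagged) and tagged[i][1] == label + "_I":
                words.append(tagged[i][0])
                i += 1
            text = " ".join(words)
            entities.append((text, label, full_text.find(text), len(text)))
        else:
            i += 1
    return entities

tagged = [("Chief", "PERSON_B"), ("Minister", "PERSON_I"),
          ("Arvind", "PERSON_I"), ("Kejriwal", "PERSON_I"), ("Wishes", "O")]
print(decode_bio(tagged))
# [('Chief Minister Arvind Kejriwal', 'PERSON', 0, 30)]
```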
    </sec>
    <sec id="sec-9">
      <title>5. EXPERIMENT</title>
      <p>This section describes the systematic steps performed to generate the training models using the four different classifiers mentioned in Section 4.3, and then to identify the NEs and their corresponding NE tags in the given test file using the trained models generated from the training files.</p>
    </sec>
    <sec id="sec-10">
      <title>5.1 Training the Classifiers</title>
      <p>We performed pre-processing on the two training files provided to us; a detailed description of the pre-processing is given in Section 4.1. We prepared four models with all of the features (discussed in Section 4.2) using the four classifiers, i.e., NB, DT, CRF and MIRA.</p>
      <p>The models are as follows: Model 1 was generated using the CRF classifier; Model 2 using the MIRA classifier; Model 3 using the J-48 classifier; and Model 4 using the Naïve Bayes classifier.</p>
      <p>The NE tags together with their frequency of occurrence in the training data are shown in Table 1.</p>
      <p>The test file was then made to undergo the same set of operations as in the training phase, i.e., the pre-processing and feature extraction steps, converting the raw test file into a format suitable for evaluation by the generated models. We then ran the test file through each of the four models and generated four test runs: test run 1 using Model 1 (CRF), test run 2 using Model 2 (MIRA), test run 3 using Model 3 (J-48) and test run 4 using Model 4 (Naïve Bayes). Finally, the output format preparation steps described in Section 4.4 were applied to each of the output test runs, converting them into the format specified in the training annotation file.</p>
    </sec>
    <sec id="sec-11">
      <title>6. RESULTS</title>
      <p>We submitted four different runs using the approaches discussed in the previous section. In this section, we discuss the performance of each of our submitted runs and our overall performance in comparison to the other participating teams. Standard precision, recall and F-measure were used for evaluation. The values of these metrics for the different runs that we submitted are shown in Table 2.</p>
      <p>In run 1, run 2, run 3 and run 4, the correctly detected and classified named entities number 11771, 8901, 11016 and 11122, respectively. We obtained a best F-measure of 41.15 for run 2 using the MIRA classifier, which ranked third among all the runs submitted by the participating teams. CRF (run 1) and J-48 (run 3) perform almost on par, while Naïve Bayes (run 4) performs the worst among the four classifiers.</p>
      <p>The training sample contained many words that were not NEs, which may have affected the detection of NEs in the test set, since the entire training was performed on the training set. Some classes of NEs, such as Plants, Sday (Special Day) and Distance, had very few words tagged with them, and thus their appearance in the classified outputs was also fairly low. The J-48 classifier worked quite well at detecting NEs but was unable to properly identify the appropriate NE tag for an entity. The same was true of Naïve Bayes: for example, "Tata's" in "Tata's narrow cars ..." was wrongly classified as Person instead of Organization, and "iPhone" in "I really want the iPhone 6s Rose Gold" was misclassified as Person instead of Artifact. This was mainly caused by the large disparity in the number of the different NEs in the training set and the lack of proper features for fine-grained classification. We also avoided the use of gazetteer lists in this task, which might otherwise have helped us detect some special kinds of NEs. Another drawback of some of our systems was that, for NEs involving more than one word, classifiers like J-48 and Naïve Bayes skipped part of the entity: for example, if a named entity like "Mr. Narendra Modi" was present in a tweet, these classifiers in some instances classified only "Mr. Narendra" as a Person. This error was less frequent for CRF and MIRA, as these classifiers are better suited to sequence labelling. There were also some instances in which a non-NE was misclassified as an NE.</p>
    </sec>
    <sec id="sec-12">
      <title>7. CONCLUSIONS</title>
      <p>In this paper, we have presented a brief overview of our machine learning based systems for automatic NE identification on social media. We observed that the MIRA-based approach provides better results than the systems developed using the CRF, DT and NB classifiers. For our participation in the ESM-IL subtask, we submitted four runs, and the results confirm that the overall accuracy of Run 2 is almost 3% higher than that of the other runs, i.e., Run 1, Run 3 and Run 4.</p>
      <p>As future work, we would like to explore more sophisticated features to handle NE tags and to apply post-processing heuristics to improve the performance of the system. We also plan to incorporate more language-specific features to improve the accuracy of the system.</p>
    </sec>
    <sec id="sec-13">
      <title>8. ACKNOWLEDGMENTS</title>
      <p>We acknowledge the support of the Department of Electronics and Information Technology (DeitY), Government of India, through the project "CLIA System Phase III".</p>
      <p>The research work of the second-last author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie programme, the DIANA-APPLICATIONS project (TIN2012-38603-C02-01) and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ashwini</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Targetable named entity recognition in social media</article-title>
          .
          <source>arXiv preprint arXiv:1408.0782</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Naskar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          .
          <article-title>Bengali named entity recognition using margin infused relaxed algorithm</article-title>
          .
          <source>Text, Speech and Dialogue</source>
          , pages
          <fpage>125</fpage>
          –
          <lpage>132</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dewdney</surname>
          </string-name>
          .
          <article-title>Named entity trends originating from social media</article-title>
          .
          <source>In Workshop on Information Extraction and Entity Analytics on Social Media Data</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>16</lpage>
          .
          <source>COLING 2012</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Derczynski</surname>
          </string-name>
          et al.
          <article-title>Analysis of named entity recognition and linking for tweets</article-title>
          .
          <source>Inf. Process. Manage.</source>
          , pages
          <fpage>32</fpage>
          –
          <lpage>49</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Murnane</surname>
          </string-name>
          .
          <article-title>Improving accuracy of named entity recognition on social media data</article-title>
          .
          <source>Master Thesis</source>
          , University of Maryland,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Owoputi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Improved part-of-speech tagging for online conversational text with word clusters</article-title>
          .
          <source>In NAACL</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Named entity recognition in tweets: An experimental study</article-title>
          .
          <source>In Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1524</fpage>
          –
          <lpage>1534</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chatterji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dantapat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          .
          <article-title>A hybrid approach for named entity recognition in Indian languages</article-title>
          .
          <source>In NERSSEAL-IJCNLP-08</source>
          , pages
          <fpage>17</fpage>
          –
          <lpage>24</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>