<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tweet classification using semantic word-embedding with logistic regression</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fawwad Ahmed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Computer and Emerging Sciences, Karachi</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a text classification approach for classifying tweets into two classes, availability/need, based on the content of the tweets. The approach uses a language model for classification based on word embeddings of fixed length to capture the semantic relationships among words. The approach uses logistic regression for the actual classification. The logistic regression measures the relationship between the categorical dependent variable (the tweet label) and a fixed-length word embedding of the tweet content (words), by estimating the probabilities of tweets produced by the embedding words. The regression function is estimated by maximum likelihood estimation of the composition of tweets by these embedding words. The approach produced 84% accurate classification for the two classes on the training set provided for the shared task on "Information Retrieval from Microblogs during Disasters (IRMiDis)", organized as a part of the 9th meeting of the Forum for Information Retrieval Evaluation (FIRE 2017).</p>
      </abstract>
      <kwd-group>
        <kwd>Text classification</kwd>
        <kwd>word embedding</kwd>
        <kwd>logistic regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The proliferation of social media messaging sites enables users to
get real-time information during disaster events. The effective
management of disaster relief operations depends heavily on
identifying the needs and availability of various resources such as food,
medicine and shelter. Given the large number of tweets posted
during such events, there is a growing need for an automatic way to sort them
out and utilize this information effectively. Twitter is a very popular
microblogging platform that generates about 200 million tweets per day.
Users post short texts of up to 140 characters, which can be
viewed by a user's followers and searched via Twitter's
search. Text classification for such short, often multilingual
text is very challenging and poses many problems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A particularly
challenging problem is to classify tweets by analyzing their
content in a disaster scenario, such as a flood or earthquake, in terms
of whether a tweet announces the availability of a resource for relief
or expresses the need for a particular resource at some place. The
shared task on "Information Retrieval from Microblogs during
Disasters (IRMiDis)" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], organized as a part of the 9th meeting of the Forum
for Information Retrieval Evaluation (FIRE 2017), addresses this problem.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>Our approach is divided into three phases. In phase one, we
preprocess the tweets. The dataset is preprocessed by
performing parsing, stop-word removal and stemming using the
Porter algorithm. We select the textual features using the term
frequency-inverse document frequency (tf*idf) weighting scheme.
For multilingual text, we simply use a translation mechanism: all
non-English tweets are translated into equivalent English text
using the Google Translate API. We also filter out URLs and
emoji text from the tweets. In the second phase, we create
a fixed-length word embedding of each term from the tweet,
selected based on tf*idf scores. This phase adds semantic
knowledge to the given instance of the tweet using a neural
network model for word embedding. This is particularly
meaningful for short tweet texts and resolves the issues of
sparsity, contextualization and representation. In the final phase,
we train a logistic regression based classifier. The classification
process through logistic regression measures the relationship
between the categorical dependent variable (the tweet label) and a
fixed-length word embedding of the tweet content (words), by
estimating the probabilities of tweets produced by the embedding
words. The regression function is estimated by maximum
likelihood estimation of the composition of tweets by these
embedding words.</p>
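      <p>To make the three phases concrete, the following is a minimal, self-contained sketch of the pipeline, not our actual implementation: the embed function below is a deterministic stand-in for the neural word-embedding model, and the stop-word list, embedding dimension and example tweets are hypothetical. Each tweet is represented by the average of its word embeddings and classified with logistic regression fitted by gradient ascent on the log-likelihood.</p>
      <preformat>
```python
import math
import random
import zlib

STOP_WORDS = {"a", "an", "the", "is", "of", "for", "at", "to", "and"}
DIM = 8  # fixed embedding length (toy value)

def embed(word):
    """Toy deterministic word vector; the real pipeline uses a trained
    neural word-embedding model instead."""
    rng = random.Random(zlib.crc32(word.encode("utf8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def tweet_vector(text):
    """Represent a tweet as the average of its content-word embeddings."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    if not words:
        return [0.0] * DIM
    vecs = [embed(w) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=1000, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b

# Hypothetical labelled tweets: 1 = availability, 0 = need.
data = [
    ("food packets available at relief camp", 1),
    ("medical camp offers free medicine", 1),
    ("urgent need water and shelter here", 0),
    ("people need food near the school", 0),
]
X = [tweet_vector(t) for t, _ in data]
y = [label for _, label in data]
w, b = train_logistic(X, y)
probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
preds = [1 if p > 0.5 else 0 for p in probs]
```
      </preformat>
      <p>In practice the hand-rolled train_logistic would be replaced by a library implementation such as scikit-learn's LogisticRegression.</p>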
      <p>The training dataset contains 856 tweets, of which 665 are labeled
availability and 191 need. We decided to build the model from a
balanced dataset, so we sampled a subset of the data for training:
about 200 availability and 191 need tweets. We split the data into
two sets, one for training and one for testing the model. Testing of
the model was performed on 192 tweets. The accuracy of the model was
74%, with 85 availability and 65 need tweets identified correctly. In
Table 1, we present the result evaluation on our training set. The
average MAP for the training set is 0.1499. The test set comprises
47K tweets, which we labeled with our model. The results for the test
set are presented in Table 2. The average MAP is 0.0047. We observed
that the output file we submitted for run01 covered 40,298 tweets and
missed 6,702 tweets, because we were not able to process them on
account of emojis and URL text. Hence our precision@100 for need
tweets comes to 0.</p>
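      <p>For reference, the evaluation measures used above can be computed from a ranked list of tweet IDs and a set of relevant IDs; the following sketch uses hypothetical toy data, not the task's actual relevance judgements.</p>
      <preformat>
```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items found in the top k."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Toy example: five ranked tweet IDs, two of which are relevant.
ranked = ["t1", "t2", "t3", "t4", "t5"]
relevant = {"t1", "t4"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))   # (1/1 + 2/4) / 2 = 0.75
```
      </preformat>
      <p>MAP is then the mean of average_precision over all queries, here over the two classes of the task.</p>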
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Result evaluation on the training set</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th/>
              <th>Precision@100</th>
              <th>Recall@1000</th>
              <th>MAP</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Availability-Tweets Evaluation</td>
              <td>0.2900</td>
              <td>0.1430</td>
              <td>0.2165</td>
            </tr>
            <tr>
              <td>Need-Tweets Evaluation</td>
              <td>0.1210</td>
              <td>0.0456</td>
              <td>0.0833</td>
            </tr>
            <tr>
              <td>Average MAP</td>
              <td/>
              <td/>
              <td>0.1499</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tbl2">
        <label>Table 2</label>
        <caption>
          <p>Result evaluation on the test set</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th/>
              <th>Precision@100</th>
              <th>Recall@1000</th>
              <th>MAP</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Availability-Tweets Evaluation</td>
              <td>0.1400</td>
              <td>0.0582</td>
              <td>0.0082</td>
            </tr>
            <tr>
              <td>Need-Tweets Evaluation</td>
              <td>0.0000</td>
              <td>0.0375</td>
              <td>0.0011</td>
            </tr>
            <tr>
              <td>Average MAP</td>
              <td/>
              <td/>
              <td>0.0047</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Some examples from the classification task are presented in Table 3 and Table 4.</p>
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption>
          <p>Availability tweets examples from the results</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Tweet-ID</th>
              <th>Tweet text</th>
              <th>Classifier score</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>592723044302528512</td>
              <td>We all are with Nepal at this time of tragedy sending items to the earthquake victims</td>
              <td>0.793487</td>
            </tr>
            <tr>
              <td>594215027038494723</td>
              <td>We have some mask that we need to send to a company in</td>
              <td>0.842821</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tbl4">
        <label>Table 4</label>
        <caption>
          <p>Need tweets examples from the results</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Tweet-ID</th>
              <th>Tweet text</th>
              <th>Classifier score</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>595022733156618240</td>
              <td>The earthquake Nepal #news Still Needs Five Lacs: The Cooperative Development Ministry distributes about 5 million</td>
              <td>0.907212</td>
            </tr>
            <tr>
              <td>592955066622939136</td>
              <td>I disagree with VHP leaders The world knows Rahul Gandhi is capable of nothing let alone earthquake</td>
              <td>0.774131</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>3. CONCLUSION AND FUTURE WORK</title>
      <p>We propose a simple, enrichment-based, scalable approach for
classifying short tweet text into two classes, availability/need.
It is worth mentioning that our approach complements the
research on enriching short-text representations with word-embedding
based semantic vectors. The proposed approach has great
potential to achieve much better results with further research
on (i) term weighting or smoothing schemes, (ii) feature selection
and (iii) classification methods. Although our initial investigation
and experiments did not produce exceptional results, we are
confident that there are several directions for improving them.</p>
    </sec>
    <sec id="sec-4">
      <title>4. ACKNOWLEDGMENTS</title>
      <p>Our thanks to the IRMiDis track organizers for providing us an
opportunity to work on this interesting problem. We would also like
to thank the computer science department of NUCES-FAST, Karachi
campus.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Batool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Khattak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maqbool</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Precise tweet classification and sentiment analysis</article-title>
          .
          <source>2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</source>
          , Niigata, Japan 2013
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis)</article-title>
          .
          <source>Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India, December 8-10,
          <year>2017</year>
          , CEUR Workshop Proceedings, CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Duy-Tin</given-names>
            <surname>Vo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yue</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Target-Dependent Twitter Sentiment Classification with Rich Automatic Features</article-title>
          <source>IJCAI</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Duyu</given-names>
            <surname>Tang</surname>
          </string-name>
          , et al.
          <article-title>Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification</article-title>
          .
          <source>ACL (1)</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Genkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Madigan</surname>
          </string-name>
          ,
          <article-title>Large-scale Bayesian logistic regression for text categorization</article-title>
          .
          <source>Technometrics</source>
          ,
          <volume>49</volume>
          (
          <issue>3</issue>
          ),
          <fpage>291</fpage>
          -
          <lpage>304</lpage>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>