=Paper= {{Paper |id=Vol-2036/T2-7 |storemode=property |title=Tweet classification using Semantic Word-Embedding with Logistic Regression |pdfUrl=https://ceur-ws.org/Vol-2036/T2-7.pdf |volume=Vol-2036 |authors=Muhammad Rafi,Saeed Ahmed,Fawwad Ahmed,Fawzan Ahmed |dblpUrl=https://dblp.org/rec/conf/fire/RafiAAA17 }} ==Tweet classification using Semantic Word-Embedding with Logistic Regression== https://ceur-ws.org/Vol-2036/T2-7.pdf
 Tweet classification using semantic word-embedding with
                    logistic regression
           Muhammad Rafi, Saeed Ahmed, Fawwad Ahmed, Fawzan Ahmed
        National University of Computer and Emerging Sciences, Karachi.
                muhammad.rafi@nu.edu.pk, k142142@nu.edu.pk,
                   k142051@nu.edu.pk, k142330@nu.edu.pk


ABSTRACT
This paper presents a text classification approach for classifying tweets into two classes, availability and need, based on their content. The approach uses a language model built on fixed-length word embeddings to capture the semantic relationships among words, and logistic regression for the actual classification. The logistic regression measures the relationship between the categorical dependent variable (the tweet label) and a fixed-length word embedding of the tweet content, by estimating the probability of each tweet as a composition of its embedded words; the regression function is estimated by maximum likelihood. The approach produced 84% accurate classification for the two classes on the training set provided for the shared task on "Information Retrieval from Microblogs during Disasters (IRMiDis)", part of the 9th meeting of the Forum for Information Retrieval Evaluation (FIRE 2017).

Keywords
Text classification, word embedding, logistic regression

     1. INTRODUCTION
The proliferation of social media messaging sites enables users to get real-time information during disaster events. The effective management of disaster relief operations depends heavily on identifying the needs and availability of various resources such as food, medicine and shelter. Given the large number of tweets posted during such events, there is a growing need for an automatic way to sort them out and use this information effectively. Twitter is a very popular microblogging platform that generates about 200 million tweets per day. Users post short texts of up to 140 characters, which can be viewed by their followers and searched via Twitter search. Text classification for such short, often multilingual text is very challenging and poses many problems [1]. A particularly challenging problem is to classify tweets, in the scenario of a disaster such as a flood or an earthquake, by analyzing their content to decide whether a tweet announces the availability of a relief resource or expresses the need for a particular resource at some place. This work addresses the shared task on "Information Retrieval from Microblogs during Disasters (IRMiDis)" [2], part of the 9th meeting of the Forum for Information Retrieval Evaluation (FIRE 2017).

     2. METHODOLOGY
Our approach is divided into three phases. In phase one, we preprocess the tweets: the dataset is parsed, stop words are removed, and terms are stemmed with the Porter algorithm. Textual features are selected using the term frequency-inverse document frequency (tf-idf) weighting scheme. For multilingual text we use a simple translation mechanism: all non-English tweets are translated into equivalent English text with the Google Translate API. We also filter out URLs and emoji text from the tweets. In the second phase, we create a fixed-length word embedding for each term selected from the tweet on the basis of its tf-idf score. This phase adds semantic knowledge to the given tweet through a neural-network word-embedding model, which is particularly helpful for short tweet text and addresses the issues of sparsity, contextualization and representation. In the final phase, we train a logistic-regression-based classifier. The logistic regression measures the relationship between the categorical dependent variable (the tweet label) and the fixed-length word embedding of the tweet content, by estimating the probability of each tweet as a composition of its embedded words; the regression function is estimated by maximum likelihood.

The training dataset contains 856 tweets, of which 665 are labeled availability and 191 need. To build the model from a balanced dataset, we sampled a training subset of about 200 availability and 191 need tweets. We split the data into two sets, one for training and one for testing the model. Testing was performed on 192 tweets; the model's accuracy was 74%, with 85 availability and 65 need tweets identified correctly. Table 1 presents the evaluation results on the training set; the average MAP is 0.1499. The test set comprises 47K tweets, which we labeled with our model; the results are presented in Table 2, with an average MAP of 0.0047. We observed that the output file we submitted for run01 missed 40,298 of the tweets (processing only 6,702), because we were not able to handle these tweets due to emoji and URL text. Hence our MAP value for need tweets at precision@100 comes to 0.
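As an illustration of the phase-one feature weighting, the tf-idf score can be sketched as follows. This is a minimal pure-Python sketch over a toy corpus with a tiny illustrative stop list; the actual pipeline additionally applies Porter stemming and translation, which are omitted here.

```python
import math
import re

# Toy corpus standing in for preprocessed tweets (illustrative only).
tweets = [
    "food packets available at camp",
    "need medicine and shelter urgently",
    "shelter available near river camp",
]

STOP_WORDS = {"at", "and", "near", "the", "a"}  # tiny illustrative stop list

def tokenize(text):
    """Lowercase, strip URLs, keep alphabetic tokens, drop stop words."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOP_WORDS]

docs = [tokenize(t) for t in tweets]

def tf_idf(term, doc, docs):
    """tf * idf, with tf normalized by document length and idf = log(N/df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf
```

A term such as "camp", which occurs in two of the three toy tweets, receives a lower idf (and hence a lower tf-idf score) than a term such as "food" that occurs in only one, which is exactly the discriminative behavior the feature selection relies on.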
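Phases two and three can be sketched together: a tweet is mapped to a fixed-length vector by averaging the vectors of its words, and a logistic-regression classifier is fit by maximum likelihood. The toy 2-dimensional word vectors and the example vocabulary below are hypothetical stand-ins for the neural word embeddings used in the actual system, and plain batch gradient ascent stands in for whatever optimizer was used.

```python
import math

# Hypothetical 2-d word vectors standing in for learned neural embeddings.
EMB = {
    "available": [1.0, 0.1], "distributing": [0.9, 0.2],
    "need": [-0.9, 0.1], "require": [-1.0, 0.3], "urgent": [-0.8, -0.2],
}

def embed(tokens, dim=2):
    """Fixed-length tweet vector: mean of the vectors of its known words."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Maximum-likelihood fit by gradient ascent on the log-likelihood."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = yi - p  # gradient of the log-likelihood w.r.t. the logit
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b

# Label 1 = availability, 0 = need (toy training tweets).
train = [(["available"], 1), (["distributing", "available"], 1),
         (["need", "urgent"], 0), (["require"], 0)]
X = [embed(toks) for toks, _ in train]
y = [lab for _, lab in train]
w, b = train_logreg(X, y)

def predict(tokens):
    """Probability that a tweet belongs to the availability class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, embed(tokens))) + b)
```

Because the classifier operates on averaged embeddings rather than raw terms, a tweet containing only unseen-but-similar words would still land near its class in embedding space, which is the sparsity benefit noted above.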
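The evaluation measures reported below (Precision@100, Recall@1000 and MAP) follow the standard ranked-retrieval definitions; a minimal sketch, over a hypothetical five-tweet ranking, is:

```python
def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k ranked tweets that are relevant."""
    return sum(1 for t in ranked_ids[:k] if t in relevant) / k

def recall_at_k(ranked_ids, relevant, k):
    """Fraction of all relevant tweets that appear in the top k."""
    return sum(1 for t in ranked_ids[:k] if t in relevant) / len(relevant)

def average_precision(ranked_ids, relevant):
    """Mean of the precision values at the rank of each relevant hit."""
    hits, ap = 0, 0.0
    for rank, t in enumerate(ranked_ids, start=1):
        if t in relevant:
            hits += 1
            ap += hits / rank
    return ap / len(relevant) if relevant else 0.0

# Toy example: tweet IDs ranked by classifier score, with a known relevant set.
ranked = ["t1", "t2", "t3", "t4", "t5"]
relevant = {"t1", "t4"}
```

MAP is then the mean of `average_precision` over the query classes (here, availability and need), which is how the "Average MAP" rows in Tables 1 and 2 are obtained.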
                 Table 1: Results on training set

                        Precision@100   Recall@1000   MAP
Availability tweets     0.2900          0.1430        0.2165
Need tweets             0.1210          0.0456        0.0833
Average MAP                                           0.1499

                  Table 2: Results on test set

                        Precision@100   Recall@1000   MAP
Availability tweets     0.1400          0.0582        0.0082
Need tweets             0.0000          0.0375        0.0011
Average MAP                                           0.0047

Some examples from the classification task are presented in Table 3 and Table 4.

       Table 3: Examples of tweets from the availability category

Tweet-ID             Text                                          Classifier
592723044302528512   We all are with Nepal at this time of         0.793487
                     tragedy
594215027038494723   sending items to the earthquakw victims       0.842821
                     We have some mask that we need to send
                     to a company in The earthquake

          Table 4: Examples of tweets from the need category

Tweet-ID             Text                                          Classifier
595022733156618240   Nepal #news Still Needs Five Lacs: The        0.907212
                     Cooperative Development Ministry
                     distributes about 5 million
592955066622939136   I disagree with VHP leaders The world         0.774131
                     knows Rahul Gandhi is capable of nothing
                     let alone earthquake

     3. CONCLUSION AND FUTURE WORK
We propose a simple, enrichment-based, scalable approach for classifying short tweet texts into two classes, availability and need. It is worth mentioning that our approach complements research on enriching short-text representations with word-embedding-based semantic vectors. The proposed approach has great potential to achieve much better results with further research on (i) term weighting or smoothing schemes, (ii) feature selection and (iii) classification methods. Although our initial investigation and experiments did not produce exceptional results, we are confident that there are several directions along which they can be improved.

     4. ACKNOWLEDGMENTS
Our thanks to the IRMiDis track organizers for giving us the opportunity to work on this interesting problem. We would also like to thank the Computer Science department of NUCES-FAST, Karachi campus.

     5. REFERENCES
[1] R. Batool, A. M. Khattak, J. Maqbool and S. Lee. Precise tweet classification and sentiment analysis. In 2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS), Niigata, Japan, 2013.
[2] M. Basu, S. Ghosh, K. Ghosh and M. Choudhury. Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.
[3] D.-T. Vo and Y. Zhang. Target-dependent Twitter sentiment classification with rich automatic features. In IJCAI, 2015.
[4] D. Tang et al. Learning sentiment-specific word embedding for Twitter sentiment classification. In ACL (1), 2014.
[5] A. Genkin, D. D. Lewis and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3), 291-304, 2007.