<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tweet classification using semantic word-embedding with logistic regression</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fawwad Ahmed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Computer and Emerging Sciences, Karachi</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a text classification approach for classifying tweets into two classes, availability/need, based on the content of the tweets. The approach uses a language model for classification based on word embeddings of fixed length to capture the semantic relationships among words. The approach uses logistic regression for the actual classification. The logistic regression measures the relationship between the categorical dependent variable (the tweet label) and a fixed-length word embedding of the tweet content (words), by estimating the probabilities of tweets produced by the embedding words. The regression function is estimated by maximum likelihood estimation of the composition of tweets by these embedding words. The approach produced 84% accurate classification for the two classes on the training set provided for the shared task on "Information Retrieval from Microblogs during Disasters (IRMiDis)", organized as a part of the 9th meeting of the Forum for Information Retrieval Evaluation (FIRE 2017).</p>
      </abstract>
      <kwd-group>
        <kwd>Text classification</kwd>
        <kwd>word embedding</kwd>
        <kwd>logistic regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The proliferation of social media messaging sites enables users to
get real-time information during disaster events. The effective
management of disaster relief operations depends heavily on
identifying the needs and availability of various resources such as food,
medicine and shelter. Given the large number of tweets posted
during such events, there is a growing need for an automatic way to sort them
out and utilize this information effectively. Twitter is a very popular
microblogging platform that generates about 200 million tweets per day.
Users post short texts of up to 140 characters, which can be
viewed by a user's followers and searched via Twitter's
search. Text classification for such short, often multilingual
text is very challenging and poses many problems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A particularly
challenging problem is to classify tweets by analyzing their
content in a disaster scenario, such as a flood or earthquake, in terms
of whether a tweet announces the availability of a resource for relief
or expresses the need for a particular resource at some place. The
shared task on "Information Retrieval from Microblogs during
Disasters (IRMiDis)" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], organized as a part of the 9th meeting of the Forum
for Information Retrieval Evaluation (FIRE 2017), addresses this problem.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>Our approach is divided into three phases. In phase one, we
preprocess the tweets. The dataset is preprocessed by
performing parsing, stop-word removal and stemming using the
Porter algorithm. We select the textual features using the term
frequency-inverse document frequency (tf*idf) weighting scheme.
For multilingual text, we simply use a translation mechanism: all
non-English tweets are translated into equivalent English text
using the Google Translate API. We also filter out URLs and
emoji text from the tweets. In the second phase, we create
a fixed-length word embedding of each term from the tweet,
selected based on tf*idf scores. This phase adds semantic
knowledge to the given instance of the tweet using a neural
network model for word embedding. This is particularly
meaningful for short tweet texts and resolves the issues of
sparsity, contextualization and representation. In the final phase,
we train a logistic regression based classifier. The classification
process through logistic regression measures the relationship
between the categorical dependent variable (the tweet label) and a
fixed-length word embedding of the tweet content (words), by
estimating the probabilities of tweets produced by the embedding
words. The regression function is estimated by maximum
likelihood estimation of the composition of tweets by these
embedding words.</p>
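      <p>To make the three phases concrete, the following is a minimal, self-contained sketch of the pipeline, not our actual implementation: the embed function below is a deterministic stand-in for the neural word-embedding model, and the stop-word list, embedding dimension and example tweets are hypothetical. Each tweet is represented by the average of its word embeddings and classified with logistic regression fitted by gradient ascent on the log-likelihood.</p>
      <preformat>
```python
import math
import random
import zlib

STOP_WORDS = {"a", "an", "the", "is", "of", "for", "at", "to", "and"}
DIM = 8  # fixed embedding length (toy value)

def embed(word):
    """Toy deterministic word vector; the real pipeline uses a trained
    neural word-embedding model instead."""
    rng = random.Random(zlib.crc32(word.encode("utf8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def tweet_vector(text):
    """Represent a tweet as the average of its content-word embeddings."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    if not words:
        return [0.0] * DIM
    vecs = [embed(w) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=1000, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b

# Hypothetical labelled tweets: 1 = availability, 0 = need.
data = [
    ("food packets available at relief camp", 1),
    ("medical camp offers free medicine", 1),
    ("urgent need water and shelter here", 0),
    ("people need food near the school", 0),
]
X = [tweet_vector(t) for t, _ in data]
y = [label for _, label in data]
w, b = train_logistic(X, y)
probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
preds = [1 if p > 0.5 else 0 for p in probs]
```
      </preformat>
      <p>In practice the hand-rolled train_logistic would be replaced by a library implementation such as scikit-learn's LogisticRegression.</p>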
      <p>The training dataset contains 856 tweets, of which 665 are labeled
availability and 191 need. We decided to build the model from a
balanced dataset, so we sampled a subset of the data for training:
about 200 availability and 191 need tweets. We split the data into
two sets, one for training and one for testing the model. Testing of
the model was performed on 192 tweets. The accuracy of the model was
74%, with 85 availability and 65 need tweets identified correctly. In
Table 1, we present the result evaluation on our training set. The
average MAP for the training set is 0.1499. The test set comprises
47K tweets, which we labeled with our model. The results for the test
set are presented in Table 2. The average MAP is 0.0047. We observed
that the output file we submitted for run01 covered 40,298 tweets and
missed 6,702 tweets, because we were not able to process them on
account of emojis and URL text. Hence our precision@100 for need
tweets comes to 0.</p>
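      <p>For reference, the evaluation measures used above can be computed from a ranked list of tweet IDs and a set of relevant IDs; the following sketch uses hypothetical toy data, not the task's actual relevance judgements.</p>
      <preformat>
```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items found in the top k."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Toy example: five ranked tweet IDs, two of which are relevant.
ranked = ["t1", "t2", "t3", "t4", "t5"]
relevant = {"t1", "t4"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))   # (1/1 + 2/4) / 2 = 0.75
```
      </preformat>
      <p>MAP is then the mean of average_precision over all queries, here over the two classes of the task.</p>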
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Result evaluation on the training set</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th/>
              <th>Precision@100</th>
              <th>Recall@1000</th>
              <th>MAP</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Availability-Tweets Evaluation</td>
              <td>0.2900</td>
              <td>0.1430</td>
              <td>0.2165</td>
            </tr>
            <tr>
              <td>Need-Tweets Evaluation</td>
              <td>0.1210</td>
              <td>0.0456</td>
              <td>0.0833</td>
            </tr>
            <tr>
              <td>Average MAP</td>
              <td/>
              <td/>
              <td>0.1499</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tbl2">
        <label>Table 2</label>
        <caption>
          <p>Result evaluation on the test set</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th/>
              <th>Precision@100</th>
              <th>Recall@1000</th>
              <th>MAP</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Availability-Tweets Evaluation</td>
              <td>0.1400</td>
              <td>0.0582</td>
              <td>0.0082</td>
            </tr>
            <tr>
              <td>Need-Tweets Evaluation</td>
              <td>0.0000</td>
              <td>0.0375</td>
              <td>0.0011</td>
            </tr>
            <tr>
              <td>Average MAP</td>
              <td/>
              <td/>
              <td>0.0047</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Some examples from the classification task are presented in Table 3 and Table 4.</p>
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption>
          <p>Availability tweets examples from the results</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Tweet-ID</th>
              <th>Tweet text</th>
              <th>Classifier score</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>592723044302528512</td>
              <td>We all are with Nepal at this time of tragedy sending items to the earthquake victims</td>
              <td>0.793487</td>
            </tr>
            <tr>
              <td>594215027038494723</td>
              <td>We have some mask that we need to send to a company in</td>
              <td>0.842821</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tbl4">
        <label>Table 4</label>
        <caption>
          <p>Need tweets examples from the results</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Tweet-ID</th>
              <th>Tweet text</th>
              <th>Classifier score</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>595022733156618240</td>
              <td>The earthquake Nepal #news Still Needs Five Lacs: The Cooperative Development Ministry distributes about 5 million</td>
              <td>0.907212</td>
            </tr>
            <tr>
              <td>592955066622939136</td>
              <td>I disagree with VHP leaders The world knows Rahul Gandhi is capable of nothing let alone earthquake</td>
              <td>0.774131</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>3. CONCLUSION AND FUTURE WORK</title>
      <p>We propose a simple, enrichment-based, scalable approach for
classifying short tweet text into two classes, availability/need.
It is worth mentioning that our approach complements the
research on enriching short-text representations with word-embedding
based semantic vectors. The proposed approach has great
potential to achieve much better results with further research
on (i) term weighting or smoothing schemes, (ii) feature selection
and (iii) classification methods. Although our initial investigation
and experiments did not produce exceptional results, we are
confident that there are several directions for improving them.</p>
    </sec>
    <sec id="sec-4">
      <title>4. ACKNOWLEDGMENTS</title>
      <p>Our thanks to the IRMiDis track organizers for providing us an
opportunity to work on this interesting problem. We would also like
to thank the computer science department of NUCES-FAST, Karachi
campus.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Batool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Khattak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maqbool</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Precise tweet classification and sentiment analysis</article-title>
          .
          <source>2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</source>
          , Niigata, Japan 2013
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis)</article-title>
          .
          <source>Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India, December 8-10,
          <year>2017</year>
          , CEUR Workshop Proceedings, CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Duy-Tin</given-names>
            <surname>Vo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yue</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Target-Dependent Twitter Sentiment Classification with Rich Automatic Features</article-title>
          <source>IJCAI</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Duyu</given-names>
            <surname>Tang</surname>
          </string-name>
          , et al.
          <article-title>Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification</article-title>
          .
          <source>ACL (1)</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Genkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Madigan</surname>
          </string-name>
          ,
          <article-title>Large-scale Bayesian logistic regression for text categorization</article-title>
          .
          <source>Technometrics</source>
          ,
          <volume>49</volume>
          (
          <issue>3</issue>
          ),
          <fpage>291</fpage>
          -
          <lpage>304</lpage>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>