=Paper=
{{Paper
|id=Vol-2036/T2-7
|storemode=property
|title=Tweet classification using Semantic Word-Embedding with Logistic Regression
|pdfUrl=https://ceur-ws.org/Vol-2036/T2-7.pdf
|volume=Vol-2036
|authors=Muhammad Rafi,Saeed Ahmed,Fawwad Ahmed,Fawzan Ahmed
|dblpUrl=https://dblp.org/rec/conf/fire/RafiAAA17
}}
==Tweet classification using Semantic Word-Embedding with Logistic Regression==
Muhammad Rafi, Saeed Ahmed, Fawwad Ahmed, Fawzan Ahmed
National University of Computer and Emerging Sciences, Karachi.
muhammad.rafi@nu.edu.pk, k142142@nu.edu.pk, k142051@nu.edu.pk, k142330@nu.edu.pk

ABSTRACT

This paper presents a text classification approach for classifying tweets into two classes, availability and need, based on the content of the tweets. The approach uses a language model built on fixed-length word embeddings to capture the semantic relationships among words, and logistic regression for the actual classification. The logistic regression measures the relationship between the categorical dependent variable (the tweet label) and the fixed-length word embeddings of the tweet content, by estimating the probabilities of tweets produced by the embedded words; the regression function is estimated by maximum likelihood estimation over the composition of tweets from these embedded words. The approach produced 84% accurate classification for the two classes on the training set provided for the shared task "Information Retrieval from Microblogs during Disasters (IRMiDis)", held as a part of the 9th meeting of the Forum for Information Retrieval Evaluation (FIRE 2017).

Keywords

Text classification, word embedding, logistic regression

1. INTRODUCTION

The proliferation of social media messaging sites enables users to get real-time information during disaster events. The effective management of disaster relief operations depends very much on identifying the needs and the availability of various resources such as food, medicine and shelter. Given the large number of tweets posted during such events, there is a growing need for an automatic way to sort them out and use this information effectively. Twitter is a very popular microblogging platform that generates about 200 million tweets per day. Users post short texts of up to 140 characters, which can be viewed by their followers and found via Twitter's search. Text classification for such short, often multilingual text is very challenging and poses many problems [1]. A particularly challenging problem is to classify tweets, by analyzing their content in a disaster scenario such as a flood or an earthquake, according to whether the tweet is about the availability of a relief resource or about the need for a particular resource at some place. This problem is addressed by the shared task "Information Retrieval from Microblogs during Disasters (IRMiDis)" [2], organized as a part of the 9th meeting of the Forum for Information Retrieval Evaluation (FIRE 2017).

2. METHODOLOGY

Our approach is divided into three phases. In phase one, we preprocess the tweets: the dataset is parsed, stop words are removed, and terms are stemmed with the Porter algorithm, after which the textual features are selected using the term frequency-inverse document frequency (tf*idf) weighting scheme. For multilingual text we simply use a translation mechanism: all non-English tweets are translated into equivalent English text using the Google Translate API. We also filter out URLs and emoji text from the tweets. In the second phase, we create a fixed-length word embedding for each term of the tweet selected on the basis of its tf*idf score. This phase adds semantic knowledge to the given instance of the tweet through a neural network model for word embedding, which is particularly meaningful for short tweet texts and addresses the issues of sparsity, contextualization and representation. In the final phase, we train a logistic regression classifier. The logistic regression measures the relationship between the categorical dependent variable (the tweet label) and the fixed-length word embeddings of the tweet content, by estimating the probabilities of tweets produced by the embedded words; the regression function is estimated by maximum likelihood estimation over the composition of tweets from these embedded words.
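The paper does not name the specific neural embedding model, nor how the per-term embeddings are combined into one fixed-length vector per tweet. The sketch below is therefore only an illustration of the three phases under the assumption that word2vec vectors are averaged over a tweet's preprocessed terms; the Google Translate step and the tf*idf-based term selection are omitted, and all function names are our own.

<syntaxhighlight lang="python">
# Minimal sketch of the three-phase pipeline (assumptions: word2vec
# embeddings, tweet vector = mean of its word vectors).
# Requires: nltk.download('stopwords') once before use.
import re
import numpy as np
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(tweet):
    """Phase 1: drop URLs and emojis, tokenize, remove stop words, Porter-stem."""
    tweet = re.sub(r"http\S+", " ", tweet)            # strip URLs
    tweet = tweet.encode("ascii", "ignore").decode()  # crude emoji removal
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP]

def tweet_vector(tokens, w2v, dim=100):
    """Phase 2: fixed-length embedding of one tweet (mean of word vectors)."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train(tweets, labels, dim=100):
    """Phase 3: logistic regression over the tweet embeddings."""
    docs = [preprocess(t) for t in tweets]
    w2v = Word2Vec(sentences=docs, vector_size=dim, window=5, min_count=1)
    X = np.vstack([tweet_vector(d, w2v, dim) for d in docs])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)  # labels: 'availability'/'need'
    return w2v, clf
</syntaxhighlight>

Under this reading, an unseen tweet would be scored with clf.predict_proba(tweet_vector(preprocess(text), w2v).reshape(1, -1)), which would play the role of the classifier scores reported in the example tables further below.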
The training dataset contains 856 tweets, of which 665 are availability tweets and 191 are need tweets. We decided to build the model from a balanced dataset, so we sampled a subset of the data for training of about 200 availability and 191 need tweets, and split the data into two sets for training and testing the model. The testing of the model was performed on 192 tweets; the accuracy of the model was 74%, with 85 availability and 65 need tweets identified correctly. In Table 1 we present the evaluation results on the training set; the average MAP on the training set is 0.1499. The test set comprises about 47K tweets, which we labeled with our model; the results are presented in Table 2, and the average MAP is 0.0047. We observed that the output file we submitted as run01 missed 40298 and 6702 tweets, which we were not able to process because of emoji and URL text; hence our Precision@100 for need tweets comes out to 0.

Table 1: Results on training set
{| class="wikitable"
! !! Precision@100 !! Recall@1000 !! MAP
|-
| Availability-Tweets Evaluation || 0.2900 || 0.1430 || 0.2165
|-
| Need-Tweets Evaluation || 0.1210 || 0.0456 || 0.0833
|-
| Average MAP || colspan="3" | 0.1499
|}

Table 2: Results on test set
{| class="wikitable"
! !! Precision@100 !! Recall@1000 !! MAP
|-
| Availability-Tweets Evaluation || 0.1400 || 0.0582 || 0.0082
|-
| Need-Tweets Evaluation || 0.0000 || 0.0375 || 0.0011
|-
| Average MAP || colspan="3" | 0.0047
|}
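The evaluation in Tables 1 and 2 reports Precision@100, Recall@1000 and MAP over ranked lists of availability and need tweets. A minimal sketch of these standard ranking measures follows, assuming each class's tweets are ranked by the classifier's probability score; the function names and the set-based relevance representation are our own.

<syntaxhighlight lang="python">
# Ranking metrics behind Tables 1 and 2 (sketch).
# ranked_ids: tweet IDs sorted by descending classifier score for one class;
# relevant_ids: the set of ground-truth tweet IDs for that class.
def precision_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for tid in ranked_ids[:k] if tid in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for tid in ranked_ids[:k] if tid in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def average_precision(ranked_ids, relevant_ids):
    hits, total = 0, 0.0
    for rank, tid in enumerate(ranked_ids, start=1):
        if tid in relevant_ids:
            hits += 1
            total += hits / rank   # precision at each relevant hit
    return total / len(relevant_ids) if relevant_ids else 0.0

# The "Average MAP" rows are the mean of the two per-class MAP values
# (availability and need), e.g. (0.2165 + 0.0833) / 2 = 0.1499.
</syntaxhighlight>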
Some examples from the classification task are presented in Table 3 and Table 4.

Table 3: Examples of tweets from the availability category
{| class="wikitable"
! Tweet-ID !! Text !! Classifier
|-
| 592723044302528512 || We all are with Nepal at this time of tragedy || 0.793487
|-
| 594215027038494723 || sending items to the earthquakw victims We have some mask that we need to send to a company in The earthquake || 0.842821
|}

Table 4: Examples of tweets from the need category
{| class="wikitable"
! Tweet-ID !! Text !! Classifier
|-
| 595022733156618240 || Nepal #news Still Needs Five Lacs: The Cooperative Development Ministry distributes about 5 million || 0.907212
|-
| 592955066622939136 || I disagree with VHP leaders The world knows Rahul Gandhi is capable of nothing let alone earthquake || 0.774131
|}

3. CONCLUSION AND FUTURE WORK

We propose a simple, enrichment-based, scalable approach for the classification of short tweet texts into the two classes availability and need. It is worth mentioning that our approach complements the research on enriching short-text representations with word-embedding-based semantic vectors. The proposed approach has great potential to achieve much better results with further research on (i) the term weighting scheme or smoothing, (ii) feature selection, and (iii) the classification method. Although our initial investigation and experiments did not produce exceptional results, we are confident that there are several directions in which they can be improved.

4. ACKNOWLEDGMENTS

Our thanks go to the IRMiDis track organizers for providing us an opportunity to work on this interesting problem. We would also like to thank the Computer Science department of NUCES-FAST, Karachi campus.

5. REFERENCES

[1] R. Batool, A. M. Khattak, J. Maqbool and S. Lee. Precise tweet classification and sentiment analysis. 2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS), Niigata, Japan, 2013.

[2] M. Basu, S. Ghosh, K. Ghosh and M. Choudhury. Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.

[3] D.-T. Vo and Y. Zhang. Target-Dependent Twitter Sentiment Classification with Rich Automatic Features. IJCAI, 2015.

[4] D. Tang et al. Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. ACL (1), 2014.

[5] A. Genkin, D. D. Lewis and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3), 291-304, 2007.