=Paper=
{{Paper
|id=Vol-2036/T2-3
|storemode=property
|title=BITS_PILANI@IMRiDis-FIRE 2017:Information Retrieval from Microblog during Disasters
|pdfUrl=https://ceur-ws.org/Vol-2036/T2-3.pdf
|volume=Vol-2036
|authors=Arka Talukdar,Rupal Bhargava,Yashvardhan Sharma
|dblpUrl=https://dblp.org/rec/conf/fire/TalukdarBS17
}}
==BITS_PILANI@IMRiDis-FIRE 2017:Information Retrieval from Microblog during Disasters==
BITS_PILANI@IMRiDis-FIRE 2017: Information Retrieval from Microblogs during Disasters

Arka Talukdar, Rupal Bhargava, Yashvardhan Sharma
WiSoc Lab, Department of Computer Science
Birla Institute of Technology and Science, Pilani Campus, Pilani-333031
{f2015112, rupal.bhargava2, yash3}@pilani.bits-pilani.ac.in

ABSTRACT

Microblogging sites like Twitter are increasingly being used for aiding relief operations during disaster events. In such situations, identifying actionable information such as the needs and availabilities of various types of resources is critical for effective coordination of post-disaster relief operations. However, such critical information is usually submerged in a large volume of conversational content, such as sympathy for the victims of the disaster. Hence, automated IR techniques are needed to find and process such information. In this paper, we utilize word vector embeddings along with the fastText sentence classification algorithm to classify tweets posted during natural disasters.

CCS CONCEPTS

• Information Retrieval → Clustering and Classification;

KEYWORDS

Word embeddings, sentence classification, fastText, Twitter, multilingual text classification

1 INTRODUCTION

This paper describes our approach for the Microblog Track at FIRE 2017 [1]. Microblogging sites like Twitter are important sources of real-time information, and can therefore be used to extract significant information at times of disasters such as floods, earthquakes, and cyclones. The aim of the Microblog track at FIRE 2017 was to develop IR systems that retrieve important information from microblogs posted at the time of disasters. The task involved developing automatic methodologies for identifying need-tweets and availability-tweets. The two classes to be identified were defined as:

(1) Need-tweets: Tweets which inform about the need or requirement of some specific resource such as food, water, medical aid, shelter, mobile or Internet connectivity, etc.

(2) Availability-tweets: Tweets which inform about the availability of some specific resources. This class includes tweets which inform about potential availability, such as resources being transported or dispatched to the disaster-struck area.

We used word embeddings to represent tweets and then the fastText [3] classification algorithm to classify each tweet into its appropriate category. Our system performed considerably well given its robustness and low resource utilization.

2 BACKGROUND / RELATED WORK

Classification of tweets has been tackled in many shapes and forms over the years. Over time, we have seen a shift from one-hot vectors representing words to denser vectors based on word embeddings. Yang et al., for example, show how leveraging word embeddings can improve the classification of tweets to predict election results [7]. Crisis response, in particular, has also been tackled using Twitter data. Imran et al. focus on building a strong word2vec model based on crisis-response tweets and leverage basic linear regression models [2]. Most notably, Zhou et al. show that C-LSTMs, a hybrid approach between CNNs and LSTMs, give significantly improved results over traditional models when classifying text [8][9]. However, neural networks are not an ideal solution here due to their high resource requirements. FastText, on the other hand, gives nearly the same performance at a fraction of the resources.

3 DATA

The training data was a collection of about 20,000 tweets posted during the Nepal earthquake in April 2015, along with the associated metadata for each tweet [1]. The major challenge with the data was that the tweets were multilingual and code-mixed. The tweets were in English, Hindi, and Nepali, and there were very few training examples in Hindi and Nepali, which made training difficult. Another issue was the imbalance in class proportions in the training data, with only a small portion of the tweets belonging to the positive classes. Tweets also have a stringent length limit, and users often make use of innovative abbreviations which are difficult for retrieval systems to handle. Besides, tweets are mostly informal and may involve the use of multiple languages in the same tweet (called code-mixing), or even multiple scripts in a tweet. It is also difficult to make sense of emoticons and informal shorthands, especially innovative ones made up by users.

4 PROPOSED TECHNIQUE

The problem is formulated as a classification task and the objective is to learn a classifier. The proposed methodology involves a pipelined approach divided into four phases:

● Pre-processing of the tweet corpus
● Creating word embeddings
● Training the classifier
● Calibrating the results with Platt scaling

4.1 Preprocessing

The tweet texts were extracted from the associated metadata and pre-processed to ensure uniformity. Pre-processing included removal of emoticons, special characters, numbers, hashtags, punctuation, and words not written in Roman or Devanagari script, as well as converting all Roman characters to lowercase.
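The paper does not publish its code, so the following is a minimal sketch of this cleaning step, assuming Python; the regular expressions, the choice to drop entire hashtag tokens, and the helper name preprocess_tweet are illustrative assumptions, not the authors' exact implementation.

<pre>
import re

# Keep only tokens written entirely in Roman (a-z, after lowercasing) or
# Devanagari (U+0900-U+097F) characters; everything else is dropped.
ROMAN_OR_DEVANAGARI = re.compile(r"^[a-z\u0900-\u097F]+$")

def preprocess_tweet(text: str) -> str:
    text = text.lower()                                  # lowercase Roman characters
    text = re.sub(r"https?://\S+", " ", text)            # strip URLs
    text = re.sub(r"#\w+", " ", text)                    # strip hashtags
    text = re.sub(r"[^a-z\u0900-\u097F\s]", " ", text)   # drop numbers, punctuation, emoticons
    return " ".join(t for t in text.split() if ROMAN_OR_DEVANAGARI.match(t))

print(preprocess_tweet("Need #water and food in Kathmandu!! :( http://t.co/xyz 100"))
# -> "need and food in kathmandu"
</pre>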
4.2 Creating Word Embeddings

Word embeddings were created using the skip-gram and CBOW algorithms with the training tweets as the corpus. The best results were obtained with a window of size 5 for both algorithms. To create the word vectors we used Google's word2vec library.

4.3 Training the model

The fastText classifier is trained on the labelled data and the previously created word embeddings. The learning rate and other hyperparameters are tuned using grid search [4]. FastText composes a sentence vector for each tweet from the vectors of its individual words, and this sentence vector is used to classify the tweet.

4.4 Calibrating the model

The fastText classifier tends to push probabilities to the extremes, so the model is not well calibrated. To calibrate the model and ensure an even distribution of probabilities, Platt scaling is applied [8]. The final scaled probability of each tweet was used to rank the tweets in each category.
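Putting phases 2-4 together, here is a minimal end-to-end sketch under stated assumptions: gensim's Word2Vec stands in for the Google word2vec tool mentioned in 4.2, the tiny corpus and label lists are placeholders for the real tweet data, the __label__ prefix is fastText's standard input convention rather than the authors' file layout, and the lr/epoch values are illustrative rather than the grid-searched ones.

<pre>
import fasttext
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Placeholder data; the real runs used the ~20,000 labelled Nepal-quake tweets.
labelled = [("need water and medicine in gorkha", "__label__need"),
            ("food packets dispatched to sindhupalchowk", "__label__availability"),
            ("praying for everyone affected by the earthquake", "__label__other")]
corpus = [text for text, _ in labelled]

# Phase 2: skip-gram embeddings (sg=1; sg=0 gives CBOW) with window size 5,
# as in the paper; min_count=1 keeps rare Hindi/Nepali tokens (gensim 4.x API).
w2v = Word2Vec([t.split() for t in corpus],
               vector_size=100, window=5, sg=1, min_count=1)
w2v.wv.save_word2vec_format("embeddings.vec")   # text .vec format fastText can read

# Phase 3: supervised fastText seeded with those embeddings; lr and epoch
# stand in for the grid-searched hyper-parameters.
with open("train.txt", "w") as f:
    f.writelines(f"{label} {text}\n" for text, label in labelled)
clf = fasttext.train_supervised(input="train.txt", dim=100, ws=5, lr=0.5,
                                epoch=25, pretrainedVectors="embeddings.vec")

# Phase 4: Platt scaling -- fit a logistic regression mapping the raw fastText
# score for the "need" class to a calibrated probability.
def need_score(text):
    labels, probs = clf.predict(text, k=-1)     # scores for all classes
    return dict(zip(labels, probs)).get("__label__need", 0.0)

# Placeholder calibration set; in practice this should be a held-out split.
dev_y = [1 if label == "__label__need" else 0 for _, label in labelled]
platt = LogisticRegression().fit(
    np.array([[need_score(t)] for t in corpus]), dev_y)

# Rank tweets for the "need" class by calibrated probability.
ranked = sorted(corpus, key=lambda t: -platt.predict_proba([[need_score(t)]])[0, 1])
</pre>

Note that fitting the Platt scaler on the same data the classifier was trained on, as the placeholder above does, would overstate calibration; a held-out split is the standard choice [8].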
5 EVALUATION RESULTS

The results of both runs are summarized in the following tables.

BITS_PILANI_RUN1: skip-gram word embeddings:

Class         Precision@100   Recall@1000   MAP
Need          0.84            0.3466        0.2903
Availability  0.52            0.25          0.1244

BITS_PILANI_RUN2: CBOW word embeddings:

Class         Precision@100   Recall@1000   MAP
Need          0.84            0.281         0.2362
Availability  0.62            0.2459        0.1625

Both runs yielded similar results, with skip-gram performing marginally better than CBOW. The runs achieved the highest precision@100 among all submissions, while giving reasonable recall@1000.

6 CONCLUSION

Information available on social media platforms like Twitter during an emergency has proved immensely useful for crisis response and management. However, analyzing large amounts of social media data poses serious challenges to crisis managers, especially in time-critical situations. In this paper, we presented a method that trains very fast and at low resource cost to effectively monitor big social media crisis data in a timely manner. The proposed model can be improved significantly if its recall is improved. A major cause of the poor recall was the imbalance in the dataset; data augmentation techniques may resolve this issue and can be explored in future work.

REFERENCES

[1] M. Basu, S. Ghosh, K. Ghosh, and M. Choudhury. Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.
[2] M. Imran, P. Mitra, and C. Castillo. Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894, 2016.
[3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146, 2017.
[4] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[5] M. Imran, C. Castillo, J. Lucas, P. Meier, and S. Vieweg. AIDR: Artificial intelligence for disaster response. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 159-162. International World Wide Web Conferences Steering Committee, 2014.
[6] Tweepy. Web. Accessed 22 March 2017.
[7] X. Yang, C. Macdonald, and I. Ounis. Using word embeddings in Twitter election classification. arXiv preprint arXiv:1606.07006, 2016.
[8] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Machine Learning, 68(3):267-276, October 2007.
[9] T. Doshi, E. Marriott, and J. Patel. Detecting Key Needs in Crisis. CS224N Final Project, March 22, 2017.