DataBros@Information Retrieval from Microblogs during Disasters (IRMiDis)

Naveen Kumar
IIIT Kalyani
A10/137, IIIT Kalyani Boys Hostel
Kalyani, West Bengal 741235
naveen.pwn@gmail.com

Mradul Dubey
IIIT Kalyani
A10/137, IIIT Kalyani Boys Hostel
Kalyani, West Bengal 741235
mraduldubey@iiitkalyni.ac.in

ABSTRACT
Microblogging sites like Twitter are increasingly being used for aiding relief operations during disaster events. In such situations, identifying actionable information like needs and availabilities of various types of resources is critical for effective coordination of post-disaster relief operations. However, such critical information is usually submerged within a lot of conversational content, such as sympathy for the victims of the disaster. Hence, automated IR techniques are needed to find and process such information [1].

CCS CONCEPTS
• Data Science → Machine Learning; NLP; Tweet Extraction;

KEYWORDS
Data Mining, NLP, Machine Learning

1 INTRODUCTION
In this track, the focus is on two types of tweets:

1.1 Need-tweets:
Tweets which inform about the need or requirement of some specific resource such as food, water, medical aid, shelter, mobile or Internet connectivity, etc. Note that tweets which do not directly specify the need, but point to scarcity or non-availability of some resources (i.e., a covert expression of the need) are also included in this category. For instance, the tweet "Mobile phones not working" is considered a need-tweet, since it informs about the need for mobile connectivity.

1.2 Availability-tweets:
Tweets which inform about the availability of some specific resources. This class includes both tweets which inform about potential availability, such as resources being transported or dispatched to the disaster-struck area, and tweets informing about the actual availability in the disaster-struck area, such as food being distributed, etc.

Note that a particular tweet may be both a need-tweet and an availability-tweet if it informs about the need of some specific resource as well as the availability of some other resource.

The track has two sub-tasks, as described below:

1.3 Sub-task 1: Identifying need-tweets and availability-tweets
Here the participants need to develop automatic methodologies for identifying need-tweets and availability-tweets. This is mainly a search problem, where relevant microblogs have to be retrieved. However, apart from search, the problem of identifying need-tweets and availability-tweets can also be viewed as a pattern matching problem, or a classification problem (e.g., where tweets are classified into three classes: need-tweets, availability-tweets, and others).

1.4 Sub-task 2: Matching need-tweets and availability-tweets
An availability-tweet is said to match a need-tweet if the availability-tweet informs about the availability of at least one resource whose need is indicated in the need-tweet. Table 1 shows some examples of need-tweets and matching availability-tweets. In this sub-task, the participants are required to develop methodologies for matching need-tweets with appropriate availability-tweets. Note that an availability-tweet is considered to match a need-tweet even if there is only a partial match of the resources, e.g. if the need-tweet mentions multiple resources and the availability-tweet informs about the availability of a subset of these resources. Also, note that a need-tweet and a matching availability-tweet can be in different languages; either or both might be code-switched as well.

2 METHODOLOGIES

2.1 Dataset & Preprocessing
The Python code provided to us was used to crawl both the training data and the test data. The Twitter data was crawled in JSON format. It was then converted into a CSV file with the tweet-id, the text, and the class as attributes. Classes were encoded as 0 for non-relevant tweets, 1 for need-tweets, and 2 for availability-tweets.

All non-alphabetic characters were removed from the tweets, and the text was converted to lowercase. Stopwords were also removed, and then stemming was applied so that different verb forms of the same word would be treated as the same token. After that, the most common words among the tweets were found. They were removed, except for the following selected words: [medical, need, give, relief, fund, food, donate, aid, water, meal, send, offer, finance, blood]. All retweets and redundant tweets were also removed.

2.2 Model
(1) The first model was a simple Bag-Of-Words (BOW) model. It selects the features from the tweets as a vocabulary and keeps the most important features at the top. It gave reasonable results, but they were not good enough.
(2) TfidfVectorizer was used to collect the features, including both unigrams and bigrams. The maximum number of extracted features was limited to 6000. After that, Recursive Feature Elimination (RFE) [3] was used with LinearSVM as the estimator to select the 1000 most informative features. The main purpose of SVM-RFE is to compute the ranking weights for all features and sort the features according to the weight vectors as the classification basis. SVM-RFE is an iterative process of backward removal of features. Its steps for feature set selection are as follows:
    (a) Use the current dataset to train the classifier.
    (b) Compute the ranking weights for all features.
    (c) Delete the feature with the smallest weight.
Iterating this process until only one feature remains in the dataset yields a list of features in order of weight. The algorithm removes the feature with the smallest ranking weight while retaining the feature variables of significant impact. Finally, the feature variables are listed in descending order of explanatory difference degree. SVM-RFE's selection of feature sets can thus be divided into three main steps, namely,
    (a) the input of the datasets to be classified,
    (b) the calculation of the weight of each feature, and
    (c) the deletion of the feature of minimum weight to obtain the ranking of features.
After this, the classifier used to classify our data was DecisionTreeClassifier [4]. The tweets were ranked according to their predicted probability of belonging to the given class.

2.3 Validation
The data was split into 80% for training and 20% for testing, and our model achieved 84% accuracy on the held-out 20%.

2.4 Task-2
Need-tweets and availability-tweets were available after the classification step. POS (Part-of-Speech) tagging was used to remove all words other than common nouns from the need-tweets and availability-tweets. High-accuracy POS tagging is discussed in [2]. Then each word in a need-tweet was searched for in the availability-tweets. If a match is found, the availability-tweet is said to match the need-tweet.

3 RESULTS
After the results were declared by the organizers, the following results were obtained:

4 FUTURE WORK
Lemmatization can be used in place of stemming, which should preserve context more accurately. Also, PCA can be used in place of RFE. Future work can also include using a spellchecker to correct misspelled words and using WordNet to obtain more accurate features, which would make our model more flexible.

Task-2 can be much improved by identifying more accurately what is needed and in which exact location. A need-tweet can be matched with the availability-tweet that fulfills most of its demands or, if the quantity of the needed resource is given, with an availability-tweet that offers that amount. It would also be helpful to match a need-tweet with the availability-tweet that is geographically closest. The language barrier can be removed by enabling the model to understand the major languages.

REFERENCES
[1] M. Basu, S. Ghosh, K. Ghosh, and M. Choudhury. 2017. Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
[2] C. D. Manning. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? (2011). https://doi.org/pubs/CICLing2011-manning-tagging.pdf
[3] M.-L. Huang, Y.-H. Hung, W. M. Lee, R. K. Li, and B.-R. Jiang. 2014. SVM-RFE Based Feature Selection and Taguchi Parameters Optimization for Multiclass SVM Classifier. The Scientific World Journal (2014). https://doi.org/pmc/articles/PMC4175386/
[4] G. James, D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning. Springer.
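
As a rough illustration of the preprocessing described in Section 2.1, the following is a minimal sketch rather than the code actually used. Only the whitelist of retained words is taken from the paper. The stopword list, the suffix-stripping `naive_stem` function (a stand-in for a real stemmer), the "RT " prefix test for retweets, and the `top_k` cutoff for "most common words" are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real run would use a full one.
STOPWORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "for", "and", "on", "at"}

# Words kept even when they are among the most common ones (list from Section 2.1).
KEEP = {"medical", "need", "give", "relief", "fund", "food", "donate", "aid",
        "water", "meal", "send", "offer", "finance", "blood"}


def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def tokenize(text):
    """Keep letters only, lowercase, drop stopwords, then stem."""
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()
    return [naive_stem(w) for w in text.split() if w not in STOPWORDS]


def preprocess(tweets, top_k=2):
    """Return token lists with retweets, duplicates and very common words removed."""
    seen, docs = set(), []
    for tweet in tweets:
        if tweet.startswith("RT "):  # assumption: retweets carry an "RT " prefix
            continue
        tokens = tuple(tokenize(tweet))
        if tokens and tokens not in seen:  # drop redundant (duplicate) tweets
            seen.add(tokens)
            docs.append(list(tokens))
    counts = Counter(w for doc in docs for w in doc)
    # Remove the top_k most frequent words, except the whitelisted ones.
    too_common = {w for w, _ in counts.most_common(top_k)} - KEEP
    return [[w for w in doc if w not in too_common] for doc in docs]


sample = [
    "RT Need food and water in Kathmandu!!",
    "Need food and water in Kathmandu!!",
    "Need food and water in Kathmandu!!",
    "Distributing food packets at Kathmandu airport",
    "Praying for Kathmandu",
]
cleaned = preprocess(sample)
```

In this toy input the place name is the most frequent token and is dropped, while "food" survives because it is on the whitelist.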
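
The feature extraction, SVM-RFE selection, and ranking steps of Section 2.2 might be wired together as sketched below, assuming scikit-learn is the library behind the `TfidfVectorizer` and `DecisionTreeClassifier` names used in the paper. The function name `rank_tweets`, the use of `LinearSVC` for the linear SVM, the RFE `step` value, and the toy data are assumptions; the 6000 and 1000 feature limits come from the text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def rank_tweets(train_texts, train_labels, test_texts, target_class,
                max_features=6000, n_select=1000, step=1):
    """Rank test tweets by predicted probability of belonging to target_class."""
    # TF-IDF over unigrams and bigrams, capped at max_features terms.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=max_features)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    # SVM-RFE: repeatedly fit a linear SVM and drop the lowest-weight feature(s).
    n_select = min(n_select, X_train.shape[1])
    selector = RFE(LinearSVC(), n_features_to_select=n_select, step=step)
    X_train_sel = selector.fit_transform(X_train, train_labels)
    X_test_sel = selector.transform(X_test)

    # Decision tree on the selected features; rank by class probability.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train_sel, train_labels)
    proba = clf.predict_proba(X_test_sel)
    col = list(clf.classes_).index(target_class)
    order = np.argsort(-proba[:, col], kind="stable")
    return [(int(i), float(proba[i, col])) for i in order]


# Toy data (hypothetical): 0 = non-relevant, 1 = need-tweet, 2 = availability-tweet.
train_texts = ["need water urgently", "need food and shelter",
               "food packets being distributed", "medical team dispatched with aid",
               "praying for the victims", "so sad to hear the news"]
train_labels = [1, 1, 2, 2, 0, 0]
test_texts = ["urgently need blood", "thoughts with everyone"]

ranked = rank_tweets(train_texts, train_labels, test_texts,
                     target_class=1, max_features=50, n_select=5)
```

With `step=1` and several thousand features, RFE refits the SVM once per removed feature, so a larger `step` may be preferable in practice.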
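
The word-overlap matching of Section 2.4 can be sketched as follows, assuming the common nouns of each tweet have already been extracted by a POS tagger and are given as sets. The function name `match_tweets`, the tweet identifiers, and the ordering of matches by overlap size are assumptions for illustration; the paper only states that a shared word constitutes a match.

```python
def match_tweets(need_nouns, avail_nouns):
    """Map each need-tweet id to the availability-tweet ids sharing a noun with it.

    need_nouns / avail_nouns: dicts of tweet id -> set of common nouns.
    Matches are ordered by overlap size (an assumption), ties broken by id.
    """
    matches = {}
    for need_id, nouns in need_nouns.items():
        hits = []
        for avail_id, other in avail_nouns.items():
            overlap = len(nouns & other)
            if overlap > 0:  # at least one shared resource word counts as a match
                hits.append((overlap, avail_id))
        hits.sort(key=lambda h: (-h[0], h[1]))
        matches[need_id] = [avail_id for _, avail_id in hits]
    return matches


# Hypothetical example with already-extracted nouns.
needs = {"n1": {"water", "tent"}, "n2": {"blood"}}
avails = {"a1": {"water", "food"}, "a2": {"tent", "water"}, "a3": {"medicine"}}
result = match_tweets(needs, avails)
```

Because a single shared noun is enough, partial matches are included, which is consistent with the partial-match rule stated in Section 1.4.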