DataBros@Information Retrieval from Microblogs during Disasters (IRMiDis)
                           Naveen Kumar                                                        Mradul Dubey
                            IIIT Kalyani                                                        IIIT Kalyani
                 A10/137, IIIT Kalyani Boys Hostel                                   A10/137, IIIT Kalyani Boys Hostel
                   Kalyani, West Bengal 741235                                         Kalyani, West Bengal 741235
                     naveen.pwn@gmail.com                                             mraduldubey@iiitkalyni.ac.in

ABSTRACT
Microblogging sites like Twitter are increasingly being used for
aiding relief operations during disaster events. In such situations,
identifying actionable information like needs and availabilities of
various types of resources is critical for effective coordination of
post-disaster relief operations. However, such critical information
is usually submerged within a lot of conversational content, such
as sympathy for the victims of the disaster. Hence, automated IR
techniques are needed to find and process such information [1].

CCS CONCEPTS
• Data Science → Machine Learning; NLP; Tweet Extraction;

KEYWORDS
Data Mining, NLP, Machine Learning

1 INTRODUCTION
In this track, the focus is on two types of tweets:

1.1 Need-tweets:
Tweets which inform about the need or requirement of some specific
resource such as food, water, medical aid, shelter, mobile or
Internet connectivity, etc. Note that tweets which do not directly
specify the need, but point to scarcity or non-availability of some
resources (i.e., a covert expression of the need) are also included
in this category. For instance, the tweet "Mobile phones not working"
is considered a need-tweet, since it informs about the need for
mobile connectivity.

1.2 Availability-tweets:
Tweets which inform about the availability of some specific
resources. This class includes both tweets which inform about
potential availability, such as resources being transported or
dispatched to the disaster-struck area, as well as tweets informing
about the actual availability in the disaster-struck area, such as
food being distributed, etc. Note that a particular tweet may be both
a need-tweet and an availability-tweet if it informs about the need
of some specific resource, as well as the availability of some other
resource.

The track will have two sub-tasks, as described below:

1.3 Sub-task 1: Identifying need-tweets and availability-tweets
Here the participants need to develop automatic methodologies for
identifying need-tweets and availability-tweets. This is mainly a
search problem, where relevant microblogs have to be retrieved.
However, apart from search, the problem of identifying need-tweets
and availability-tweets can also be viewed as a pattern matching
problem, or a classification problem (e.g., where tweets are
classified into three classes: need-tweets, availability-tweets, and
others).

1.4 Sub-task 2: Matching need-tweets and availability-tweets
An availability-tweet is said to match a need-tweet if the
availability-tweet informs about the availability of at least one
resource whose need is indicated in the need-tweet. Table 1 shows
some examples of need-tweets and matching availability-tweets. In
this sub-task, the participants are required to develop methodologies
for matching need-tweets with appropriate availability-tweets. Note
that an availability-tweet is considered to match a need-tweet even
if there is a partial match of the resources, e.g. if the need-tweet
mentions multiple resources and the availability-tweet reports the
availability of a subset of these resources. Also, note that a
need-tweet and a matching availability-tweet can be in different
languages; either or both might be code-switched as well.

2 METHODOLOGIES
2.1 Dataset & Preprocessing
The Python code provided to us was used to crawl both the training
data and the test data. The Twitter data was crawled in JSON format.
It was then converted into a CSV file with the tweet-id, the text,
and the class as attributes. Classes were encoded as 0 for
non-relevant tweets, 1 for need-tweets, and 2 for availability-tweets.

All non-alphabetic characters were removed from the tweets, and the
text was converted to lowercase. Stopwords were also removed, and
stemming was then applied so that different inflected forms of a word
are treated as the same token. After that, the most common words
among the tweets were found and removed, except for a selected set of
words: [medical, need, give, relief, fund, food, donate, aid, water,
meal, send, offer, finance, blood]. Finally, all retweets and
duplicate tweets were removed.

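The preprocessing steps of Section 2.1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the submission: the stopword list is a tiny hypothetical sample, crude_stem is a naive suffix stripper standing in for a real stemmer, retweets are assumed to start with "RT ", and the top_k cutoff for "most common words" is a made-up parameter. Only the keep-list of words comes from the text.

```python
import re
from collections import Counter

# Keep-list from Section 2.1: these words survive frequent-word removal.
KEEP = {"medical", "need", "give", "relief", "fund", "food", "donate",
        "aid", "water", "meal", "send", "offer", "finance", "blood"}

# Assumption: a tiny illustrative stopword list; a real run would use a full one.
STOPWORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "for", "and", "at"}


def crude_stem(word):
    """Stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Keep letters only, lowercase, drop stopwords, then stem."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text).lower()
    return [crude_stem(w) for w in letters_only.split() if w not in STOPWORDS]


def preprocess(tweets, top_k=2):
    """Return token lists with retweets, duplicates and frequent words removed."""
    seen, cleaned = set(), []
    for text in tweets:
        if text.startswith("RT "):  # assumption: retweets carry a leading "RT "
            continue
        tokens = tuple(tokenize(text))
        if tokens and tokens not in seen:  # drop redundant (duplicate) tweets
            seen.add(tokens)
            cleaned.append(list(tokens))
    # Find the top_k most common tokens and remove them unless they are in KEEP.
    counts = Counter(tok for toks in cleaned for tok in toks)
    frequent = {w for w, _ in counts.most_common(top_k)} - KEEP
    return [[t for t in toks if t not in frequent] for toks in cleaned]
```

Calling preprocess on a list of raw tweet texts returns one token list per surviving tweet. The retweet and duplicate filtering is done before counting so that repeated tweets do not inflate the word frequencies; the text does not state the order used.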
2.2 Model
    (1) The first model was a simple Bag-of-Words (BOW) model. It
        builds a vocabulary of features from the tweets and ranks
        the most important features at the top. It gave good
        results, but they were not sufficient.
    (2) TfidfVectorizer was then used to extract the features,
        including both unigrams and bigrams. The maximum number of
        extracted features was limited to 6000. After that,
        Recursive Feature Elimination (RFE) [3] was used with a
        linear SVM as the estimator to select the 1000 most
        informative features. The main purpose of SVM-RFE is to
        compute the ranking weights for all features and sort the
        features according to the weight vector, which serves as
        the classification basis. SVM-RFE is an iterative process
        of backward feature removal. Its steps for feature set
        selection are as follows:
         (a) Use the current dataset to train the classifier.
         (b) Compute the ranking weights for all features.
         (c) Delete the feature with the smallest weight.
        The iteration is repeated until only one feature remains
        in the dataset; the result is a list of features ordered
        by weight. The algorithm removes the feature with the
        smallest ranking weight while retaining the features with
        significant impact. Finally, the features are listed in
        descending order of explanatory difference degree. SVM-
        RFE's selection of feature sets can be mainly divided into
        three steps, namely,
         (a) the input of the datasets to be classified,
         (b) calculation of the weight of each feature, and
         (c) the deletion of the feature of minimum weight to
             obtain the ranking of features.

        After this, the classifier used to classify our data was
        DecisionTreeClassifier [4]. The tweets were ranked
        according to their predicted probability of belonging to
        each class.

4 FUTURE WORK
Lemmatization can be used in place of stemming, which would give
more accurate context. Also, PCA can be used in place of RFE.
Future work can include using a spellchecker to correct misspelled
words and using WordNet to obtain more accurate features, which
would make our model more flexible.

Task-2 can be much improved by determining more accurately what is
needed and in which exact location. A need-tweet can be matched with
the availability-tweet that fulfills most of its demands or, if the
quantity of the needed item is given, with the availability-tweet
that offers that amount. It would also be helpful to match a
need-tweet with the availability-tweet that is geographically
closest. The language barrier can be removed by enabling the model
to understand major languages.

REFERENCES
[1] M. Basu, S. Ghosh, K. Ghosh, and M. Choudhury. 2017. Overview of the FIRE
    2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis).
    In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR
    Workshop Proceedings). CEUR-WS.org.
[2] C. D. Manning. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for
    Some Linguistics? (2011). https://doi.org/pubs/CICLing2011-manning-tagging.pdf
[3] M.-L. Huang, Y.-H. Hung, W. M. Lee, R. K. Li, and B.-R. Jiang. 2014. SVM-RFE
    Based Feature Selection and Taguchi Parameters Optimization for Multiclass SVM
    Classifier. The Scientific World Journal (2014).
    https://doi.org/pmc/articles/PMC4175386/
[4] G. James, D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to
    Statistical Learning. Springer.

2.3 Validation
The data was split into 80% for training and 20% for testing, and
our model achieved 84% accuracy.
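
The model and validation steps of Sections 2.2 and 2.3 could be wired together along the following lines. This is a minimal sketch that assumes scikit-learn (the class names TfidfVectorizer and DecisionTreeClassifier in the text suggest it, but the paper does not name the library). The function names, the step size of RFE, the random_state values and the stratified split are assumptions made for illustration; the n-gram range, the 6000 and 1000 feature limits and the 80/20 split come from the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_model(max_features=6000, n_select=1000, step=1):
    """TF-IDF (unigrams + bigrams) -> SVM-RFE -> decision tree."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=max_features),
        # RFE repeatedly fits the linear SVM and drops the `step`
        # lowest-weight features until n_select features remain.
        RFE(LinearSVC(), n_features_to_select=n_select, step=step),
        DecisionTreeClassifier(random_state=0),
    )


def train_and_rank(texts, labels, target_class, **model_kwargs):
    """Fit on 80% of the data, then score and rank the held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels
    )
    model = build_model(**model_kwargs).fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    # Column of predict_proba that corresponds to the requested class.
    classes = list(model.steps[-1][1].classes_)
    probs = model.predict_proba(X_test)[:, classes.index(target_class)]
    # Rank held-out tweets by their probability of being in target_class.
    ranked = sorted(zip(X_test, probs), key=lambda pair: pair[1], reverse=True)
    return accuracy, ranked
```

With step=1, going from 6000 to 1000 features means training the SVM several thousand times, so a larger step may be preferable in practice. Note also that a fully grown decision tree tends to output probabilities of exactly 0 or 1, which makes the probability-based ranking coarse.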


2.4 Task-2
Need-tweets and availability-tweets were available after the
classification step. POS (part-of-speech) tagging was used to remove
all words other than common nouns from the need-tweets and
availability-tweets; Manning [2] discusses high-accuracy POS
tagging. Then, each word of a need-tweet was searched for in the
availability-tweets. If a word is found, the availability-tweet is
considered to match the need-tweet.
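
The matching rule above can be sketched as a simple set overlap. The sketch assumes that POS tagging has already been done upstream, so each tweet is represented by the set of common nouns it contains. The function name, the dictionary layout and the ordering of candidates by number of shared nouns are illustrative assumptions; the text only states that one shared word is enough for a match.

```python
def match_tweets(need_nouns, avail_nouns):
    """Map each need-tweet id to the availability-tweet ids that match it.

    Both arguments map a tweet id to the set of common nouns kept after
    POS tagging. A match requires at least one shared noun.
    """
    matches = {}
    for need_id, needed in need_nouns.items():
        scored = []
        for avail_id, offered in avail_nouns.items():
            overlap = needed & offered
            if overlap:  # partial matches count, as in the task definition
                scored.append((len(overlap), avail_id))
        # Assumption: list candidates with more shared nouns first.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        matches[need_id] = [avail_id for _, avail_id in scored]
    return matches
```

Because the comparison is on exact word forms, a need-tweet and an availability-tweet written in different languages will not match under this rule.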


3 RESULTS
After the results were declared by the organizers, the following
results were obtained: