Microblog Retrieval for Disaster Relief: How To Create Ground Truths?

IIT(BHU) Varanasi

ribhav.soni.cse

spal.cse}@iitbhu.ac.in

Microblogging services like Twitter are an important source of real-time information during disasters and can be utilized to aid rescue, relief and rehabilitation efforts. The focus of this work is on the creation of gold standard data for automatic retrieval of helpful tweets. Using various experiments on the gold standard data prepared in the FIRE 2016 Microblog Track [3], we show that the gold standard data prepared in [3] missed many relevant tweets. We also demonstrate that using a machine learning model can help in retrieving the remaining relevant tweets by training an SVM model on a subset of the data and using it to get the most useful tweets in the entire dataset. We obtain high precision and recall even with very little training data, which makes such a model suitable for use in a real-time disaster situation.

Keywords: Crisis Informatics, Disaster, Emergency, Hazards, Microblog Retrieval, Social Media, Text Categorization

1 Introduction

Social media is a very useful resource for obtaining real-time information during disasters. Traditional media such as television and newspapers are of limited use for disaster relief because they update slowly, and they may even be unavailable because of the disaster itself. In such situations, social media provides valuable information to aid in disaster relief and rehabilitation with very little time overhead [1].

Twitter in particular is especially suited for extracting details and first-hand accounts within moments of an event, anywhere in the world [6], and can thus be exploited for help in relief work. However, it also involves the challenge of filtering out information about the crisis situation that is not useful for relief efforts, including tweets expressing shock, condolences, opinion, etc. Some tweets that are not useful for disaster relief efforts are shown in Table 1.

The FIRE 2016 Microblog Track [3] focused on comparing different IR methodologies for retrieval in such scenarios, and led to the creation of a benchmark collection of ground truth data for such tasks. However, based on our experiments, we argue that the ground truth annotation exercise missed up to four times as many tweets as were found. This represents a significant loss of information that could potentially be very useful in a disaster situation. Also, since the accuracy of gold standard data is crucial for evaluation and comparison of retrieval systems, it may lead to weaker systems being ranked above better systems.

Table 1. Tweets that are not useful for disaster relief efforts.

Tweet Text
RT @tarsem insan: @Gurmeetramrahim Guru ji #MSGHelpEarthquakeVictims I m also Shocked!!! hearing #earthquake #MSGHelpEarthquakeVictims
RT @vrinda 90: really sad to hear about d earthquake. praying for all the ppl who suffered & lost their loved ones. hope they get all the h
The Government is so quick to help earthquake victims but why are they so reluctant to our own farmers needs?
Haven't studied anything coz of earthquake and have to go for exam.
RT @guthali2: Imagine Kejriwal were the PM in Nepal Earthquake situation, " Hum kuch nai kar sakte hai jee, army president ke neeche hai".

First, we manually labeled a small, random subset of the data and found that many relevant tweets were missing from the gold standard in [3]. We then trained an SVM model on a subset of the data, and used it to retrieve the 100 tweets with the highest confidence scores under the trained model. We found that, averaged across all topics, fewer than half of the relevant tweets among those were identified in the gold standard in [3].

We also performed bootstrapping on the labeled random subset to estimate the number of relevant tweets in the entire collection, and obtained about 5 times as many relevant tweets as in the gold standard in [3]. Also, we trained our SVM model on small fractions of the training data, and obtained high precision and recall even with very little training data, which shows that such a model can be used effectively in disaster situations with very low time overhead.

The rest of this paper is organized as follows. We first describe the data used in Section 2, then present our experiments and results in Section 3, and conclude with discussion and future work in Section 4.

2 Data

We used the dataset provided by the organizers of the FIRE 2016 Microblog Track [3]. The data was a collection of 50,068 tweets posted during the earthquake in Nepal in 2015 (footnote 1).

Organizations involved in relief work during disasters need specific, actionable information to help in the relief efforts. Thus, a set of seven specific information needs was identified by the authors in [3] after consulting members of such organizations.

The task in [3] involved retrieving tweets relevant to each of these seven information needs, expressed as topics in TREC format. The seven topics are listed in Table 2.

Footnote 1: https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake

Table 2. The seven topics (information needs), in TREC format.

<num> Number: FMT1
<title> What resources were available
<desc> Identify the messages which describe the availability of some resources.
<narr> A relevant message must mention the availability of some resource like food, drinking water, shelter, clothes, blankets, human resources like volunteers, resources to build or support infrastructure, like tents, water filter, power supply and so on. Messages informing the availability of transport vehicles for assisting the resource distribution process would also be relevant. However, generalized statements without reference to any resource or messages asking for donation of money would not be relevant.

<num> Number: FMT2
<title> What resources were required
<desc> Identify the messages which describe the requirement or need of some resources.
<narr> A relevant message must mention the requirement / need of some resource like food, water, shelter, clothes, blankets, human resources like volunteers, resources to build or support infrastructure like tents, water filter, power supply, and so on. A message informing the requirement of transport vehicles assisting resource distribution process would also be relevant. However, generalized statements without reference to any particular resource, or messages asking for donation of money would not be relevant.

<num> Number: FMT3
<title> What medical resources were available
<desc> Identify the messages which give some information about availability of medicines and other medical resources.
<narr> A relevant message must mention the availability of some medical resource like medicines, medical equipments, blood, supplementary food items (e.g., milk for infants), human resources like doctors/staff and resources to build or support medical infrastructure like tents, water filter, power supply, ambulance, etc. Generalized statements without reference to medical resources would not be relevant.

<num> Number: FMT4
<title> What medical resources were required
<desc> Identify the messages which describe the requirement of some medicine or other medical resources.
<narr> A relevant message must mention the requirement of some medical resource like medicines, medical equipments, supplementary food items, blood, human resources like doctors/staff and resources to build or support medical infrastructure like tents, water filter, power supply, ambulance, etc. Generalized statements without reference to medical resources would not be relevant.

<num> Number: FMT5
<title> What were the requirements / availability of resources at specific locations
<desc> Identify the messages which describe the requirement or availability of resources at some particular geographical location.
<narr> A relevant message must mention both the requirement or availability of some resource, (e.g., human resources like volunteers/medical staff, food, water, shelter, medical resources, tents, power supply) as well as a particular geographical location. Messages containing only the requirement / availability of some resource, without mentioning a geographical location would not be relevant.

<num> Number: FMT6
<title> What were the activities of various NGOs / Government organizations
<desc> Identify the messages which describe on-ground activities of different NGOs and Government organizations.
<narr> A relevant message must contain information about relief-related activities of different NGOs and Government organizations in rescue and relief operation. Messages that contain information about the volunteers visiting different geographical locations would also be relevant. However, messages that do not contain the name of any NGO / Government organization would not be relevant.

<num> Number: FMT7
<title> What infrastructure damage and restoration were being reported
<desc> Identify the messages which contain information related to infrastructure damage or restoration.
<narr> A relevant message must mention the damage or restoration of some specific infrastructure resources, such as structures (e.g., dams, houses, mobile tower), communication infrastructure (e.g., roads, runways, railway), electricity, mobile or Internet connectivity, etc. Generalized statements without reference to infrastructure resources would not be relevant.

The gold standard preparation in [3] involved three phases, which can be briefly summarized as follows.

1. Three annotators independently tried to search for relevant tweets using intuitive keywords, after all tweets were indexed using Indri.

2. All tweets identified by at least one of the three annotators in Phase 1 were considered and their relevance annotation finalized by mutual discussion among the annotators.

3. Standard pooling was employed, taking the top 30 results from each run and deciding on their relevance.
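The pooling step in Phase 3 can be illustrated with a minimal sketch. The function name `build_pool` and the toy run data are hypothetical; only the idea of taking the top 30 results from each run comes from the description above, and the organizers' actual tooling is not known to us.

```python
def build_pool(runs, depth=30):
    """Return the union of the top-`depth` tweet ids from every ranked run.

    `runs` maps a (hypothetical) run name to a list of tweet ids,
    ordered from most to least relevant according to that run.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Toy example with two invented runs and a pool depth of 2.
runs = {
    "run_a": ["t1", "t2", "t3", "t4"],
    "run_b": ["t3", "t5", "t1", "t6"],
}
pool = build_pool(runs, depth=2)
```

Only tweets in the pool are judged, so a relevant tweet that no run ranks within the chosen depth is never seen by the annotators.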

The initial collection by the authors of [3] consisted of about 100,000 tweets, and the final dataset of 50,068 tweets was obtained by removing duplicate tweets (tweets with similarity greater than a threshold). The collection still included many tweets that were not duplicates but expressed almost the same information. All such instances were classified as relevant in the annotation exercise.

3 Experiments and Results

3.1 Exhaustive labeling on a small, random subset

A set of 700 tweets was randomly chosen, and relevance was judged for each tweet in the set separately for each of the seven topics. Within the random sample, the number of relevant tweets identified in the gold standard in [3] and those identified by exhaustive labeling are given in Table 3.

As we can see, within the random sample, the number of relevant tweets identified by our exhaustive annotation was about 5 times that identified in the gold standard in [3].

3.2 Bootstrapping to estimate the number of relevant documents in the entire collection

After exhaustively labeling the random sample of 700 tweets, we used bootstrapping [2] to estimate the number of relevant tweets in the whole collection. Bootstrapping is a resampling method that involves random sampling with replacement, so we generated 1000 samples, each of size 700 tweets, from our sample of 700 tweets with replacement. The number of relevant tweets in each sample was computed, and then its average was taken across all 1000 samples. The resulting number of tweets, divided by the sample size, was taken to be an estimate for the fraction of relevant tweets in the entire collection. We thus estimated the number of relevant tweets in the collection of 50,068 tweets to be about 7,520 tweets (i.e., 15.02% of the tweets).
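The resampling procedure described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not our original script: the function name is hypothetical, the 0/1 label vector is invented (105 relevant tweets out of 700, chosen only so that the fraction is close to the reported 15.02%), and the percentile interval is an extra that the text above does not report.

```python
import numpy as np


def bootstrap_relevant_estimate(labels, collection_size, n_resamples=1000, seed=0):
    """Estimate the number of relevant tweets in a collection by bootstrapping.

    `labels` holds one 0/1 relevance judgment per tweet in the labeled sample.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)
    n = labels.size
    # Each row indexes one resample of size n, drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    fractions = labels[idx].mean(axis=1)  # relevant fraction per resample
    low, high = np.percentile(fractions, [2.5, 97.5])
    estimate = fractions.mean() * collection_size
    return estimate, (low * collection_size, high * collection_size)


# Invented sample: 105 relevant and 595 non-relevant judgments.
sample_labels = np.array([1] * 105 + [0] * 595)
estimate, interval = bootstrap_relevant_estimate(sample_labels, collection_size=50068)
```

The average over resamples stays close to the plain sample fraction scaled to the collection size; what the resamples add is a sense of how much the estimate varies with the particular 700 tweets drawn.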

In contrast, only 1,565 relevant tweets (3.13% of the tweets) were identified in the gold standard in [3]. This represents a loss of about 6,000 useful tweets missed by the annotators in [3].

3.3 Machine Learning for automatic filtering of tweets

We trained machine learning models for automatic classification of tweets into topics, with the aim of automatically retrieving the most useful tweets that may have been missed in the annotation exercise in [3]. As one tweet can be relevant to multiple topics, we applied supervised machine learning models separately for each topic, thus training a total of seven binary classifiers.

We used Support Vector Machines (SVM) for our classification task, as they have been found to be among the best models for text classification [4, 5]. We used the implementation of LinearSVC (SVM with linear kernel) in the scikit-learn machine-learning library [7].

Training data. As seen in Table 3, we could identify at most 53 relevant tweets for any one topic out of a sample of 700 tweets. Thus, the classification task is highly skewed, with non-relevant tweets forming a large majority.

To overcome the problems associated with such skewed classification, we used undersampling, i.e., we balanced the training data by taking only as many non-relevant tweets as we had relevant tweets.
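A minimal sketch of this kind of undersampling is given below. The helper name `undersample` and the toy data are hypothetical, and choosing the retained non-relevant tweets uniformly at random is an assumption on our part for the purpose of illustration.

```python
import numpy as np


def undersample(texts, labels, seed=0):
    """Keep all relevant tweets and an equal-sized random set of non-relevant ones."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    # Draw as many negatives as there are positives, without replacement.
    keep_neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
    keep = np.concatenate([pos, keep_neg])
    rng.shuffle(keep)
    return [texts[i] for i in keep], labels[keep]


# Toy data: 3 relevant tweets among 20.
toy_texts = [f"tweet {i}" for i in range(20)]
toy_labels = [1] * 3 + [0] * 17
balanced_texts, balanced_labels = undersample(toy_texts, toy_labels)
```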

Besides the positively labeled tweets from our sample of 700 tweets, we also had the set of relevant gold standard tweets from [3] to use for our machine learning task. Table 4 lists the final number of labeled tweets that we used for each of the topics. (Our number of gold standard tweets is slightly lower than in the original gold standard because we could not download about 500 tweets of the original collection from Twitter, as those tweets had been deleted in the meantime. Also, the numbers of relevant tweets from the two sources, our manual labeling of the sample of 700 tweets and the gold standard in [3], do not add up exactly, because some tweets are common to both.)

We applied minimal preprocessing on the tweets. The only operation that we applied was the removal of hashtag symbols (retaining the attached text).
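As an illustration, hashtag symbols can be stripped with a single regular expression. The function name is hypothetical and the exact expression is an assumption; any approach that drops the `#` while keeping the attached word would do.

```python
import re


def strip_hashtag_symbols(tweet):
    """Remove '#' when it starts a hashtag, keeping the attached text."""
    return re.sub(r"#(?=\w)", "", tweet)


cleaned = strip_hashtag_symbols("Need tents in #Kathmandu #NepalEarthquake")
```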

We randomly divided the available training data into 70% for training and 30% for testing, for each topic.

Feature extraction. Scikit-learn's CountVectorizer was used to extract token counts with a bag-of-words model. We experimented using (1) unigram features only, and (2) both unigram and bigram features, and got better results using unigram features only. We thus used only unigram features for all our remaining experiments. Also, no stemming or stopword removal was done, and tokenization of tweets was done by extracting words of at least 2 letters.

Then, TfidfTransformer was used to convert the raw counts to tf-idf weights. Thus, a bag-of-words model with unigram features of tf-idf weights was used.
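The feature extraction and classifier described above can be chained as a scikit-learn pipeline. The sketch below is a minimal reconstruction rather than our exact configuration: it assumes default settings everywhere except the explicit unigram range (the default CountVectorizer tokenizer already keeps only tokens of two or more characters), and the helper name and toy tweets are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def make_topic_classifier():
    """Unigram counts -> tf-idf weights -> linear SVM, one instance per topic."""
    return Pipeline([
        ("counts", CountVectorizer(ngram_range=(1, 1))),  # unigrams only
        ("tfidf", TfidfTransformer()),
        ("svm", LinearSVC()),
    ])


# Invented toy data: 1 = relevant to the topic, 0 = not relevant.
toy_texts = [
    "food packets available at the relief camp",
    "drinking water and tents available in Kathmandu",
    "blankets and medicines being distributed by volunteers",
    "praying for everyone in Nepal",
    "so sad to hear about the earthquake",
    "shocked by the news today",
]
toy_labels = [1, 1, 1, 0, 0, 0]

classifier = make_topic_classifier().fit(toy_texts, toy_labels)
predictions = classifier.predict(["tents and water available near the airport"])
```

In this setup, one such pipeline would be fitted independently for each of the seven topics.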

Each experiment was carried out for 100 iterations, with a fresh random partition of the data into training (70%) and test (30%) sets in each iteration, and the average of each performance metric over the 100 iterations was taken.
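The repeated random-split evaluation can be sketched as follows. Several details are assumptions made for illustration: the function name, the toy data, the use of stratified splits (which keeps both classes present in every partition), and the use of TfidfVectorizer, which scikit-learn documents as equivalent to CountVectorizer followed by TfidfTransformer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def average_over_random_splits(texts, labels, n_iter=100, test_size=0.3, seed=0):
    """Average precision, recall and F1 over repeated random 70/30 partitions."""
    scores = []
    for i in range(n_iter):
        x_tr, x_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=test_size,
            random_state=seed + i, stratify=labels,  # stratification is an assumption
        )
        model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(x_tr, y_tr)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_te, model.predict(x_te), average="binary", zero_division=0
        )
        scores.append((precision, recall, f1))
    return tuple(float(value) for value in np.mean(scores, axis=0))


# Invented, trivially separable toy data; real tweets are far noisier.
toy_texts = ([f"food and water available at camp {i}" for i in range(10)]
             + [f"praying for victims of the quake {i}" for i in range(10)])
toy_labels = [1] * 10 + [0] * 10
mean_precision, mean_recall, mean_f1 = average_over_random_splits(
    toy_texts, toy_labels, n_iter=20
)
```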

Results. The performance of the classifiers on various metrics is shown in Table 5. The precision-recall curve of the classifier for topic FMT1 is also shown.

[Figure: Precision-Recall curve for the SVM classifier for topic FMT1. x-axis: Recall (%); y-axis: Precision (%).]

3.4 Classification performance with number of examples

We tested the performance of our classifiers when using only a fraction of the available data. For each classifier and each given fraction of data, we randomly took a subset of the usable data for 100 iterations, and took the average of the performance scores for the classifier on the 100 iterations. The F1 scores of the classifiers with varying fractions of the data are shown in Table 6.
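One way to set up such an experiment is sketched below. The text above does not fix how the test set is formed for each fraction, so this sketch assumes that a random subset of the given size is drawn first and then split 70/30 into training and test sets; that choice, the stratified sampling, the function name and the toy data are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def f1_by_fraction(texts, labels, fractions, n_iter=100, seed=0):
    """Mean F1 of a tf-idf + linear SVM classifier for each data fraction."""
    results = {}
    for fraction in fractions:
        scores = []
        for i in range(n_iter):
            state = seed + i
            if fraction < 1.0:
                # Draw a random subset containing `fraction` of the labeled data.
                sub_x, _, sub_y, _ = train_test_split(
                    texts, labels, train_size=fraction,
                    random_state=state, stratify=labels,
                )
            else:
                sub_x, sub_y = texts, labels
            x_tr, x_te, y_tr, y_te = train_test_split(
                sub_x, sub_y, test_size=0.3,
                random_state=state, stratify=sub_y,
            )
            model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(x_tr, y_tr)
            scores.append(f1_score(y_te, model.predict(x_te), zero_division=0))
        results[fraction] = float(np.mean(scores))
    return results


# Invented toy data; 40 labeled tweets, half of them relevant.
toy_texts = ([f"tents and water available at camp {i}" for i in range(20)]
             + [f"praying for the victims today {i}" for i in range(20)])
toy_labels = [1] * 20 + [0] * 20
curve = f1_by_fraction(toy_texts, toy_labels, fractions=[0.25, 0.5, 1.0], n_iter=10)
```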

3.5 Retrieving most relevant tweets in the entire collection

We used the trained classifiers to retrieve the 100 most relevant tweets for each topic in the entire dataset by taking the 100 tweets with the maximum confidence scores of each classifier.

We manually checked the sets of 100 tweets corresponding to the seven topics to determine how many of them were actually relevant, and how many of the relevant ones were identified by the gold standard in [3]. The results of this exercise are shown in Table 7.
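The retrieval step can be sketched as ranking every tweet by the classifier's decision value and keeping the top k. The sketch assumes that the confidence score corresponds to the signed distance returned by `decision_function` of a fitted LinearSVC pipeline; the helper name and the toy tweets are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def top_k_by_confidence(model, tweets, k=100):
    """Return the k tweets with the highest decision scores, best first."""
    scores = model.decision_function(tweets)
    order = np.argsort(-scores)[:k]
    return [(tweets[i], float(scores[i])) for i in order]


# Invented training data for one topic (1 = relevant).
train_texts = [
    "food packets available at the relief camp",
    "drinking water and tents available in Kathmandu",
    "praying for everyone in Nepal",
    "so sad to hear about the earthquake",
]
train_labels = [1, 1, 0, 0]
model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(train_texts, train_labels)

# Invented stand-in for the full collection.
collection = [
    "tents available near the airport",
    "sad news from Nepal",
    "water and food available at the camp",
    "praying for the victims",
]
top_tweets = top_k_by_confidence(model, collection, k=3)
```

The ranked list would then be handed to an annotator, who confirms relevance by hand and checks each relevant tweet against the existing gold standard.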

2. Pooling works only when the number of participating systems is large, and the systems are diverse. Unlike in TREC tracks, the number of participants in [3] was not large, and so the standard pooling employed in Phase 3 also failed to find all relevant tweets. (Zobel [9] studies the reliability of pooling, and concludes that it is reliable if the pool is deep enough, i.e., if many of the top results from all systems are taken into account. This holds for TREC, with a depth of the top 100 documents from each participating system, but taking only the top 30 documents, as was done in [3], may not have been enough.)

Since exhaustive annotation is not possible for the complete collection, a machine learning model such as the one presented in this paper can be used to find relevant tweets in the remaining collection. The model is trained and then applied to the remaining data to retrieve the tweets with the highest confidence scores, and their relevance is confirmed manually for as many tweets as annotator time permits.

Another approach could be to exhaustively annotate a small random subset of the data, and then use keywords of the relevant-marked tweets to query into the entire collection, to retrieve relevant tweets in the remaining collection. This is one future possibility for us to experiment with.

Some of the relevant tweets that were missed in the creation of the gold standard in [3] are listed in Table 8.

We were able to achieve reasonably high F1 scores for our classifiers even with a training size of a few hundred examples (Table 6). This shows that automatic text classification is a viable approach to extract useful information from tweets during times of disasters, since a few hundred examples can easily be annotated in a short amount of time. It may also be fruitful to train supervised machine learning models in advance for different types of disaster situations, and use them in times of disaster until newly annotated data is obtained.

To improve on the machine learning model, some avenues to explore are:

- using more features, including word embeddings, spatio-temporal features, linguistic features (as used in [8]), etc.
- employing better preprocessing techniques, like using Twitter-specific spelling correction, expanding common Twitter abbreviations, better data cleaning, etc.

5 Acknowledgements

We thank the anonymous reviewers for their thorough comments.

References

1. Internet becomes a lifeline in Nepal after earthquake. http://www.computerworld.com/article/2914641/internet/internet-becomes-a-lifeline-in-nepal-after-earthquake.html, accessed: 2017-03-16

2. Efron, B., Tibshirani, R.J.: An introduction to the bootstrap. CRC Press (1994)

3. Ghosh, S., Ghosh, K.: Overview of the FIRE 2016 microblog track: Information extraction from microblogs posted during disasters. Working notes of FIRE pp. 7-10 (2016)

4. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: European conference on machine learning. pp. 137-142. Springer (1998)

5. Khan, A., Baharudin, B., Lee, L.H., Khan, K.: A review of machine learning algorithms for text-documents classification. Journal of Advances in Information Technology 1(1), 4-20 (2010)

6. Mills, A., Chen, R., Lee, J., Raghav Rao, H.: Web 2.0 emergency applications: How useful can Twitter be for emergency response? Journal of Information Privacy and Security 5(3), 3-26 (2009)

7. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)

8. Rudra, K., Ghosh, S., Ganguly, N., Goyal, P., Ghosh, S.: Extracting situational information from microblogs during disaster events: a classification-summarization approach. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 583-592. ACM (2015)

9. Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. pp. 307-314. ACM (1998)