Information Retrieval from Microblogs during Natural Disasters

Roshni Chakraborty
Indian Institute of Technology Patna, Bihar, India
roshni.pcs15@iitp.ac.in

Maitry Bhavsar
Indian Institute of Technology Patna, Bihar, India
bhavsar.mtcs15@iitp.ac.in

ABSTRACT
In this paper, we devise an information retrieval system that filters and ranks tweets according to their relevance to a query. We devise methods to learn the relationships among entities and action verbs from a small set of manually annotated tweets. We then use these relationships to filter tweets and rank them accordingly. Our results (as published by the FIRE Microblog Track) show a high precision score for the top 20 retrieved tweets.

1. INTRODUCTION
The FIRE 2016 Microblog track [1] provided us with about 50,000 tweets related to the Nepal earthquake of April 2015. In this paper, we segregate tweets into different categories, namely, availability of resources, requirement of resources, availability of medical resources and facilities, requirement of medical resources and facilities, and information related to infrastructure destruction or restoration. We devised a mechanism to learn the text attributes of tweets in order to segregate them into these categories.

We manually annotate a random sample of 1000 tweets into the specified categories of information; a tweet can belong to multiple groups. For example, a tweet about the destruction of a bridge might also convey information about the requirement of basic amenities. Different categories of tweets have different text attributes that pertain to the specific information related to that query. We aimed at identifying those text attributes, i.e., the combinations of words that characterize a particular query. We further created networks of each query's text attribute combinations. The edges represent the interrelationships among these text attributes, which aid in the segregation of tweets according to the different queries. We describe the methodology in detail in later sections.

Tweets are informal, so a vocabulary gap exists even among tweets of the same stratum. We therefore did not depend only on text analysis of named entities, like food packets, but combined them with the set of important verbs that identify the correct relationship among them. We weighted the identified keywords of each category according to their relevance to the query. We could thereby identify tweets by the presence of keywords relevant to a query. The published results from FIRE suggest that we could accurately identify tweets of high relevance for the different categories with good precision and recall.

The paper is divided into the following sections. We discuss data collection and pre-processing in the next section, followed by our procedure for identifying tweets in Section 3, the selection and scoring of tweets in Section 4, the results in Section 5, and the conclusion in Section 6.

2. DATA COLLECTION AND PRE-PROCESSING
The FIRE Microblog Track provided a set of about 50,000 tweet-ids, which we used to access the tweets. We retained only the relevant information from these tweets, which consists of the tweet text, tweet id, etc. We further removed some tweets from the whole set. For example, during disasters, a number of tweets express grief or urge people to pray or help. These are general messages, so we made a bag of words that express only urging, requesting, praying, etc., and removed the tweets that contain words from this bag.

3. METHODOLOGY
In this section, we discuss our procedure. We do not use any external source of information. We use the NLTK toolkit (www.nltk.org) to perform text-based analysis on tweets. We rely on tweet text attributes to filter tweets of relevance. In order to understand the text attributes of tweets, we select a random sample of 1000 tweets from the whole set of 50,000 tweets. We manually group these tweets according to the different queries given by FIRE; a tweet can belong to more than one category.

We perform a set of operations on the tweet text of every group. Firstly, we remove the stopwords from these tweets, since stopwords hardly represent any special characteristic of an entity. After removal of the stopwords, we use a POS tagger to select only the nouns and verbs from the tweets. We then rank the entities of all tweets according to frequency. We select a subset of these entities according to their ranks, and we also include the entities specified in the query itself by FIRE. This step gives us a list of the important entities for a specific query.

Often, entity-to-entity matching fails to distinguish tweets of different genres, i.e., a tweet containing information about medical aid can highlight either the availability or the requirement of the same. So, we identify the different sets of possible actions of the entities in order to understand the underlying relationships. We further rank the bigrams to identify the set of working verbs that highlight specific actions. This set of related working verbs and entities signifies tweets of a particular category.

However, there remains a vocabulary gap among tweets of even the same category due to their informal structure. Tweets about both the requirement and the availability of medical resources may contain entities like blood and working verbs like donate, yet be completely different in meaning. Hence, segregation only on the basis of keywords fails to differentiate these relationships. We analyze the context of those keyword relationships, which reflects the actual meaning, such as the absence of question tags (where, how, what, etc.) or request tags (please, etc.) in availability-based tweets. The segregation of tweets into different categories thus requires identification of the proper entities, actions, and context to understand a tweet's relevance.

We further rank the entities and the action verbs according to their importance, as explained in Section 4.1. We formulate a separate bipartite graph for each query, which represents the relationships among the entities, actions and context. One set of nodes represents the entity names, and another set of nodes represents the verbs (i.e., actions). These relationships were formulated from the manually annotated tweets. We give a brief overview of the specific words and their relationships for each query in the following subsections.

We group the different types of action verbs from our manually annotated tweets into similar groups. The main action verbs represent donations, transport, relief information, and build. We represent the relationships between these sets of action verbs and the sets of entities in graphs 1 and 2, and the set of keywords of each group in Table 1. A new tweet is selected if it contains an existing relationship, as indicated by an arrow, i.e., it must contain at least one entity and one verb from the nodes that the arrow connects.

Figure 1: Graph Relationships of Resource Availability Information

Figure 2: Graph Relationships of Resource Requirement Information

Node Name    Words Representing the Nodes
Green4       off to nepal
Green5       survivor, victim, affect
Green6       food, water, cloth, blanket, biscuit, power, plane, bus, material, beef, equipment
Green7       volunteer, helicopter, item, tool, app
Green8       team
Green9       shelter, tent, house, home
Green10      relief, rescue
Blue1        donate
Blue2        transfer, sell, distribut, suppl, send, sent, deliver, dispatch, offer, land, deploy, transport, prepar
Blue3        relief, rescue, working, aid, support, engage, rush
Blue4        build
Blue5        need, want, require

Table 1: Word Dictionary of Resource Availability and Requirement Related Information

3.1 Requirement of Resources
In this section, we intend to filter all tweets that mention the requirement or need of some resource, like human resources or infrastructure such as tents, water filters, power supply, etc. We studied our manually annotated tweets and found that the main action verbs that denote requirement of resources are need related or relief related. We highlight the different relationships among these entities in figure 3 and include details of the different terms in Table 1. We later select a tweet from the total list of tweets if it contains the relationship represented by an arrow, i.e., it contains at least one entity and one action verb from the lists of keywords that the arrow connects.

3.2 Availability of Medical Resources
In this section, we identify messages that mention the availability of some medical resources like blood, blood bank, medicine, etc. Firstly, we identify the different action verbs from the manually annotated tweets that contain information related to this query; the verbs are namely donation, transport, rescue, etc. Some action verbs are ambiguous in meaning; for example, "need" reflects both the need and the availability of resources. On further analysis of tweets that mention "need", we found that "need" is used in availability tweets only in conditional statements (for example, in a clause introduced by "if").

We represent the actions and their corresponding entities in the next two graphs, namely graph 4 and graph 2; the arrows represent the relationships between the two. Table 2 lists the set of keywords for each entity or action. We select all those tweets from the whole set of fifty thousand tweets that contain the relationships, i.e., at least one keyword from each of the two nodes of an arrow. There are also some more stringent relationships, which comprise more than just an entity and an action name, as illustrated in graph 4.

Figure 3: Graph II Relationships of Medical Resource Availability Information

3.3 Requirement of Medical Resources
In this section, we identify messages that mention the requirement of some medical resources like blood, blood bank, medicine, etc. We represent the actions and their corresponding entities in graph 5; the arrows represent the relationships between the two. Table 2 lists the set of keywords for each entity or action. We select all those tweets from the whole set of fifty thousand tweets that contain the relationships, i.e., at least one keyword from each of the two nodes of an arrow.

Figure 4: Graph Relationships of Medical Resource Requirement Information

3.4 Infrastructure Damage And Report of Restoration
In this section, we identify messages that mention the damage or restoration of any communication or structural infrastructure. General statements about a structure, however, are not relevant. We extract the possible set of infrastructure names from our manually annotated tweets, together with the different sets of actions related to them. After detecting the relationships between the action verbs and entity names from the manually annotated tweets, we select only those tweets that contain these relationships. We visualize the different relationships among the sets of entities in Figure 6, and highlight the set of keywords in Table 3.

Figure 5: Graph Relationships for Devastation Related Information
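The selection rule used throughout Section 3 (a tweet is kept if it contains at least one keyword from each of the two nodes that an arrow connects) can be sketched as follows. This is a minimal illustration, not our exact implementation: the node excerpt is taken from Table 1, but the arrows in ARROWS, the function names, and the prefix-matching rule for stems such as "distribut" are assumptions made for the example, since the graphs that define the actual arrows are not reproduced here. Multi-word keywords such as "off to nepal" are left out of this sketch.

```python
import re

# Excerpt of the node dictionary from Table 1 (single-word keywords only).
NODES = {
    "Green6": ["food", "water", "cloth", "blanket", "biscuit"],
    "Green9": ["shelter", "tent", "house", "home"],
    "Blue2": ["send", "sent", "deliver", "distribut", "suppl", "dispatch"],
    "Blue5": ["need", "want", "require"],
}

# Hypothetical arrows (entity node, action node) per query; the real arrows
# are defined by the relationship graphs, which are not available here.
ARROWS = {
    "resource_availability": [("Green6", "Blue2"), ("Green9", "Blue2")],
    "resource_requirement": [("Green6", "Blue5"), ("Green9", "Blue5")],
}


def tokenize(text):
    """Lowercase the tweet and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def node_matches(tokens, keywords):
    """True if any token starts with any keyword of the node.

    Prefix matching is an assumption that lets stems such as
    'distribut' match 'distributed'; it can also over-match.
    """
    return any(tok.startswith(kw) for tok in tokens for kw in keywords)


def matches_query(text, query):
    """True if the tweet contains both ends of at least one arrow."""
    tokens = tokenize(text)
    return any(
        node_matches(tokens, NODES[entity]) and node_matches(tokens, NODES[action])
        for entity, action in ARROWS[query]
    )
```

For instance, a tweet such as "Army distributed food packets in Kathmandu" contains an entity from Green6 and an action from Blue2, so it satisfies the assumed availability arrow but not the requirement arrows. Context checks such as question tags, request tags, or conditional uses of "need" would have to be layered on top of this rule.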
Node Name    Words Representing the Nodes
Green5       blood, bloodbank, medicine, medical, doctor
Green6       healthcare, hospital, patient, diabities
Green7       provide, survivor, victim, affect
Blue1        donate, donated
Blue2        equipment, reach, transfer, sell, distribut, suppl, send, sent, deliver, dispatch, offer, land, deploy, transport, prepar, continu
Blue3        rescue, relief, support, engag, rush
Blue4        need, want, require
Blue5        call, contact, helpline

Table 2: Word Dictionary of Medical Resource Availability and Requirement Information

Node Name    Words Representing the Nodes
Green1       hotel, debris, building, temple, rubble, tower, road, bridge, house, railway, dam, tent, heritage, monument, power grid, engineer, electricity
Blue1        reduce, flatten, destroy, devastat, avalanche, damage, restore, capture, collapse, build, builds
Blue2        devastat, terrif, heartbreak
Blue3        footage, image, picture

Table 3: Word Dictionary for Devastation Related Information

Node Name    Words Representing the Nodes    Score
Action1      relief, rescue, aid    0.10
Action2      build, transfer, sell, distribut, send, sent, deliver, supply, donat, need    0.4
Action3      deploy, dispatch, lad, transport, fly    0.3
Action4      prepare, offer, launch, allow, provide, make, support, engag, rush, help, working, in action    0.2
Entity1      volunteer, food, biscuit, shelter, tent, house, home, cloth, blanket    0.7
Entity2      power, equipment, material, item, team, helicopter, bus, plane, call, helpline, contact    0.5

Table 4: KeyWord Relevance Score

4. SELECTION OF TWEETS
The above graphs represent the different entities and their sets of actions for a particular query. For a given query, we match the relationships in a new tweet against the prescribed relationships. A tweet is selected if it contains the specified relationships of entities for that query. We then rank the selected tweets according to their relevance to the query, as described in the next section.

4.1 Score of Tweets
In this section, we rank the selected tweets by their relevance to the query. In order to rank the tweets, we score the different keyword relationships of a query. The keywords are segregated into two sections, entities and action verbs. We give more importance to words that signify better temporal relevance than others, i.e., there is a major difference between tweets like "food items sent to affected areas by Indian government", "India dispatched 500 packets of rice to Nepal" and "India will dispatch food packets by saturday". We give a brief description of our scoring mechanism.

1. Temporal Importance: An action verb is given more importance if it highlights immediate action rather than future action. This is illustrated by Action2, Action3 and Action4.

2. Relevance: Some action verbs represent greater relevance in times of calamity, as expressed in Action1. Similarly, some entities (as in Entity1) are basic needs of human livelihood, like food and shelter, and are more important than information related to other entities (as in Entity2).

The scores of the keywords are given in Table 4. A tweet's score is the sum of its keywords' scores. We then rank the tweets by their relevance scores.

5. RESULTS
In this section, we highlight our results. The FIRE Microblog Track matched our selected tweets against a manual annotator's results. We briefly explain the metrics below; our results are given in Table 5. The metrics are:

1. Precision at rank 20, i.e., considering up to the top 20 tweets for each topic.

2. Recall at rank 1000.

3. Mean Average Precision at rank 1000.

4. MAP overall, i.e., considering all tweets retrieved in the run.

Metric Name     Result
Precision@20    0.770
Recall@1000     0.4344
MAP@1000        0.2186
Overall MAP     0.2208

Table 5: Result

6. CONCLUSION
In this paper, we devise a mechanism to extract the contextual and content relationships of entities. We are able to filter tweets of high relevance for different queries by matching these relationships. We require only a small number of manually annotated tweets to attain our results.

7. REFERENCES
[1] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, December 2016.
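The additive scoring of Section 4.1 can be sketched as below. This is a hedged illustration rather than our exact implementation: the keyword groups and scores are copied from Table 4, while the function names, the prefix-matching rule, and the choice to count each distinct matching keyword once are assumptions. The multi-word keyword "in action" is omitted because this sketch matches single tokens only.

```python
import re

# Keyword groups and scores from Table 4 ("in action" omitted, see lead-in).
KEYWORD_SCORES = {
    "Action1": (["relief", "rescue", "aid"], 0.10),
    "Action2": (["build", "transfer", "sell", "distribut", "send", "sent",
                 "deliver", "supply", "donat", "need"], 0.4),
    "Action3": (["deploy", "dispatch", "lad", "transport", "fly"], 0.3),
    "Action4": (["prepare", "offer", "launch", "allow", "provide", "make",
                 "support", "engag", "rush", "help", "working"], 0.2),
    "Entity1": (["volunteer", "food", "biscuit", "shelter", "tent", "house",
                 "home", "cloth", "blanket"], 0.7),
    "Entity2": (["power", "equipment", "material", "item", "team",
                 "helicopter", "bus", "plane", "call", "helpline",
                 "contact"], 0.5),
}


def tokenize(text):
    """Lowercase the tweet and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def score_tweet(text):
    """Sum the scores of all distinct keywords found in the tweet.

    Assumption: a keyword counts once if any token starts with it,
    so the stem 'dispatch' matches 'dispatched'. Short keywords such
    as 'bus' or 'lad' can over-match under this rule.
    """
    tokens = tokenize(text)
    total = 0.0
    for keywords, score in KEYWORD_SCORES.values():
        for kw in keywords:
            if any(tok.startswith(kw) for tok in tokens):
                total += score
    return total


def rank_tweets(tweets):
    """Return the tweets sorted by descending relevance score."""
    return sorted(tweets, key=score_tweet, reverse=True)
```

Under these assumptions, "Food and tents sent to affected areas" collects the scores of "food", "tent" and "sent", whereas "India dispatched 500 packets of rice to Nepal" collects only the score of "dispatch", so the first tweet is ranked higher.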