Using WordNet for Query Expansion: ADAPT @ FIRE 2016 Microblog Track

Wei Li, Debasis Ganguly, Gareth J. F. Jones
ADAPT Centre, School of Computing
Dublin City University, Dublin 9, Ireland
{wli,dganguly,gjones}@computing.dcu.ie

ABSTRACT
User-generated content on social websites such as Twitter is known to be an important source of real-time information on significant events as they occur, for example natural disasters. Our participation in the FIRE 2016 Microblog track seeks to exploit WordNet as an external resource for synonym-based query expansion to support improved matching between search topics and the target tweet collection. The results of our participation show that this is an effective method for use with a standard BM25-based information retrieval system for this task.

CCS Concepts
• Information systems → Query reformulation;

Keywords
Microblog Search; WordNet; Query Expansion

1. INTRODUCTION
User-generated content on social media websites such as Twitter is known to be an important real-time source of information on various events as they occur, including disaster events like floods, earthquakes and terrorist attacks. If information relevant to these events can be reliably identified automatically, there is huge potential to exploit it in the management of the response to these events by disaster and relief agencies. This raises the challenge of developing methods to identify the truly relevant information from among the vast volume of content posted to mainstream social media channels. The FIRE 2016 Microblog task [1] is motivated by this scenario, and aims to promote development of information retrieval (IR) methods to extract important information from microblogs posted during disasters.

Our analysis of the task data showed that a significant problem in addressing this task is the difficulty of matching the words present in each search topic with those used in the very short microblog documents. Such differences relate both to different choices of vocabulary in the topics and documents, and to differing levels of specificity in the words used. In order to address this challenge, we investigate the use of synonym-based query expansion using WordNet for each search topic. The motivation for this approach is to expand the topic to enhance the chance of matching it with relevant tweets in the target collection. The risk of adopting this strategy, using resources such as WordNet without taking into account the context within the topic, is that apart from matching relevant items, we will also match large numbers of non-relevant items. In this case our objective of increasing recall of relevant tweets will be tempered by low precision arising from retrieval of non-relevant tweets.

In the following sections, we first describe the task in overview. We then introduce our method and the experiments that we carried out using it, including details of the dataset and the external resources that we used. Finally, we present the results that we obtained and draw conclusions.

2. TRACK DESCRIPTION
The FIRE 2016 Microblog track [1] requires the identification of relevant items from within a large set of microblogs (tweets) posted during a recent disaster event, for a set of given topics (in TREC format). Each topic identifies a broad information need during a disaster, such as what resources are needed by the population in the disaster-affected area, what resources are available, what resources are required / available in which geographical region, and so on. Specifically, each topic contains a title, a brief description, and a more detailed narrative describing in summary what types of tweets will be considered relevant to the topic. Task participants are required to develop methodologies for extracting tweets that are relevant to each topic with high precision as well as high recall.

The main challenges for this ad-hoc search task are:

• Identifying keywords relevant to each broad topic within tweets that contain only a few words (140 characters at most). Most tweets do not contain specific keywords relating to the disaster, even when the tweet itself is relevant to the search topic.

• Dealing with noise in the content of the short tweet documents, which are often written in an informal style using abbreviations, colloquial terms, etc.

3. EXPERIMENTAL METHODS AND PROCEDURES
We begin this section by summarising details of the dataset, and then describe our experiments and the results obtained.

3.1 Data
In order to obtain the dataset of tweets for the task, we followed the instructions provided by the task organizers. They provided:

• a text file of 50,068 tweet ids;

• a Python script, along with the libraries that are required by this script, to crawl the tweets.

We used their instructions to download the listed tweets arising from the Nepal earthquake in April 2015. A total of 49,894 tweets were downloaded and written into a JSON file. We then prepared a JSON parser to decode and extract the information that we needed, which consisted of only the tweet id and content of each tweet. Listing 1 shows the code.

Listing 1: Json Parser Code

    import json

    if __name__ == "__main__":
        aDict = {}
        # '*.jsonl' stands for the crawled file (one JSON-encoded tweet per line)
        with open('*.jsonl', 'r', encoding='utf8') as source:
            for line in source:
                data = json.loads(line)
                tid = str(data['id'])
                ttext = data['text']
                aDict[tid] = ttext
        # '*.txt' stands for the output file
        with open('*.txt', 'w', encoding='utf8') as target:
            for tid in aDict:
                # write one tweet per line: id followed by text
                target.write(tid + ' ' + aDict[tid])
                target.write('\n')
                print(tid)

The provided query set contained 7 topics in TREC format, each of which contains three parts: a title, a brief description, and a more detailed narrative on what types of tweets will be considered relevant to the topic. Listing 2 presents an example of a TREC format topic:

Listing 2: TREC Topic Example

    <num> Number: FMT1
    <title> WHAT RESOURCES WERE AVAILABLE
    <desc> Description: Identify the messages which describe the availability of some resources.
    <narr> Narrative: A relevant message must mention the availability of some resource like food, drinking water, shelter, clothes, blankets, human resources like volunteers resources to build or support infrastructure like tents, water filter, power supply and so on. Messages informing the availability of transport vehicles for assisting the resource distribution process would also be relevant. However, generalized statements without reference to any resource or messages asking for donation of money would not be relevant.

3.2 Experiments and Results
Based on our observation of the probable query-document mismatch problems arising from the short length of the tweets and the differing use of vocabulary in the topics and the tweets, we explore the use of WordNet (https://wordnet.princeton.edu/) to improve the reliability of query-document matching. We used WordNet to generate synonyms for the terms in each topic. Two experiments were conducted based on WordNet. In these experiments, Lucene was used to index the tweet set and to carry out the IR. The indexing process consisted of the following steps:

1. entries from a list of 655 stop words were removed;
2. the Porter stemmer was used for stemming the words;
3. BM25 was used as the retrieval model, with k1 = 1.2 and b = 0.75.

3.2.1 Query Expansion using WordNet
WordNet is an electronic lexical database and is regarded as one of the most important resources available to researchers in computational linguistics, text analysis, and many related areas. Its design is inspired by current psycholinguistic and computational theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexicalized concept. Different relations link the synonym sets [2].

WordNet has long been regarded as a potentially useful resource for query expansion in IR. However, it has met with limited success due to its tendency to include synonyms that are unrelated to the context of the query. One of the successful applications of WordNet in IR is found in [4], which uses the comprehensive WordNet thesaurus and its semantic relatedness measure modules to perform query expansion on a document retrieval task. The authors obtained a 7% improvement in retrieval effectiveness compared to using the original query for search. Pal et al. [3] combined terms obtained from three different resources, including WordNet, for use as expansion terms. Their method was tested on a TREC ad hoc test collection with impressive results.

In this experiment, we also use WordNet to carry out query expansion. WordNet is used as an external resource to generate the synonyms for each topic. We limited the number of synonyms for each topic term to a maximum of 20; some terms received fewer synonyms.

3.2.2 Experiment One
Our first run (named dcu_fmt16_1) uses our automatic method based on WordNet query expansion. In this run, the following 4 steps were applied:

1. remove stop words from each topic;
2. use WordNet to generate the synonyms for each term in every topic;
3. add these synonyms back to each topic as expansion terms;
4. use the expanded topics as new topics to search again (the BM25 retrieval model is used for retrieval).

We use the combination of the title and narrative fields of the topic as the original topic. An example of an original topic and its expanded version is shown in the Appendix.

3.2.3 Experiment Two
Our second run (named dcu_fmt16_2) is a semi-automatic run, which means that manual selection is involved. This run was carried out using the following steps:

1. use the original topic to search and obtain a ranked list;
2. go through the top 30 tweets of the ranked list to select 1-2 relevant tweets to use for query expansion. The number 30 was selected to make sure that we could find at least one relevant tweet for some topics;
3. remove the stop words and duplicate terms from the selected tweets, and add the remaining terms to the original topic;
4. apply WordNet to the expanded topics to find synonyms for these terms;
5. finally, add the synonyms to each expanded topic to generate new topics, and use them as queries to search again to obtain the final search results.

3.2.4 Experimental Results
Since the aim of this track is to identify a set of tweets that are relevant to each topic, set-based evaluation metrics of precision, recall, and MAP are used for evaluation. The gold standard, against which the set of tweets identified by the participants is matched, was generated using a "manual run" where human assessors were given the same set of tweets and topics, and asked to identify all possible relevant tweets using a search engine (Indri). While judging the participants' runs, the track organizers arranged for a second round of assessments to judge the relevance of tweets that were identified by the participants but were not identified during the first round of human assessment.

Results of our two runs are shown in Table 1. The table shows results for 4 runs: two are automatic and the other two are semi-automatic. Among the automatic runs, our submission was placed third. The Precision@20 of the best automatic run is 0.4357, whereas ours is 0.3786. However, our automatic run achieved the best MAP@1000 value of 0.1103, an increase of roughly 27% relative to the MAP@1000 of the top-ranked automatic run (0.0869). Our overall MAP is lower because we only submitted the top 1000 tweets for each topic while other participants submitted more. Our semi-automatic run was placed first, with a Precision@20 that is 33.35% higher than that of the second-placed run. These numbers show that using WordNet to generate synonyms for topic terms is a positive way to carry out query expansion for this Microblog task.

Table 1: Our Results and Comparison with Others

Run Type        Run Name                             Rank   Precision@20   Recall@1000   MAP@1000   MAP
Automatic run   iiest_saptarashmi_bandyopadhyay_1    1      0.4357         0.3420        0.0869     0.1125
Automatic run   dcu_fmt16_1                          3      0.3786         0.3578        0.1103     0.1103
Semi-auto run   dcu_fmt16_2                          1      0.4286         0.3445        0.0815     0.0815
Semi-auto run   iitbhu_fmt16_1                       2      0.3214         0.2581        0.0670     0.0827

4. CONCLUSIONS AND FURTHER WORK
For our submissions to the FIRE 2016 Microblog Track, we employed WordNet as an external resource to carry out query expansion by retrieving the synonyms of each topic term and using them as additional query terms to reformulate each topic. We conducted two runs using this method, an automatic run and a semi-automatic run. The semi-automatic run involved manual selection of relevant tweets from a first run and application of WordNet in a subsequent retrieval stage. Our automatic run was placed third among the submissions, although with the best MAP@1000 value. Our semi-automatic run obtained the overall first place. These positive results show that when a topic is too general and does not contain the necessary terms to match relevant documents, using WordNet as an external resource to generate synonyms is a good way to make the topic more effective. Potentially, using WordNet to retrieve hypernyms or hyponyms for each topic term may be another method worth attempting for this task.

5. ACKNOWLEDGEMENT
This research is supported by Science Foundation Ireland in the ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University.

6. REFERENCES
[1] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Wordnet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.
[3] D. Pal, M. Mitra, and K. Datta. Improving Query Expansion Using Wordnet. CoRR, abs/1309.4938, 2013.
[4] J. Zhang, B. Deng, and X. Li. Concept Based Query Expansion Using Wordnet. In Proceedings of the 2009 International e-Conference on Advanced Science and Technology, AST '09, pages 52–55, Washington, DC, USA, 2009. IEEE Computer Society.

Appendix
Original topic:
<num> Number: FMT6 <title> WHAT WERE THE ACTIVITIES OF VARIOUS NGOs / GOVERNMENT ORGANIZATIONS <narr> Narrative: A relevant message must contain information about relief-related activities of different NGOs and Government organizations in rescue and relief operation. Messages that contain information about the volunteers visiting different geographical locations would also be relevant. However, messages that do not contain the name of any NGO / Government organization would not be relevant.
Expanded topic:
<num> Number: FMT1 <narr> were activities assorted respective several diverse versatile various NGOs government organization organisation arrangement system administration governance governing body establishment brass constitution formation organizations a relevant message mustiness moldiness must incorporate comprise hold bear carry control hold in check curb moderate take turn back arrest stop hold back contain information about relief related activities different organization organisation arrangement system administration governance governing body establishment brass constitution formation organizations indium atomic number four9 indiana hoosier state inwards inward in deliverance delivery saving deliver rescue relief operation. message content subject matter substance messages that incorporate comprise hold bear carry control hold in check curb moderate take turn back arrest stop hold back contain information about volunteers visit see travel call in call inspect inflict bring down impose chew fat shoot breeze chat confabulate confab chitchat chit chat chatter chaffer natter gossip jaw claver visiting different geographical location placement locating position positioning emplacement localization localisation fix locations would besides too likewise well also relevant. message content subject matter substance messages that do not incorporate comprise hold bear carry control hold in check curb moderate take turn back arrest stop hold back contain name whatever whatsoever any NGO government organization would not relevant
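To make the expansion procedure of Experiment One concrete, the following is a minimal Python sketch of the four steps (stop word removal, synonym lookup, a cap of 20 synonyms per term, and adding the synonyms back to the topic). It is not the code used for the submitted runs. The synonym table TOY_SYNONYMS, the stop list TOY_STOP_WORDS and the function expand_topic are hypothetical names introduced for illustration; the table is a tiny hand-written stand-in for a real WordNet lookup, and the sketch places synonyms after each term whereas the expanded topic above lists them before it.

```python
import re

# Hypothetical stand-in for WordNet: a tiny hand-written synonym table.
# A real system would query the WordNet database instead.
TOY_SYNONYMS = {
    "various": ["assorted", "several", "diverse"],
    "rescue": ["deliverance", "delivery", "saving"],
    "visiting": ["visit", "see", "call"],
}

# Illustrative stop list only; the paper reports using a list of 655 stop words.
TOY_STOP_WORDS = {"what", "were", "the", "of", "a", "in", "and"}


def expand_topic(text, synonyms, stop_words, max_synonyms=20):
    """Return the topic terms, each followed by up to max_synonyms synonyms."""
    # Step 1: lowercase, tokenise on letters, and drop stop words.
    terms = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]
    expanded = []
    for term in terms:
        expanded.append(term)
        # Steps 2-3: look up synonyms, cap their number, add them to the topic.
        for syn in synonyms.get(term, [])[:max_synonyms]:
            if syn not in expanded:
                expanded.append(syn)
    # Step 4 (searching with the expanded topic) is left to the retrieval engine.
    return expanded


if __name__ == "__main__":
    topic = "What were the activities of various NGOs in rescue and relief"
    print(" ".join(expand_topic(topic, TOY_SYNONYMS, TOY_STOP_WORDS)))
```

Because the lookup ignores context, every sense of a term contributes synonyms, which is the source of the unrelated expansion terms visible in the expanded topic above.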
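The role of the BM25 parameters reported in Section 3.2 (k1 = 1.2, b = 0.75) can be illustrated with a short Python sketch of the textbook Okapi BM25 scoring function. This is an assumption-laden illustration, not a reproduction of Lucene's scoring: the function name bm25_scores and the toy documents are hypothetical, documents are assumed to be already tokenised, stopped and stemmed, and the IDF uses a common non-negative variant.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with textbook BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption; other variants exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalisation, k1 controls term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    toy_docs = [
        ["water", "food", "available"],
        ["earthquake", "nepal"],
        ["food", "shelter", "needed", "food"],
    ]
    print(bm25_scores(["food", "water"], toy_docs))
```

With an expanded topic, the query simply contains more terms, so any tweet that matches a synonym receives a non-zero score; this is how expansion raises recall and also how unrelated synonyms can lower precision.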