Using WordNet for Query Expansion: ADAPT @ FIRE 2016
Microblog Track
Wei li Debasis Ganguly Gareth J. F. Jones
ADAPT Centre
School of Computing
Dublin City University, Dublin 9, Ireland
{wli,dganguly,gjones}@computing.dcu.ie
ABSTRACT
User-generated content on social websites such as Twitter is known to be an important source of real-time information on significant events as they occur, for example natural disasters. Our participation in the FIRE 2016 Microblog track seeks to exploit WordNet as an external resource for synonym-based query expansion to support improved matching between search topics and the target tweet collection. The results of our participation in this task show that this is an effective method for use with a standard BM25-based information retrieval system for this task.

CCS Concepts
• Information systems → Query reformulation;

Keywords
Microblog Search; WordNet; Query Expansion

1. INTRODUCTION
User-generated content on social media websites such as Twitter is known to be an important real-time source of information on various events as they occur, including disaster events like floods, earthquakes and terrorist attacks. If information relevant to these events can be reliably identified automatically, there is huge potential to exploit it in the management of the response to these events by disaster and relief agencies. This raises the challenge of developing methods to identify the truly relevant information from among the vast volume of content posted to mainstream social media channels. The FIRE 2016 Microblog task [1] is motivated by this scenario, and aims to promote the development of information retrieval (IR) methods to extract important information from microblogs posted during disasters.

Our analysis of the task data showed that a significant problem in addressing this task is the difficulty of matching the words present in each search topic with those used in the very short microblog documents. Such differences relate both to different choices of vocabulary in the topics and documents, and to differing levels of specificity in the words used. In order to address this challenge, we investigate the use of synonym-based query expansion using WordNet for each search topic. The motivation for this approach is to expand the topic so as to increase the chance of matching it with relevant tweets in the target collection. The risk of adopting this strategy using resources such as WordNet, without taking into account the context within the topic, is that apart from matching relevant items we will also match large numbers of non-relevant items. In this case, our objective of increasing the recall of relevant tweets will be tempered by low precision arising from the retrieval of non-relevant tweets.

In the following sections, we first give an overview of the task, then introduce our method and the experiments that we carried out with it, including details of the dataset and the external resources that we used, and finally present the results that we obtained and draw conclusions.

2. TRACK DESCRIPTION
The FIRE 2016 Microblog track [1] requires the identification of relevant items from within a large set of microblogs (tweets) posted during a recent disaster event, for a set of given topics (in TREC format). Each topic identifies a broad information need during a disaster, such as what resources are needed by the population in the disaster-affected area, what resources are available, what resources are required or available in which geographical region, and so on. Specifically, each topic contains a title, a brief description, and a more detailed narrative describing in summary what types of tweets will be considered relevant to the topic. Task participants are required to develop methodologies for extracting tweets that are relevant to each topic with high precision as well as high recall.

The main challenges for this ad-hoc search task are:

• Identifying, within each tweet, specific keywords relevant to each broad topic: a tweet contains only a few words (140 characters at most), and most tweets do not contain specific keywords relating to the disaster even when the tweet itself is relevant to the search topic.

• Dealing with noise in the content of the short tweet documents, which are often written in an informal style using abbreviations, colloquial terms, etc.

3. EXPERIMENTAL METHODS AND PROCEDURES
We begin this section by summarising details of the dataset, and then describe our experiments and the results obtained.

3.1 Data
Listing 1: Json Parser Code

# Python 2 script: extract the tweet id and text of each crawled tweet
# (one JSON object per line) and write them to a plain-text file.
import json

if __name__ == "__main__":
    aDict = {}
    source = open('*.jsonl', 'r')          # crawled tweets, one JSON object per line
    for line in source:
        data = json.loads(line)
        tid = str(data['id'])
        ttext = data['text'].encode('utf8')
        aDict[tid] = ttext
    source.close()
    target = open('*.txt', 'w')            # output: one "<tweet id> <tweet text>" per line
    for i in aDict:
        line = str(i) + ' ' + aDict[i]
        target.write(line)
        target.write('\n')
        print i
    target.close()
In order to obtain the dataset of tweets for the task, we followed the instructions provided by the task organizers. They provided:

• a text file of 50,068 tweet ids;

• a Python script, along with the libraries required by this script, to crawl the tweets.

We followed these instructions to download the listed tweets, which relate to the Nepal earthquake of April 2015. A total of 49,894 tweets were downloaded and written into a JSON file. We then prepared a JSON parser to decode and extract the information that we needed, which consisted of only the tweet id and the content of each tweet. Listing 1 shows the code.

The provided query set contained 7 topics in TREC format, each of which contains three parts: a title, a brief description, and a more detailed narrative on what type of tweets will be considered relevant to the topic. Listing 2 presents an example of a TREC format topic.

Listing 2: TREC Topic Example

Number: FMT1
<title> WHAT RESOURCES WERE AVAILABLE
Description: Identify the messages which describe the availability of some resources.
Narrative: A relevant message must mention the availability of some resource like food, drinking water, shelter, clothes, blankets, human resources like volunteers, resources to build or support infrastructure like tents, water filter, power supply and so on. Messages informing the availability of transport vehicles for assisting the resource distribution process would also be relevant. However, generalized statements without reference to any resource or messages asking for donation of money would not be relevant.

3.2 Experiments and Results
Based on our observation of the probable query-document mismatch problems arising from the short length of the tweets and the differing use of vocabulary in the topics and the tweets, we explore the use of WordNet (https://wordnet.princeton.edu/) to improve the reliability of query-document matching. We used WordNet to generate synonyms for the terms in each topic. Two experiments were conducted using WordNet. In these experiments, Lucene was used to index the tweet set and to carry out the IR. The indexing and retrieval setup comprised the following steps:

1. entries from a list of 655 stop words were removed;

2. the Porter stemmer was used for stemming the words;

3. the BM25 model was used for retrieval, with k1 = 1.2 and b = 0.75.
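The k1 and b values above are the common BM25 defaults. For illustration, a minimal sketch of one standard BM25 scoring formulation is given below; it is not our actual retrieval code, and Lucene's built-in BM25 implementation, which we used, may differ in details such as length normalisation.

import math

def bm25_score(query_terms, doc_terms, df, num_docs, avg_doc_len, k1=1.2, b=0.75):
    # df: document frequency of each term, num_docs: collection size,
    # avg_doc_len: average document length in tokens.
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0 or term not in df:
            continue
        idf = math.log(1.0 + (num_docs - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf + k1 * (1.0 - b + b * doc_len / float(avg_doc_len))
        score += idf * tf * (k1 + 1.0) / norm
    return score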
3.2.1 Query Expansion using WordNet
WordNet is an electronic lexical database and is regarded as one of the most important resources available to researchers in computational linguistics, text analysis, and many related areas. Its design is inspired by current psycholinguistic and computational theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexicalized concept. Different relations link the synonym sets [2].
WordNet has long been regarded as a potentially useful resource for query expansion in IR. However, it has met with limited success due to its tendency to include synonyms for query words that are contextually unrelated. One of the successful applications of WordNet in IR is found in [4], which uses the comprehensive WordNet thesaurus and its semantic relatedness measure modules to perform query expansion for a document retrieval task. The authors obtained a 7% improvement in retrieval effectiveness compared to the performance of using the original query for search. [3] combined terms obtained from three different resources, including WordNet, for use as expansion terms. Their method was tested on a TREC ad hoc test collection with impressive results.

In this experiment, we also use WordNet to carry out query expansion. WordNet is used as an external resource to generate the synonyms for each topic. We limited the number of synonyms for each topic term to a maximum of 20; some terms received fewer synonyms.
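A minimal sketch of this synonym lookup is shown below. It assumes the NLTK 3 interface to the WordNet database; the scripts we actually used may differ in detail.

from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus to be installed

def wordnet_synonyms(term, max_synonyms=20):
    # Collect unique lemma names from every synset of the term,
    # excluding the term itself, up to the chosen maximum.
    synonyms = []
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            name = lemma.name().replace('_', ' ').lower()
            if name != term.lower() and name not in synonyms:
                synonyms.append(name)
    return synonyms[:max_synonyms]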
3.2.2 Experiment One
Our first run (named dcu_fmt16_1) uses our automatic method based on WordNet query expansion. In this run, the following four steps were applied:

1. remove stop words from each topic;

2. use WordNet to generate the synonyms for each remaining term in every topic;

3. add these synonyms back to each topic as expansion terms;

4. use the expanded topics as new topics to search again (the BM25 retrieval model is used for retrieval).

We use the combination of the title and narrative fields of the topic as the original topic. An example of an original topic and its expanded version is shown in the Appendix.
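The sketch below summarises these four steps. The function and variable names are illustrative only (it reuses the wordnet_synonyms helper sketched above), and the retrieval itself was carried out against the Lucene BM25 index described earlier, not by this code.

def expand_topic(title, narrative, stopwords, max_synonyms=20):
    # Step 1: remove stop words from the combined title and narrative.
    terms = [t for t in (title + ' ' + narrative).lower().split()
             if t not in stopwords]
    # Step 2: generate WordNet synonyms for each remaining topic term.
    expansion = []
    for t in terms:
        expansion.extend(wordnet_synonyms(t, max_synonyms))
    # Step 3: append the synonyms to the original topic terms.
    return ' '.join(terms + expansion)

# Step 4: the expanded string is issued as the new query against the
# BM25-based Lucene index described in Section 3.2.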
3.2.3 Experiment Two
Our second run (named dcu_fmt16_2) is a semi-automatic run, meaning that manual selection is involved. This run was carried out using the following steps:

1. use the original topic to search and obtain a ranked list;

2. go through the top 30 tweets of the ranked list and select 1-2 relevant tweets to use for query expansion (the cutoff of 30 was chosen to ensure that at least one relevant tweet could be found for every topic);

3. remove the stop words and duplicate terms from the selected tweets, and add the remaining terms to the original topic;

4. apply WordNet to the expanded topics to find synonyms for these terms;

5. finally, add the synonyms to each expanded topic to generate the new topics, and use them as queries to search again and obtain the final results.
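An outline of this procedure is sketched below. Here searcher and manually_select_relevant are hypothetical placeholders (the tweet selection in step 2 was done by hand), and the sketch again reuses the wordnet_synonyms helper from above.

def semi_automatic_query(topic, searcher, stopwords, max_synonyms=20):
    # Step 1: retrieve an initial ranked list for the original topic.
    ranked = searcher.search(topic, top_k=30)
    # Step 2: a human assessor picks 1-2 relevant tweets from the top 30.
    selected = manually_select_relevant(ranked)
    # Step 3: add the non-stopword, non-duplicate terms of the selected
    # tweets to the original topic.
    feedback = []
    for tweet_text in selected:
        for t in tweet_text.lower().split():
            if t not in stopwords and t not in feedback:
                feedback.append(t)
    expanded = topic.lower().split() + feedback
    # Step 4: apply WordNet again to the expanded topic terms.
    synonyms = []
    for t in expanded:
        synonyms.extend(wordnet_synonyms(t, max_synonyms))
    # Step 5: the final query is the expanded topic plus its synonyms.
    return ' '.join(expanded + synonyms)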
3.2.4 Experimental Results
Since the aim of this track is to identify a set of tweets that are relevant to each topic, the evaluation metrics of precision, recall, and MAP are used for evaluation.
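For reference, writing R for the set of relevant tweets of a topic and rel(i) for the binary relevance of the tweet at rank i, the reported measures follow the standard definitions

\[
\mathrm{P@20} = \frac{1}{20}\sum_{i=1}^{20}\mathrm{rel}(i),
\qquad
\mathrm{AP@1000} = \frac{1}{|R|}\sum_{i=1}^{1000}\mathrm{rel}(i)\,\mathrm{P@}i,
\]

with MAP being the mean of AP over all topics; truncating a run at rank 1000 therefore forfeits any AP contribution from relevant tweets ranked below that cutoff.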
The gold standard, against which the sets of tweets identified by the participants are matched, was generated using a "manual run" in which human assessors were given the same set of tweets and topics and asked to identify all possible relevant tweets using a search engine (Indri). While judging the participants' runs, the track organizers arranged a second round of assessment to judge the relevance of tweets that were identified by the participants but not identified during the first round of human assessment.

Results of our two runs are shown in Table 1. The table shows results for four runs: two automatic and two semi-automatic. Among the automatic runs listed, our submission was placed third. The Precision@20 of the best automatic run is 0.4357, whereas ours is 0.3786. However, our automatic run achieved the best MAP@1000 value of 0.1103, an increase of approximately 27% relative to the best-ranked automatic run. Our overall MAP is lower because we submitted only the top 1000 tweets for each topic while other participants submitted more. We received first place for the semi-automatic method, where our Precision@20 is 33.35% higher than that of the second-placed run. These numbers show that using WordNet to generate synonyms for topic terms is a positive way to carry out query expansion for this Microblog task.
Table 1: Our Results and Comparison with Others

Run Type        Run Name                            Rank   Precision@20   Recall@1000   MAP@1000   MAP
Automatic run   iiest_saptarashmi_bandyopadhyay_1   1      0.4357         0.3420        0.0869     0.1125
Automatic run   dcu_fmt16_1                         3      0.3786         0.3578        0.1103     0.1103
Semi-auto run   dcu_fmt16_2                         1      0.4286         0.3445        0.0815     0.0815
Semi-auto run   iitbhu_fmt16_1                      2      0.3214         0.2581        0.0670     0.0827
4. CONCLUSIONS AND FURTHER WORK
For our submissions to the FIRE 2016 Microblog Track, we employed WordNet as an external resource to carry out query expansion, retrieving the synonyms of each topic term and using them as additional query terms to reformulate each topic. We conducted two runs using this method: an automatic run and a semi-automatic run. The semi-automatic run involved manual selection of relevant tweets from a first retrieval run and application of WordNet in a subsequent retrieval stage. Our automatic run received third place among the submissions, but achieved the best MAP value. Our semi-automatic run obtained the overall first place. These positive results show that when a topic is too general and does not contain the necessary terms to match relevant documents, using WordNet as an external resource to generate synonyms is a good way to make it more effective. Potentially, using WordNet to retrieve hypernyms or hyponyms for each topic term may be another method worth attempting for this task.

5. ACKNOWLEDGEMENT
This research is supported by Science Foundation Ireland in the ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University.

6. REFERENCES
[1] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.
[3] D. Pal, M. Mitra, and K. Datta. Improving Query Expansion Using WordNet. CoRR, abs/1309.4938, 2013.
[4] J. Zhang, B. Deng, and X. Li. Concept Based Query Expansion Using WordNet. In Proceedings of the 2009 International e-Conference on Advanced Science and Technology, AST '09, pages 52–55, Washington, DC, USA, 2009. IEEE Computer Society.
Appendix
Original topic:
Number: FMT6
WHAT WERE THE ACTIVITIES OF VARIOUS
NGOs / GOVERNMENT ORGANIZATIONS
Narrative: A relevant message must contain in-
formation about relief-related activities of different NGOs
and Government organizations in rescue and relief opera-
tion. Messages that contain information about the volun-
teers visiting different geographical locations would also be
relevant. However, messages that do not contain the name of
any NGO / Government organization would not be relevant.
Expanded topic:
Number: FMT6
were activities assorted respective several diverse ver-
satile various NGOs government organization organisation
arrangement system administration governance governing
body establishment brass constitution formation organiza-
tions a relevant message mustiness moldiness must incor-
porate comprise hold bear carry control hold in check curb
moderate take turn back arrest stop hold back contain infor-
mation about relief related activities different organization
organisation arrangement system administration governance
governing body establishment brass constitution formation
organizations indium atomic number four9 indiana hoosier
state inwards inward in deliverance delivery saving deliver
rescue relief operation. message content subject matter sub-
stance messages that incorporate comprise hold bear carry
control hold in check curb moderate take turn back arrest
stop hold back contain information about volunteers visit
see travel call in call inspect inflict bring down impose chew
fat shoot breeze chat confabulate confab chitchat chit chat
chatter chaffer natter gossip jaw claver visiting different ge-
ographical location placement locating position positioning
emplacement localization localisation fix locations would be-
sides too likewise well also relevant. message content subject
matter substance messages that do not incorporate com-
prise hold bear carry control hold in check curb moderate
take turn back arrest stop hold back contain name whatever
whatsoever any NGO government organization would not
relevant