Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters

Saptarshi Ghosh, Department of CST, IIEST Shibpur, India, sghosh@cs.iiests.ac.in
Kripabandhu Ghosh, Indian Statistical Institute, Kolkata, India, kripa.ghosh@gmail.com

ABSTRACT
The FIRE 2016 Microblog track focused on retrieval of microblogs (tweets posted on Twitter) during disaster events. A collection of about 50,000 microblogs posted during a recent disaster event was made available to the participants, along with a set of seven practical information needs during a disaster situation. The task was to retrieve microblogs relevant to these needs. 10 teams participated in the task, submitting a total of 15 runs. The task resulted in a comparison among the performances of various microblog retrieval strategies over a benchmark collection, and brought out the challenges in microblog retrieval.

CCS Concepts
• Information systems → Query reformulation;

Keywords
FIRE 2016; Microblog track; Microblog retrieval; Disaster

1. INTRODUCTION
Microblogging sites such as Twitter (https://twitter.com) have become important sources of situational information during disaster events, such as earthquakes, floods, and hurricanes [2, 11]. On such sites, a lot of content is posted during disaster events (in the order of thousands to millions of tweets), and the important situational information is usually immersed in large amounts of general conversational content, e.g., sympathy for the victims of the disaster. Hence, automated IR techniques are needed to retrieve specific types of situational information from the large amount of text.

There have been a few prior attempts to develop IR techniques over microblogs posted during disasters, but there has been little effort till now to develop a benchmark dataset / test collection using which various microblog retrieval methodologies can be compared and evaluated. The objectives of the FIRE 2016 Microblog track are two-fold: (i) to develop a test collection of microblogs posted during a disaster situation, which can serve as a benchmark dataset for evaluation of microblog retrieval methodologies, and (ii) to evaluate and compare the performance of various IR methodologies over the test collection. The track is inspired by the TREC Microblog Track [4], which aims to evaluate microblog retrieval strategies in general. In contrast, the FIRE 2016 Microblog Track focuses on microblog retrieval in a disaster situation.

In this track, a collection of about 50,000 microblogs posted during a recent disaster event was made available to the participants, along with a set of seven practical information needs that are faced in a disaster situation by the agencies responding to the disaster. Details of the collection are discussed in Section 2. The task was to retrieve microblogs relevant to the information needs (see Section 3). 10 teams participated in the track, submitting a total of 15 runs, which are described in Section 4. The runs were evaluated against a gold standard developed by human assessors, using standard measures like Precision, Recall, and MAP.

2. THE TEST COLLECTION
In this section, we describe how the test collection for the Microblog track was developed. Following the Cranfield style [1], we describe the creation of the topics (information needs), the collection of the document set (here, microblogs or tweets), and the relevance assessment performed to prepare the gold standard necessary for evaluation of IR methodologies.

2.1 Topics for retrieval
In this track, our objective was to develop a test collection to evaluate IR methodologies for extracting information (from microblogs) that can potentially help responding agencies during a disaster situation such as an earthquake or a flood. To this end, we consulted members of some NGOs who regularly work in disaster-affected regions – such as Doctors For You (http://doctorsforyou.org/) and SPADE (http://www.spadeindia.org/) – to know the typical information requirements during a disaster relief operation. They identified certain information needs, such as what resources are required / available (especially medical resources), what infrastructure damages are being reported, the situation at specific geographical locations, the ongoing activities of various NGOs and government agencies (so that the operations of various responding agencies can be coordinated), and so on. Based on their feedback, we identified seven topics on which information needs to be retrieved during a disaster.
Table 1 states the seven topics which we have developed as a part of the test collection. These topics are written in the format conventionally used for TREC topics (see trec.nist.gov/pubs/trec6/papers/overview.ps.gz). Each topic contains an identifying number (num), a textual representation of the information need (title), a brief description (desc) of the same, and a more detailed narrative (narr) explaining what type of documents (tweets) will be considered relevant to the topic, and what type of tweets would not be considered relevant.

<num> Number: FMT1
<title> What resources were available
<desc> Identify the messages which describe the availability of some resources.
<narr> A relevant message must mention the availability of some resource like food, drinking water, shelter, clothes, blankets, human resources like volunteers, or resources to build or support infrastructure, like tents, water filters, power supply and so on. Messages informing about the availability of transport vehicles for assisting the resource distribution process would also be relevant. However, generalized statements without reference to any resource, or messages asking for donation of money, would not be relevant.

<num> Number: FMT2
<title> What resources were required
<desc> Identify the messages which describe the requirement or need of some resources.
<narr> A relevant message must mention the requirement / need of some resource like food, water, shelter, clothes, blankets, human resources like volunteers, or resources to build or support infrastructure like tents, water filters, power supply, and so on. A message informing about the requirement of transport vehicles for assisting the resource distribution process would also be relevant. However, generalized statements without reference to any particular resource, or messages asking for donation of money, would not be relevant.
<num> Number: FMT3
<title> What medical resources were available
<desc> Identify the messages which give some information about the availability of medicines and other medical resources.
<narr> A relevant message must mention the availability of some medical resource like medicines, medical equipment, blood, supplementary food items (e.g., milk for infants), human resources like doctors/staff, or resources to build or support medical infrastructure like tents, water filters, power supply, ambulances, etc. Generalized statements without reference to medical resources would not be relevant.

<num> Number: FMT4
<title> What medical resources were required
<desc> Identify the messages which describe the requirement of some medicine or other medical resources.
<narr> A relevant message must mention the requirement of some medical resource like medicines, medical equipment, supplementary food items, blood, human resources like doctors/staff, or resources to build or support medical infrastructure like tents, water filters, power supply, ambulances, etc. Generalized statements without reference to medical resources would not be relevant.

<num> Number: FMT5
<title> What were the requirements / availability of resources at specific locations
<desc> Identify the messages which describe the requirement or availability of resources at some particular geographical location.
<narr> A relevant message must mention both the requirement or availability of some resource (e.g., human resources like volunteers/medical staff, food, water, shelter, medical resources, tents, power supply) as well as a particular geographical location. Messages containing only the requirement / availability of some resource, without mentioning a geographical location, would not be relevant.

<num> Number: FMT6
<title> What were the activities of various NGOs / Government organizations
<desc> Identify the messages which describe on-ground activities of different NGOs and Government organizations.
<narr> A relevant message must contain information about the relief-related activities of different NGOs and Government organizations in the rescue and relief operation. Messages that contain information about volunteers visiting different geographical locations would also be relevant. However, messages that do not contain the name of any NGO / Government organization would not be relevant.

<num> Number: FMT7
<title> What infrastructure damage and restoration were being reported
<desc> Identify the messages which contain information related to infrastructure damage or restoration.
<narr> A relevant message must mention the damage or restoration of some specific infrastructure resources, such as structures (e.g., dams, houses, mobile towers), communication infrastructure (e.g., roads, runways, railways), electricity, mobile or Internet connectivity, etc. Generalized statements without reference to infrastructure resources would not be relevant.

Table 1: The seven topics (information needs) used in the track. Each topic is written following the format conventionally used in TREC tracks (containing a number, title, description and narrative). The task is to retrieve microblogs relevant to these topics.

2.2 Tweet dataset
We collected a large set of tweets related to the devastating earthquake that occurred in Nepal and parts of India on 25th April 2015 (see https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake). Using the Twitter Search API [10] with the keyword 'nepal', we collected tweets posted during the two weeks following the earthquake. We collected only tweets in English (based on language identification by Twitter itself), obtaining about 100K tweets in total.

Tweets often contain duplicates and near-duplicates, since the same information is frequently retweeted / re-posted by multiple users [9]. However, duplicates are not desirable in a test collection for IR, since the presence of duplicates can result in over-estimation of the performance of an IR methodology. Additionally, the presence of duplicate documents also creates information overload for the human annotators while developing the gold standard [3]. Hence, we removed duplicate and near-duplicate tweets using a simplified version of the methodologies discussed in [9], as follows. Each tweet was considered as a bag of words (excluding standard English stopwords and URLs), and the similarity between two tweets was measured as the Jaccard similarity between the two corresponding bags (sets) of words. If the Jaccard similarity between two tweets was found to be higher than a threshold value (0.7), the two tweets were considered near-duplicates, and only the longer tweet (potentially more informative) was retained in the collection. After removing duplicates and near-duplicates, we obtained a set of 50,068 tweets, which was used as the test collection for the track.
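For concreteness, the de-duplication step can be sketched as follows. This is a minimal illustration of the procedure described above (bag-of-words Jaccard similarity with a 0.7 threshold, keeping the longer tweet), not the exact implementation used for the track; the tokenizer and the stopword list are assumptions, and the pairwise comparison shown here is far less scalable than the techniques of [9].

    import re

    STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for", "on"}  # assumed list

    def bag_of_words(tweet_text):
        # Lowercase, drop URLs and stopwords, and return the set of remaining terms.
        text = re.sub(r"http\S+", " ", tweet_text.lower())
        tokens = re.findall(r"[a-z0-9#@']+", text)
        return {t for t in tokens if t not in STOPWORDS}

    def jaccard(a, b):
        # Jaccard similarity between two sets of words.
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def deduplicate(tweets, threshold=0.7):
        # Keep only one tweet (the longer one) from each group of near-duplicates.
        kept = []  # list of (text, bag) pairs retained so far
        for text in tweets:
            bag = bag_of_words(text)
            match = next((i for i, (_, kb) in enumerate(kept) if jaccard(bag, kb) > threshold), None)
            if match is None:
                kept.append((text, bag))
            elif len(text) > len(kept[match][0]):
                kept[match] = (text, bag)  # retain the longer, potentially more informative tweet
        return [text for text, _ in kept]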
2.3 Developing gold standard for retrieval
Evaluation of any IR methodology requires a gold standard containing the documents that are actually relevant to the topics. As is the standard procedure, we used human annotators to develop this gold standard. A set of three human annotators was used, each of whom is proficient in English, is a regular user of Twitter, and has prior experience of working with social media content posted during disasters. The development of the gold standard involved three phases.

Phase 1: Each annotator was given the set of 50,068 tweets, and the seven topics (in TREC format, as stated in Table 1). Each annotator was asked to identify all tweets relevant to each topic, independently, i.e., without consulting the other annotators. To help the annotators, the tweets were indexed using the Indri IR system [8], which helped the annotators to search for tweets containing specific terms. For each topic, the annotators were asked to think of appropriate search-terms, retrieve tweets containing those search terms (using Indri), and to judge the relevance of the retrieved tweets. After the first phase, we observed that the sets of tweets identified as relevant to the same topic by different annotators were considerably different. This difference arose because different annotators used different search-terms to retrieve tweets. (Since the different annotators retrieved and judged very different sets of tweets, it is not meaningful to report inter-annotator agreement for this phase.) Hence, we conducted a second phase.

Phase 2: In this phase, for a particular topic, all tweets that were judged relevant by at least one annotator (in the first phase) were considered. The decision whether a tweet is relevant to a topic was finalised through discussion among all the annotators and mutual agreement.

Phase 3: The third phase used standard pooling [7] (as commonly done in TREC tracks) – the top 30 results of all the submitted runs were pooled (separately for each topic) and judged by the annotators. In this phase, all annotators were judging a common set of tweets, hence inter-annotator agreement could be measured. There was agreement among all annotators for over 90% of the tweets; for the rest, the relevance was decided through discussion among all the annotators and mutual agreement.

The final gold standard contains the following number of tweets judged relevant to the seven topics – FMT1: 589, FMT2: 301, FMT3: 334, FMT4: 112, FMT5: 189, FMT6: 378, FMT7: 254.
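As an aside, the depth-30 pooling used in Phase 3 is straightforward to implement; a minimal sketch is given below. It assumes the submitted runs are stored as files in the common TREC run format (topic-id, 'Q0', tweet-id, rank, score, run-tag); this file layout is an assumption made for illustration, not the track's actual submission interface.

    from collections import defaultdict

    def read_run(path):
        # Parse one run file in TREC format: "topic_id Q0 tweet_id rank score run_tag" per line.
        run = defaultdict(list)  # topic_id -> list of (rank, tweet_id)
        with open(path) as f:
            for line in f:
                topic_id, _, tweet_id, rank, _score, _tag = line.split()
                run[topic_id].append((int(rank), tweet_id))
        return run

    def build_pool(run_paths, depth=30):
        # Pool the top-`depth` tweets of every run, separately for each topic.
        pool = defaultdict(set)  # topic_id -> set of tweet_ids to be judged by the annotators
        for path in run_paths:
            for topic_id, ranked in read_run(path).items():
                for _, tweet_id in sorted(ranked)[:depth]:
                    pool[topic_id].add(tweet_id)
        return pool

    # Hypothetical usage: build_pool(["run1.txt", "run2.txt"], depth=30)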
2.4 Insights from the gold standard development process
Through the process described above, we understood that for any of the topics, there are several tweets which are definitely relevant to the topic, but which were difficult to retrieve even for human annotators. This is evident from the fact that many of the relevant tweets could initially be retrieved by only one out of the three annotators (in the first phase), but when these tweets were shown to the other annotators (in the second phase), they unanimously agreed that the tweets were relevant. These observations highlight the challenges in microblog retrieval.

Note that our approach for developing the gold standard is different from that used in TREC tracks, where the gold standard is usually developed by pooling a few top-ranked documents retrieved by the different submitted systems, and then annotating these top-ranked documents [7]. In other words, only the third phase (as described above) is applied in TREC tracks. Given that it is challenging to identify many of the tweets relevant to a topic (as discussed above), annotating only a relatively small pool of documents retrieved by IR methodologies has the potential risk of missing many of the relevant documents which are more difficult to retrieve. We believe that our approach, where the annotators viewed the entire dataset instead of a relatively small pool, is likely to be more robust, and is expected to have resulted in the development of a more complete gold standard which is independent of the performance of any IR methodology.

3. DESCRIPTION OF THE TASK
The participants were given the tweet collection and the seven topics described earlier. It can be noted that the Twitter terms and conditions prohibit direct public sharing of tweets. Hence, only the tweet-ids of the tweets (Twitter assigns a unique numeric id to each tweet, called the tweet-id) were distributed among the participants, along with a Python script using which the tweets can be downloaded via the Twitter API.

The participants were invited to develop IR methodologies for retrieving tweets relevant to the seven topics, and were asked to submit a ranked list of tweets that they judge relevant to each topic. The ranked list was evaluated against the gold standard (developed as described earlier) using the following measures: (i) Precision at 20 (Prec@20), i.e., what fraction of the top-ranked 20 results are actually relevant according to the gold standard; (ii) Recall at 1000 (Recall@1000), i.e., what fraction of all tweets relevant to a topic (as identified in the gold standard) is present among the top-ranked 1000 results; (iii) Mean Average Precision at 1000 (MAP@1000); and (iv) overall MAP, considering the full retrieved ranked list. Out of these, we report only the Prec@20 and MAP measures (in the next section).

The track invited three types of methodologies: (i) Automatic, where both query formulation and retrieval are automated; (ii) Semi-automatic, where manual intervention is involved in the query formulation stage (but not in the retrieval stage); and (iii) Manual, where manual intervention is involved in both the query formulation and retrieval stages. 15 runs were submitted by the participants, out of which one run was fully automatic, while the others were semi-automatic. The methodologies are summarized and compared in the next section.
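The measures above are standard; the following sketch shows how Prec@20 and (mean) average precision could be computed from a ranked list of tweet-ids and the gold standard. In practice a standard tool such as trec_eval would normally be used, so this sketch is only illustrative.

    def precision_at_k(ranked_ids, relevant_ids, k=20):
        # Fraction of the top-k retrieved tweets that are in the gold standard.
        return sum(1 for t in ranked_ids[:k] if t in relevant_ids) / k

    def average_precision(ranked_ids, relevant_ids, cutoff=None):
        # Average of the precision values at the ranks where relevant tweets are retrieved.
        if cutoff is not None:
            ranked_ids = ranked_ids[:cutoff]
        hits, precision_sum = 0, 0.0
        for rank, tweet_id in enumerate(ranked_ids, start=1):
            if tweet_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0

    def mean_average_precision(run_by_topic, qrels_by_topic, cutoff=None):
        # MAP over all topics; cutoff=1000 gives MAP@1000, cutoff=None the overall MAP.
        topics = list(qrels_by_topic)
        return sum(average_precision(run_by_topic[t], qrels_by_topic[t], cutoff)
                   for t in topics) / len(topics)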
4. METHODOLOGIES
Ten teams participated in the FIRE 2016 Microblog track. A summary of the methodologies used by each team is given in the next sub-section. Table 2 shows the evaluation performance of each submitted run, along with a brief summary. For each type, the runs are arranged in decreasing order of the primary measure, i.e., Precision@20. In case of a tie, the arrangement is done in decreasing order of MAP.

Run Id | Precision@20 | MAP | Type | Method summary
dcu_fmt16_1 | 0.3786 | 0.1103 | Automatic | WordNet, query expansion
iiest_saptarashmi_bandyopadhyay_1 | 0.4357 | 0.1125 | Semi-automatic | Correlation, NER, Word2Vec
JU_NLP_1 | 0.4357 | 0.1079 | Semi-automatic | WordNet, query expansion, NER, GloVe
dcu_fmt16_2 | 0.4286 | 0.0815 | Semi-automatic | WordNet, query expansion, relevance feedback
JU_NLP_2 | 0.3714 | 0.0881 | Semi-automatic | WordNet, query expansion, NER, GloVe, word bags split
JU_NLP_3 | 0.3714 | 0.0881 | Semi-automatic | WordNet, query expansion, NER, GloVe, word bags split
iitbhu_fmt16_1 | 0.3214 | 0.0827 | Semi-automatic | Lucene default model
relevancer_ru_nl | 0.3143 | 0.0406 | Semi-automatic | Relevancer system, clustering, manual labelling, Naive Bayes classification
daiict_irlab_1 | 0.3143 | 0.0275 | Semi-automatic | Word2vec, query expansion, equal term weights
daiict_irlab_2 | 0.3000 | 0.0250 | Semi-automatic | Word2vec, query expansion, unequal term weights, WordNet
trish_iiest_ss | 0.0929 | 0.0203 | Semi-automatic | Word-overlap, POS tagging
trish_iiest_ws | 0.0786 | 0.0099 | Semi-automatic | WordNet, wup score, POS tagging
nita_nitmz_1 | 0.0583 | 0.0031 | Semi-automatic | Apache Nutch 0.9, query segmentation, result merging
Helpingtech_1 (on 5 topics) | 0.7700 | 0.2208 | Semi-automatic | Entity and action verb relationships, temporal importance
GANJI_1, GANJI_2, GANJI_3 (combined; on 3 topics) | 0.8500 | 0.2420 | Semi-automatic | Keyword extraction, POS tagging, Word2Vec, WordNet, Terrier retrieval, SVM classification

Table 2: Comparison among all the submitted runs. Runs which attempted retrieval for only a subset of the topics are listed separately at the end of the table.
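Several of the runs in Table 2 rely on WordNet-based synonym expansion of the topic terms (detailed in Section 4.1 below). A minimal sketch of this general technique using NLTK's WordNet interface is shown here; it illustrates the idea only and is not a reproduction of any team's pipeline, and the stopword handling and the cap on synonyms per word are assumptions.

    # Assumes: pip install nltk, plus nltk.download('wordnet') and nltk.download('stopwords')
    from nltk.corpus import stopwords, wordnet

    def expand_query(topic_text, max_synonyms_per_word=3):
        # Build an expanded query by adding WordNet synonyms of each (non-stopword) topic word.
        stop = set(stopwords.words("english"))
        terms = [w.lower() for w in topic_text.split() if w.lower() not in stop]
        expanded = list(terms)
        for term in terms:
            synonyms = []
            for synset in wordnet.synsets(term):
                for lemma in synset.lemma_names():
                    candidate = lemma.replace("_", " ").lower()
                    if candidate != term and candidate not in synonyms:
                        synonyms.append(candidate)
            expanded.extend(synonyms[:max_synonyms_per_word])
        return expanded

    # e.g. expand_query("What medical resources were available") returns the topic
    # words plus a few WordNet synonyms of 'medical', 'resources' and 'available'.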
4.1 Method summary
We now summarize the methodologies adopted in the submitted runs.

• dcu_fmt16: This team participated from the ADAPT Centre, School of Computing, Dublin City University, Ireland. It used WordNet (https://wordnet.princeton.edu/) to perform synonym-based query expansion, and submitted the following two runs:

  1. dcu_fmt16_1: This is an Automatic run (i.e., no manual step involved). First, the words in <title> and <narr> were considered, from which the stopwords were removed; thus the initial query was formed. Then, for each word in the query, synonyms were added using WordNet, resulting in the expanded query. Retrieval was done with this expanded query using the BM25 model [6].

  2. dcu_fmt16_2: This is a Semi-automatic run (i.e., a manual step was involved). First, an initial ranked list was generated using the original topic. From the top 30 tweets, 1-2 relevant tweets were manually identified, and query expansion was done from these relevant tweets. The expanded query was further expanded using WordNet, just as done for dcu_fmt16_1. This final expanded query was used for retrieval.

• iiest_saptarashmi_bandyopadhyay: This team participated from the Indian Institute of Engineering Science and Technology, Shibpur, India. It submitted one Semi-automatic run, described below:

  – iiest_saptarashmi_bandyopadhyay_1: The correlation between the topic words and the tweet was calculated, and this value determined the relevance score for a given topic-tweet pair. The Stanford NER tagger (nlp.stanford.edu/software/Stanford-ner-2015-04-20.zip) was used to identify the LOCATION, ORGANIZATION and PERSON names in the tweets. For each topic, some keywords were manually selected, on which a number of tools (e.g., PyDictionary, the NodeBox toolkit, etc.) were used to find the corresponding synonyms, inflectional variants, etc. The bag of words for each topic was further converted into a vector using the Word2Vec package (https://deeplearning4j.org/word2vec). Finally, the relevance score was calculated from the correlation between the vector representations of the topic word bags and the tweet text.

• JU_NLP: This team participated from Jadavpur University, India. It submitted three Semi-automatic runs, described below:

  1. JU_NLP_1: This run was generated by using word embeddings. For each topic, relevant words were manually chosen and expanded using the synonyms obtained from the NLTK WordNet toolkit. In addition, past, past participle and present continuous forms of verbs were obtained using the NodeBox library for Python. For the topics FMT5 and FMT6, location and organization information was extracted using the Stanford NER tagger. A GloVe [5] model was trained on the tweet collection. A tweet vector, as well as a query vector, was formed by taking the normalized sum of the GloVe vectors of the constituent words. Then, for each query-tweet pair, the similarity score was calculated as the cosine similarity of the corresponding vectors (a sketch of this kind of embedding-based scoring is given after this list of methods).

  2. JU_NLP_2: This run is similar to JU_NLP_1, except that here the word bags were split categorically, and the average similarity between the tweet vector and the split topic vectors was calculated.

  3. JU_NLP_3: This run is identical to JU_NLP_2.

• iitbhu_fmt16: This team participated from the Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi, India. It submitted one Semi-automatic run, described as follows:

  – iitbhu_fmt16_1: The Lucene (https://lucene.apache.org/) default similarity model, which combines the Vector Space Model (VSM) and probabilistic models (e.g., BM25), was used to generate the run. Lucene's StandardAnalyzer was used, which handles names and email addresses, lowercases each token, and removes stopwords and punctuation. The query formulation stage involved manual intervention.

• daiict_irlab: This team participated from DAIICT, Gandhinagar, India and LDRP, Gandhinagar, India. It submitted two Semi-automatic runs, described as follows:

  1. daiict_irlab_1: This run was generated using query expansion, where the 5 most similar words and hashtags from a Word2vec model, trained on the tweet corpus, were added to the original query. Equal weight was assigned to each term.

  2. daiict_irlab_2: This run was generated in the same way as daiict_irlab_1, except that different weights were assigned to the expanded terms than to the original terms. Higher weights were assigned to words like 'required' and 'available'. These terms were also expanded using WordNet.
• trish_iiest: This team participated from the Indian Institute of Engineering Science and Technology, Shibpur, India. It submitted two Semi-automatic runs, described below:

  1. trish_iiest_ss: The similarity score between a query and a tweet is the word-overlap between them, normalized by the query length. For each topic, the nouns, identified by the Stanford Part-Of-Speech Tagger, were selected to form the query. In addition, more weight was assigned to words like 'availability' or 'requirement'.

  2. trish_iiest_ws: For this run, the wup (Wu-Palmer) similarity score (http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm) was calculated on the synsets of each term obtained from WordNet.

• nita_nitmz: This team participated from the National Institute of Technology, Agartala, India and the National Institute of Technology, Mizoram. It submitted one Semi-automatic run, described below:

  – nita_nitmz_1: This run was generated on Apache Nutch 0.9. Search was done using different combinations of the words present in the query, and the results obtained from the different combinations were merged.

• Helpingtech: This team participated from the Indian Institute of Technology, Patna, Bihar, India and submitted the following Semi-automatic run (on 5 topics only):

  – Helpingtech_1: For each query, relationships between entities and action verbs were defined through manual inspection. The ranking score was calculated on the basis of the presence of these pre-defined relationships in the tweet for a given query. More importance was given to a tweet which indicated immediate action than to one which indicated a proposed action for the future.

• GANJI: This team participated from Évora University, Portugal. It submitted three retrieval results (GANJI_1, GANJI_2, GANJI_3) for the first three topics only, using a Semi-automatic methodology, described below:

  – GANJI_1, GANJI_2, GANJI_3 (combined): First, keyword extraction was done using a part-of-speech tagger, Word2Vec (to obtain the nouns) and WordNet (to obtain the verbs). Then, retrieval was performed on Terrier (http://terrier.org) using the BM25 model. Finally, an SVM classifier was used to classify the retrieved tweets into 'available', 'required' and 'other' classes.

• relevancer_ru_nl: This team participated from Radboud University, the Netherlands and submitted the following Semi-automatic run:

  – relevancer_ru_nl: This run was produced using the Relevancer tool. After a pre-processing step, the tweet collection was clustered to identify coherent clusters. Each such cluster was manually labelled by experts as relevant or non-relevant. This training data was used for Naive Bayes based classification. For each topic, the test tweets predicted as relevant by the classifier were submitted.
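As referenced in the JU_NLP_1 description above, several runs score a tweet by the cosine similarity between a query vector and a tweet vector, each formed as the normalized sum of the word embeddings of the constituent terms. A minimal sketch of this kind of scoring is given below; the embedding lookup is left abstract (a GloVe or Word2Vec model trained on the tweet collection would supply it), and nothing here reproduces any team's exact code.

    import numpy as np

    def text_vector(tokens, embeddings, dim):
        # Normalized sum of the word vectors of the tokens that have an embedding.
        # `embeddings` is assumed to be a dict mapping word -> numpy array of length `dim`.
        vec = np.zeros(dim)
        for tok in tokens:
            if tok in embeddings:
                vec += embeddings[tok]
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def cosine(u, v):
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        return float(u @ v / (nu * nv)) if nu > 0 and nv > 0 else 0.0

    def rank_tweets(query_tokens, tweets_tokens, embeddings, dim=100):
        # Rank tweets (given as lists of tokens) by cosine similarity to the query vector.
        q_vec = text_vector(query_tokens, embeddings, dim)
        scored = [(cosine(q_vec, text_vector(toks, embeddings, dim)), idx)
                  for idx, toks in enumerate(tweets_tokens)]
        return sorted(scored, reverse=True)  # highest-scoring tweets first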
5. CONCLUSION AND FUTURE DIRECTIONS
The FIRE 2016 Microblog track successfully created a benchmark collection of microblogs posted during disaster events, and compared the performance of various IR methodologies over the collection.

In subsequent years, we hope to conduct extended versions of the Microblog track, where the following extensions can be considered:

• Instead of just considering binary relevance (where a tweet is either relevant to a topic or not), graded relevance can be considered, e.g., based on factors like how important or actionable the information contained in the tweet is, how useful the tweet is likely to be to the agencies responding to the disaster, and so on.

• The challenge in this year's track considered a static set of microblogs. But in reality, microblogs are obtained as a continuous stream. The challenge can be extended to retrieve relevant microblogs dynamically, e.g., as and when they are posted.

It can be noted that even the best performing method submitted in the track achieved a relatively low MAP score of 0.24 (considering only three topics), which highlights the difficulty and challenges of microblog retrieval during a disaster situation. We hope that the test collection developed in this track will help the development of better models for microblog retrieval in future.

Acknowledgements
The track organizers thank all the participants for their interest in this track. We also acknowledge our assessors, notably Moumita Basu and Somenath Das, for their help in developing the gold standard for the test collection. We also thank the FIRE 2016 organizers for their support in organizing the track.

6. REFERENCES
[1] C. Cleverdon. The Cranfield tests on index language devices. In K. Sparck Jones and P. Willett, editors, Readings in Information Retrieval, pages 47-59. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[2] M. Imran, C. Castillo, F. Diaz, and S. Vieweg. Processing Social Media Messages in Mass Emergency: A Survey. ACM Computing Surveys, 47(4):67:1-67:38, June 2015.
[3] J. Lin, M. Efron, Y. Wang, G. Sherman, and E. Voorhees. Overview of the TREC-2015 Microblog Track. Available at: https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_TREC2015.pdf, 2015.
[4] I. Ounis, C. Macdonald, J. Lin, and I. Soboroff. Overview of the TREC-2011 Microblog Track. Available at: http://trec.nist.gov/pubs/trec20/papers/MICROBLOG.OVERVIEW.pdf, 2011.
[5] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Proc. Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
[6] S. E. Robertson and H. Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4):333-389, 2009.
[7] K. Sparck Jones and C. van Rijsbergen. Report on the need for and provision of an ideal information retrieval test collection. Technical Report 5266, Computer Laboratory, University of Cambridge, UK, 1975.
[8] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language model-based search engine for complex queries. In Proc. ICIA, 2004. Available at: http://www.lemurproject.org/indri/.
[9] K. Tao, F. Abel, C. Hauff, G.-J. Houben, and U. Gadiraju. Groundhog Day: Near-duplicate Detection on Twitter. In Proc. World Wide Web (WWW), 2013.
[10] Twitter Search API. https://dev.twitter.com/rest/public/search.
[11] S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging During Two Natural Hazards Events: What Twitter May Contribute to Situational Awareness. In Proc. ACM SIGCHI, 2010.