IIT BHU at FIRE 2018 IRMiDis Track - Obtaining Factual Tweets During Natural Disasters

Harshit Mehrotra and Sukomal Pal

Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi - 221005
harshit.mehrotra.cse15@iitbhu.ac.in

Abstract. This paper presents the work done by the team of IIT (BHU) Varanasi for the IRMiDis track at FIRE 2018. The task involved classifying tweets posted during a disaster into those that are factual or fact-checkable and those that are not, and matching the fact-checkable tweets to relevant news articles. The methodologies were developed in the context of the 2015 Nepal earthquake.

Keywords: Information retrieval, microblogs, disaster, word embeddings

1 Introduction - Tasks and Data

With the increasing use of social media, the domains of its impact are also changing rapidly. In the recent past, people and media houses have turned to social media platforms such as Twitter and Facebook to post sentiments, information, needs, resource availability, news updates, etc. These posts can be a very useful source of relief-relevant information. However, much of the information in the stream may be useless, overstated, or may even contain rumors. The IRMiDis track at FIRE 2018 [1] posed the following tasks in this context:

- Identifying factual or fact-checkable tweets: Developing methodologies to segregate fact-checkable tweets from the huge stream of Twitter microblogs in order to help relief and rehabilitation operations. Around 80 sample fact-checkable tweets are provided to develop the methodology, which is later evaluated on around 50,000 test tweets.

- Identification of supporting news articles for fact-checkable tweets: A fact-checkable tweet is said to be supported/verified by a news article if the same fact is reported by both the media and the tweet. Each fact-checkable tweet has to be matched with its relevant news article(s) in a collection of nearly 6,000 articles. The line in the article indicating the relevance also has to be identified.

We submitted one run, in which the methodology for the first sub-task was fully automatic and that for the second was semi-automatic.

2 Methodology

2.1 Sub-Task 1

The methodology for the first sub-task, i.e. identification of fact-checkable tweets, is fully automatic in both query generation and searching. The key steps are as follows (a sketch of the pipeline is given after the list):

1. Pre-process all tweets by lower-casing, removing stopwords, hashtags and @-mentions, and finally stemming with the Porter stemmer. The term tweet hereafter refers to the pre-processed tweet.
2. Create a TF-IDF-based ranked list of the terms in the reference set of 84 tweets. Only those terms are considered that occur in the reference set more than once. We call this set of terms R, with T being the TF-IDF score function.
3. Train a word2vec word embedding model on the entire set of 50,000 tweets.
4. Attribute to each test tweet a feature vector formed as the arithmetic mean of the embeddings of its individual terms.
5. Form the reference feature vector V, against which the test tweets will be matched, as the weighted mean

   V = \frac{\sum_{i=1}^{|R|} T(R_i) E(R_i)}{\sum_{i=1}^{|R|} T(R_i)}   (1)

   where E is the embedding function.
6. Evaluate each tweet for its cosine similarity (= 1 - cosine distance) with V. The similarity is normalized by dividing by the maximum similarity value obtained.
7. Among the tweets with the highest scores, the negative (non-factual) ones now have to be separated out. For this, two word sets are formed:
   (a) The first word set P consists of the terms in the reference dataset of 84 tweets that occur in the dataset more than once.
   (b) The second word set N is prepared as follows. The tweets with similarity less than 0.80 are taken and their terms are arranged in decreasing order of their frequency within this subset. The top 500 words in this arrangement comprise N.
8. The value 0.80 is chosen by observing the minimum similarity value of a tweet in the reference dataset.
9. Since tweets with similarity less than 0.80 were used to select the negative terms, the tweets with similarity greater than or equal to 0.80 are now tested against P and N. If no term of N and more than one term of P are present in a tweet, it is classified as positive (factual).
10. The similarity scores are normalized to the range (0,1] to give the factuality scores.
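A minimal sketch of this pipeline is given below. Several points are assumptions, not details from the run: gensim's Word2Vec and scikit-learn's TfidfVectorizer stand in for the unnamed embedding and TF-IDF implementations, T(R_i) is approximated by a term's TF-IDF score aggregated over the reference tweets, the loader and file names are hypothetical, and tweets are assumed to be already pre-processed as in step 1.

import numpy as np
from collections import Counter
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

def load_tweets(path):
    # Hypothetical helper: one pre-processed tweet per line.
    with open(path) as f:
        return [line.strip() for line in f]

def reference_vector(reference_tweets, model):
    # Weighted mean of term embeddings (Eq. 1); T(R_i) is approximated by a
    # term's TF-IDF score aggregated over the 84 reference tweets.
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(reference_tweets)
    scores = np.asarray(tfidf.sum(axis=0)).ravel()
    counts = Counter(t for tw in reference_tweets for t in tw.split())
    num, den = np.zeros(model.vector_size), 0.0
    for term, score in zip(vec.get_feature_names_out(), scores):
        if counts[term] > 1 and term in model.wv:   # keep terms occurring > once
            num += score * model.wv[term]           # T(R_i) * E(R_i)
            den += score                            # T(R_i)
    return num / den

def tweet_vector(tweet, model):
    # Arithmetic mean of the embeddings of a tweet's terms (step 4).
    vecs = [model.wv[t] for t in tweet.split() if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

test_tweets = load_tweets("test_tweets.txt")             # ~50,000 tweets
reference_tweets = load_tweets("reference_tweets.txt")   # 84 tweets

# Step 3: train word2vec on the full test collection (hyperparameters illustrative).
model = Word2Vec([t.split() for t in test_tweets], vector_size=100, min_count=1)
V = reference_vector(reference_tweets, model)

# Step 6: cosine similarity with V, normalized by the maximum value.
sims = [cosine(tweet_vector(t, model), V) for t in test_tweets]
m = max(sims)
sims = [s / m for s in sims]

# Step 7: word sets P (frequent reference terms) and N (top 500 terms of
# the tweets with similarity below 0.80).
ref_counts = Counter(t for tw in reference_tweets for t in tw.split())
P = {t for t, c in ref_counts.items() if c > 1}
neg_counts = Counter(t for tw, s in zip(test_tweets, sims) if s < 0.80
                     for t in tw.split())
N = {t for t, _ in neg_counts.most_common(500)}

# Step 9: keep high-similarity tweets with no N-term and more than one P-term.
factual = [tw for tw, s in zip(test_tweets, sims)
           if s >= 0.80
           and not set(tw.split()) & N
           and len(set(tw.split()) & P) > 1]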
2.2 Sub-Task 2

The methodology for the second sub-task is manual in query generation and automatic in searching and scoring, using the Java-based text search library Lucene. The constituent steps are as follows; an illustrative sketch of the matching step is given at the end of the paper.

1. The news articles are pre-processed in the same way as the tweets in Sub-Task 1.
2. The headline and the first 3 sentences of each news article are combined. This creates one test document per news article.
3. Each pre-processed tweet is then used as a query to match against the test documents of the news articles. This is done using Lucene, and the score of the best-matching document is recorded for each tweet.
4. If this score is more than 0.30, the corresponding news article is said to match the tweet; otherwise, no relevant news article is said to be found for the tweet.
5. To find the matching sentence, the tweet is used as a query against each sentence of the relevant news article. The sentence with the highest score is returned as the answer.

3 Results

The results on the two sub-tasks, based on different metrics, are given in Tables 1 and 2.

Table 1. Results on Sub-Task 1

Rank | Run Type  | Precision@100 | Recall@100 | MAP@100 | MAP Overall | NDCG@100 | NDCG Overall
5    | Automatic | 0.9300        | 0.1938     | 0.0709  | 0.1568      | 0.8645   | 0.4532

Table 2. Results on Sub-Task 2

Rank | Run Type       | Precision@N | Recall | F-Score
1    | Semi-automatic | 0.9378      | 0.9756 | 0.9563

4 Possible Improvements

Depending on the kind of data, a sentiment analysis module could be added to the classification pipeline. However, since such a system should be ready to use as soon as a disaster happens, the weight of this additional module can be tuned as a hyperparameter on data from similar incidents that have already occurred.

References

1. Basu, M., Ghosh, S., Ghosh, K.: Overview of the FIRE 2018 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In: Proceedings of FIRE 2018 - Forum for Information Retrieval Evaluation (December 2018)
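As a closing illustration of the matching step described in Section 2.2, the sketch below replaces Lucene (which the run actually used) with a TF-IDF cosine similarity from scikit-learn; the absolute scores and the 0.30 threshold are therefore only indicative and not comparable to Lucene's scoring. The example articles and tweet are made up.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_match(query, docs):
    # Return (index, score) of the document that best matches the query.
    # Stand-in for Lucene scoring: TF-IDF vectors plus cosine similarity.
    vec = TfidfVectorizer()
    matrix = vec.fit_transform(docs + [query])       # documents plus the query
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    best = int(np.argmax(scores))
    return best, float(scores[best])

# Hypothetical test documents: headline + first 3 sentences per article
# (step 2), pre-processed in the same way as the tweets.
articles = [
    "quake toll rise kathmandu hospit overwhelm injur",
    "airport reopen relief flight land kathmandu suppli arriv",
]
tweet = "kathmandu hospit overwhelm injur peopl"     # pre-processed query (step 3)

idx, score = best_match(tweet, articles)
if score > 0.30:                                     # threshold from step 4
    # Step 5: match the tweet against each sentence of the relevant article.
    sentences = articles[idx].split(".")             # naive sentence split
    sent_idx, _ = best_match(tweet, sentences)
    print(f"article {idx} (score {score:.2f}), sentence: {sentences[sent_idx]}")
else:
    print("no relevant news article found")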