=Paper=
{{Paper
|id=Vol-2266/T1-6
|storemode=property
|title=A String Kernel based Information Retrieval Approach for Tweet-validation through News
|pdfUrl=https://ceur-ws.org/Vol-2266/T1-6.pdf
|volume=Vol-2266
|authors=Muhammad Rafi,Fizza Abid,Anum Mirza,Hamza Mustafa Khan
|dblpUrl=https://dblp.org/rec/conf/fire/RafiAMK18
}}
==A String Kernel based Information Retrieval Approach for Tweet-validation through News==
Muhammad Rafi, Fizza Abid, Anum Mirza, and Hamza Mustafa Khan

National University of Computer & Emerging Sciences - FAST, Karachi, Pakistan

muhammad.rafi@nu.edu.pk

http://www.khi.nu.edu.pk

Abstract. Micro-blogging websites like Twitter are very popular among internet users; over 100 million tweets are posted every day. These websites are active in the efficient dissemination of information pertinent to emergencies such as floods and earthquakes. Recent research has shown that these platforms can be used effectively for monitoring, evaluating and coordinating relief operations in such situations. A critical issue for such applications is automatically identifying, during the emergency, whether these posts are factual information or rumors. The idea is to verify the tweet against some other authentic news source. The Forum for Information Retrieval Evaluation (FIRE 2018) edition included a shared task on Information Retrieval from Microblogs during Disasters (IRMiDis). Subtask 1 is identifying tweets, from their content, as fact or fact-checkable tweets. The main idea of this task is to identify the validity of the tweets so that rumors or baseless situational tweets can be filtered from the context of monitoring and response activity in such challenging emergency situations. The paper proposes a string kernel based information retrieval approach for tweet-validation through NEWS. Our approach is based on two-step information retrieval. In the first step, we consider a given tweet as a query and find a best-matching headline using the Aho-Corasick algorithm with a score greater than α. In the next step, we match the content of the news with the tweet using cosine similarity; if this score is also greater than β, we have a supporting news item for the given tweet. We learn the values α = 0.11 and β = 0.25 through experimentation.
Our proposed approach performed second best in the competition, with an overall NDCG of 0.667.

Keywords: String kernel approach · Aho-Corasick · Vector space model · tweet validation

===1 Introduction===
Twitter is considered a chief communication platform in critical and disaster situations by scholars and practitioners. It has been observed that during hurricanes and real-time recovery events, Twitter is the most used platform. The significance of Twitter cannot be denied, as it alerts individuals and helps them recover from such disasters. In crisis situations, Twitter lets the voice of the common people reach higher authorities, who are then able to react in the best possible manner [6].

However, tweets can be classified as valid or rumors. Automatic rumor detection depends heavily on an authentic source of information against which a claim can be checked. Other challenges include semantic information processing, handling of variational or piecewise information, bias in the information, and the origin of the information. At times, rumors are created intentionally to mislead the audience; in such situations, it is difficult to understand the semantics completely. Rumors are of multiple types depending on style and language; thus, different algorithms are required [5]. Otherwise, algorithms trained on limited data will fail due to training biases.

In order to detect valid tweets, the hashtags/keywords are extracted from the tweets and compared with news sources, because news sources are authentic and can readily validate the tweets. NEWS items are the most credible source against which the authenticity and validity of tweets can be checked. The assumption is that a tweet will be discussed in the news only if it is based on reality and is considered a fact. This approach is simple yet effective.
The Forum for Information Retrieval Evaluation (FIRE 2018) edition launched a shared task on Information Retrieval from Microblogs during Disasters (IRMiDis). The main aim of subtask 1 is to identify the validity of tweets so that rumors or unauthentic situational tweets can be filtered from the context of monitoring and response activity in such challenging emergency situations [1].

===2 Literature Review===
A variety of communication platforms have evolved with the increasing demand for internet-enabled communication. Conventional media outlets for news, for instance newspapers, radio, and TV, are no longer as widely used in the modern era. Nowadays, one such platform, Twitter, is widely used during catastrophic situations such as natural disasters, hurricanes, and earthquakes. Twitter has proven effective during many past catastrophic events for disseminating timely information, monitoring relief operations, and responding to them. Twitter also has the potential to enhance survival during tornado-related disasters [4].

One simple approach to tweet validation is to check whether the information is available from some authentic news source. Motivated by this thought, we propose a two-fold string kernel based approach to the problem. The first part of our approach finds the similarity between tweet and news using the Aho-Corasick algorithm. This algorithm is used for matching keywords/hashtags extracted from tweets with the headlines of the news items. This string kernel based algorithm is beneficial because its searching phase is straightforward and it finds every occurrence of each string [8]. The algorithm builds a finite state machine from the keywords, with additional links between its internal nodes. It is fast and exact because the text is scanned once without backtracking [7][9].

The text is then transformed into the vector space model so that cosine similarity can be computed later.
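The keyword-to-headline matching step described above can be sketched as follows. This is a minimal, self-contained illustration of Aho-Corasick matching and not the authors' implementation. The function names are hypothetical. The scoring rule (the fraction of tweet keywords found in the headline) is also an assumption, since the text does not state how the headline score is computed.

```python
from collections import deque


def build_automaton(keywords):
    """Build the trie, failure links and output sets for the keywords."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    out = [set()]    # keywords recognised on reaching each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out


def find_keywords(text, automaton):
    """Scan the text once and return the set of keywords that occur in it."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


def headline_score(tweet_keywords, headline):
    """Assumed score: share of tweet keywords that appear in the headline."""
    keywords = {k.lower() for k in tweet_keywords if k}
    if not keywords:
        return 0.0
    automaton = build_automaton(keywords)
    return len(find_keywords(headline.lower(), automaton)) / len(keywords)
```

For example, `headline_score(["earthquake", "nepal", "relief"], "Nepal earthquake: death toll rises")` finds two of the three keywords and returns 2/3, which would pass a headline threshold of α = 0.11. Matching here is at the character level, so a keyword also matches inside a longer word; a word-boundary check would be a further design choice.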
The vector space model transforms the text into a vector, with each distinct term as a component. For text, the basic idea of a vector space model is to treat each individual term as its own dimension. If a document D contains M words, then w_i is the i-th word in D, where i ∈ [1...M]. Moreover, the set of distinct words w_i is called the vocabulary, or term space, denoted by V. With the vector space model (VSM), it is simple to measure similarity between two documents. It is also used for document encoding (tf-idf) [3][10].

Later, to compute similarity between the tweet and the news content, the cosine similarity measure is used. Cosine similarity is the cosine of the angle between two vectors. In information retrieval and related fields, cosine similarity is a commonly used metric. In this metric, text is represented as a vector of terms, and the similarity between two texts is obtained from the cosine of the angle between the two term vectors [2][3][10].

It was a challenge to learn accurate values of α (the similarity threshold between tweet keywords/hashtags and the headlines) and β (the cosine similarity threshold between tweet and news content). We used a trial-and-error method to find these values. With α = 0.11 and β = 0.25, the results were satisfactory.

===3 Proposed Approach===
The problem is formulated as an information retrieval problem. Given a tweet, we search the provided news items to see whether it has some supporting news. A news item has two distinct parts: (i) the NEWS title or heading and (ii) the NEWS content. Our approach uses a two-fold similarity measure for identifying factual tweets.
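The content-matching half of this two-fold measure can be sketched in a few lines. This is an illustration under stated assumptions and not the authors' code: it uses raw term counts rather than tf-idf weights, a simple alphanumeric tokenizer, and hypothetical function names. Only the threshold β = 0.25 comes from the text.

```python
import math
import re
from collections import Counter

BETA = 0.25  # content-similarity threshold reported in the text


def term_vector(text):
    """Map a text to term counts; each distinct term is one dimension."""
    # Assumption: lowercase alphanumeric tokens, raw counts (no idf weighting).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def content_supports(tweet, news_content, beta=BETA):
    """True if the news content is similar enough to support the tweet."""
    score = cosine_similarity(term_vector(tweet), term_vector(news_content))
    return score > beta
```

With these definitions, the tweet "flood in kerala" and the news text "kerala flood relief camps open" share two of their terms, giving a cosine of 2/√15 (about 0.52), which is above β. Swapping the raw counts for tf-idf weights would only change `term_vector`; the cosine computation stays the same.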
At first we compute a string-based similarity between the tweet and the NEWS heading using the Aho-Corasick algorithm for string searching. The complexity of this approach is linear, and since NEWS headings are short texts, it is quite quick. For all news items with a similarity value higher than α (a fixed value we learn from the data), we compute the similarity of the NEWS content using simple vector-based cosine similarity. If a tweet has supporting news item(s) with a similarity score higher than β (another fixed parameter), the news item is retrieved as supporting. We arrange all supporting news in decreasing order of similarity, used as rank scores.

The algorithm is given below. It takes as input T, the set of tweets {t1, t2, ..., tn}, and N, the set of NEWS items {n1, n2, ..., nk}, where each ni comprises a heading and content. The algorithm classifies each tweet as factual or non-factual and yields output in the form of tk, a tweet, and Nv, its supporting news {nv1, nv2, ..., nvp}.

 foreach ti ∈ T do
   Preprocess(ti)
   foreach ni ∈ N do
     Preprocess(ni)
     α = StringKernelMatch(ti, ni)
     if α > threshold α then
       β = cosinesimilarity(ti, ni);
       if β