=Paper= {{Paper |id=Vol-2266/T1-6 |storemode=property |title=A String Kernel based Information Retrieval Approach for Tweet-validation through News |pdfUrl=https://ceur-ws.org/Vol-2266/T1-6.pdf |volume=Vol-2266 |authors=Muhammad Rafi,Fizza Abid,Anum Mirza,Hamza Mustafa Khan |dblpUrl=https://dblp.org/rec/conf/fire/RafiAMK18 }} ==A String Kernel based Information Retrieval Approach for Tweet-validation through News== https://ceur-ws.org/Vol-2266/T1-6.pdf
     A string kernel based Information Retrieval
   approach for tweet-validation through NEWS


  Muhammad Rafi1, Fizza Abid1, Anum Mirza1, and Hamza Mustafa Khan1
 National University Of Computer & Emerging Sciences - FAST, Karachi, Pakistan
                           muhammad.rafi@nu.edu.pk
                           http://www.khi.nu.edu.pk



      Abstract. Micro-blogging websites like Twitter are very popular among
      internet users; over 100 million tweets are posted every day. These
      websites disseminate information efficiently during emergencies such
      as floods and earthquakes. Recent research has shown that these
      platforms can be used effectively for monitoring, evaluating and
      coordinating relief operations in such situations. A critical issue
      for such applications is automatically identifying, during the
      emergency, whether a post is factual information or a rumor. The idea
      is to verify the tweet against another authentic news source. The 2018
      edition of the Forum for Information Retrieval Evaluation (FIRE 2018)
      included a shared task on Information Retrieval from Microblogs during
      Disasters (IRMiDis). Subtask 1 is to identify, from their content,
      tweets that are factual or fact-checkable. The main aim of this task is
      to establish the validity of tweets so that rumors or baseless
      situational tweets can be filtered out of the monitoring and response
      activity in such challenging emergency situations. The paper proposes a
      string kernel based information retrieval approach for tweet-validation
      through NEWS. Our approach is a two-step information retrieval process.
      In the first step, we treat a given tweet as a query and find the
      best-matching headline using the Aho-Corasick algorithm, requiring a
      score greater than α. In the next step, we match the content of the
      news item with the tweet using cosine similarity; if this score is also
      greater than β, we have a supporting news item for the given tweet. We
      learn the values α=0.11 and β=0.25 through experimentation. Our
      proposed approach performed second best in the competition, with an
      overall NDCG of 0.667.

      Keywords: String kernel approach · Aho-Corasick · Vector space model
      · tweet validation


1 Introduction
Scholars and practitioners consider Twitter a chief communication platform in
crisis and disaster situations. It has been observed that Twitter is the most
used platform during hurricanes and real-time recovery events. Its significance
cannot be denied, as it alerts individuals and helps them recover from such
disasters. In crisis situations, Twitter lets the voice of common people reach
higher authorities, who can then react in the best possible manner [6].
    However, tweets can be classified as valid or rumors. Automatic rumor
detection depends heavily on an authentic source of information against which a
claim can be confirmed. Other challenges include semantic information
processing, handling of variational or piecewise information, bias in the
information, and the origin of the information. At times, rumors are created
intentionally to mislead the audience; in such situations, it is difficult to
understand the semantics completely. Rumors are of multiple types depending on
style and language; thus, different algorithms will be required [5]. Otherwise,
algorithms trained on limited data will fail due to training biases.
    To detect valid tweets, hashtags/keywords are extracted from the tweets and
compared with news sources, because news sources are authentic and can readily
validate the tweets. NEWS items are the most credible source against which the
authenticity and validity of tweets can be checked. A tweet will be discussed
in the news only if it is based on reality and is considered a fact. This
approach is simple yet effective.
    The 2018 edition of the Forum for Information Retrieval Evaluation (FIRE
2018) launched a shared task on Information Retrieval from Microblogs during
Disasters (IRMiDis). The main aim of subtask 1 is to identify the validity of
tweets so that rumors or unauthentic situational tweets can be filtered out of
the monitoring and response activity in such challenging emergency situations
[1].


2 Literature Review
A variety of communication platforms have evolved with the increasing demand
for internet-enabled communication. Conventional media outlets for news, such
as newspapers, radio, and TV, are no longer relied upon as they once were.
Nowadays, one such platform, Twitter, is widely used during catastrophic
situations such as natural disasters, hurricanes, and earthquakes. Twitter has
proven effective during many recent catastrophic events for dissemination of
timely information, monitoring of relief operations, and response. Twitter also
has the potential to enhance survival during tornado-related disasters [4].
    One straightforward approach to tweet validation is to check whether the
information is available from some authentic news source. Motivated by this
thought, we propose a two-fold string kernel based approach to the problem. The
first part of our approach finds the similarity between a tweet and the news
using the Aho-Corasick algorithm. This algorithm is used to match
keywords/hashtags extracted from tweets against the headlines of the news
items. This string kernel based algorithm is extremely beneficial, as its
searching phase is straightforward and it finds every occurrence of each string
[8]. The algorithm builds a finite state machine over the keywords, with
transitions between its internal nodes. It is extremely fast and accurate, as
backtracking over the text is not required [7] [9].
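The matching step described above can be illustrated with a compact Aho-Corasick sketch. This is a minimal textbook version written for illustration, not the authors' implementation; the function names `build_automaton` and `find_all` and the sample keywords are assumptions introduced here.

```python
from collections import deque


def build_automaton(keywords):
    """Build the trie, failure links, and output sets for the keywords."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # keywords recognised on entering each state

    # Phase 1: insert every keyword into the trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    # Phase 2: breadth-first pass that sets the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, keyword) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # Follow failure links instead of re-reading earlier characters.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


# Hypothetical tweet keywords matched against a hypothetical headline.
automaton = build_automaton(["earthquake", "nepal", "quake"])
hits = sorted(find_all("nepal earthquake death toll rises", automaton))
```

The text is scanned once, left to right, and overlapping keywords (here "quake" inside "earthquake") are all reported, which is the property the paragraph above relies on.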
    The text is transformed into the vector space model so that cosine
similarity can later be computed. The vector space model represents a text as a
vector in which each distinct term is its own dimension. If a document D
contains M words, then wi is the ith word in D, where i ∈ [1...M]. The set of
distinct words wi is called the vocabulary or term space, denoted by V. With
the vector space model (VSM), it is simple to measure the similarity between
two documents. It is also used for document encoding (tf-idf) [3][10].
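As an illustration of this encoding, the sketch below builds sparse tf-idf vectors over a small collection. The paper does not state which weighting variant was used, so the plain `tf * log(N / df)` form, the whitespace tokenizer, and the names `tokenize` and `tfidf_vectors` are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency in this document
        vectors.append(
            {term: count * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return vectors


# Hypothetical two-document collection.
vectors = tfidf_vectors(["flood in city", "flood relief"])
```

Each distinct term is one dimension of the vector. A term that occurs in every document (here "flood") receives weight zero under this variant, so it does not contribute to the similarity between documents.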
    Later, to compute the similarity between the tweet and the news content,
the cosine similarity measure is used. Cosine similarity is the cosine of the
angle between two associated vectors, and it is a commonly used metric in
information retrieval and related topics. In this metric, each text is
represented as a vector of terms and the similarity between two texts is
obtained from the cosine value between the two term vectors [2][3][10].
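For two term vectors a and b, the measure is the dot product divided by the product of the vector lengths. A minimal sketch over sparse term-count vectors follows; the function name `cosine_similarity`, the use of raw term counts, and the sample texts are assumptions for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical tweet and news sentence, encoded as raw term counts.
tweet = Counter("flood hits city".split())
news = Counter("flood hits town".split())
score = cosine_similarity(tweet, news)
```

With non-negative weights the score lies between 0 (no shared terms) and 1 (same direction), which makes a fixed threshold meaningful.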
    It was a challenge to learn accurate values of α (the threshold on the
similarity between tweet keywords/hashtags and the headlines) and β (the
threshold on the cosine similarity between tweet and news content). We used
trial and error to learn both values. With α=0.11 and β=0.25, the results were
satisfactory.

3 Proposed Approach
The problem is formulated as an information retrieval problem. Given a tweet,
we search the provided news items to see whether it has some supporting news.
A news item has two distinct parts: (i) the NEWS title or heading and (ii) the
NEWS content. Our approach uses a two-fold similarity measure for identifying
factual tweets. First, we compute a string based similarity between the tweet
and the NEWS heading using the Aho-Corasick string-searching algorithm; the
complexity of this step is linear, and as NEWS headings are short texts it is
quite quick. For all news items with a similarity value higher than α (a fixed
value we learn from the data), we compute the similarity of the NEWS content
using simple vector based cosine similarity. If a news item also has a content
similarity score higher than β (another fixed parameter), it is retrieved as
supporting. We arrange all supporting news items in decreasing order of
similarity, used as rank scores.
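The two-stage filter described above can be sketched as follows. The thresholds are the values reported in the paper, but the paper does not define how the headline score is computed, so `headline_score` (the share of distinct tweet tokens that occur in the title, with a plain substring test standing in for Aho-Corasick) is an assumption, as are the function names and the sample data.

```python
import math
from collections import Counter

ALPHA = 0.11  # headline threshold reported in the paper
BETA = 0.25   # content threshold reported in the paper


def headline_score(tweet, title):
    """ASSUMPTION: share of distinct tweet tokens found in the title."""
    tokens = set(tweet.lower().split())
    if not tokens:
        return 0.0
    title = title.lower()
    return sum(tok in title for tok in tokens) / len(tokens)


def content_score(tweet, content):
    """Cosine similarity over raw term counts (weighting is assumed)."""
    a = Counter(tweet.lower().split())
    b = Counter(content.lower().split())
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def supporting_news(tweet, news, alpha=ALPHA, beta=BETA):
    """Return (score, title) pairs that pass both stages, best first."""
    ranked = []
    for title, content in news:
        # Stage 1: cheap headline match filters most news items.
        if headline_score(tweet, title) > alpha:
            # Stage 2: content similarity on the survivors only.
            score = content_score(tweet, content)
            if score > beta:
                ranked.append((score, title))
    ranked.sort(reverse=True)
    return ranked


# Hypothetical tweet and news collection.
news = [
    ("Nepal earthquake kills hundreds",
     "a strong earthquake hits nepal and kills hundreds"),
    ("Football final tonight", "the final match starts tonight"),
]
result = supporting_news("earthquake hits nepal", news)
```

A tweet with a non-empty result would be treated as factual, and the content scores serve as the rank scores of its supporting news items.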
    The algorithm is given below. It takes as input T: the set of tweets = t1,
t2, ..., tn and N: the set of NEWS items = n1, n2, ..., nk, where each ni
comprises a title and content. The algorithm classifies each tweet as factual
or non-factual and yields output in the form of tk: a tweet and Nv: its
supporting news = nv1, nv2, ..., nvp.

    foreach ti ∈ T do
       Preprocess(ti )
        foreach ni ∈ N do
          Preprocess(ni )
           α = StringKernelMatch(ti , ni , )
           if α >threshold α then
              β = cosinesimilarity(ti , ni , );
              if β