<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A string kernel based Information Retrieval approach for tweet-validation through NEWS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Muhammad Rafi</string-name>
          <email>muhammad.rafi@nu.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fizza Abid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anum Mirza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hamza Mustafa Khan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University Of Computer &amp; Emerging Sciences - FAST</institution>
          ,
          <addr-line>Karachi</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Micro-blogging websites like Twitter are very popular among internet users; over 100 million tweets are posted every day. These websites are active in the efficient dissemination of information pertinent to any emergency, such as a flood or an earthquake. Recent research has shown that these platforms can effectively be used for monitoring, evaluation, and coordination of relief operations in such situations. One very critical issue for such applications is automatically identifying whether these posts are factual information or rumors during the emergency. The idea is to verify a tweet against some other authentic news source. The Forum for Information Retrieval Evaluation (FIRE 2018) edition included a shared task on Information Retrieval from Microblogs during Disasters (IRMiDis). Subtask 1 is to identify, from their content, tweets that are facts or fact-checkable. The main idea of this task is to determine the validity of tweets so that rumors or baseless situational tweets can be filtered out in the context of the monitoring and response activity of such challenging emergency situations. This paper proposes a string kernel based information retrieval approach for tweet validation through news. Our approach is based on two-step information retrieval. In the first step, we consider a given tweet as a query and find the best matching headline using the Aho-Corasick algorithm with a score greater than α. In the next step, we match the content of the news with the tweet using cosine similarity; if this score is greater than β, we have a supporting news item for the given tweet. We learned the values α = 0.11 and β = 0.25 through experimentation. Our proposed approach performed second best in the competition, with an overall NDCG of 0.667.</p>
      </abstract>
      <kwd-group>
        <kwd>String kernel approach</kwd>
        <kwd>Aho-Corasick</kwd>
<kwd>Vector space model</kwd>
        <kwd>Tweet validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        Twitter is considered a chief communication platform in critical and disaster
situations by both scholars and practitioners. It has been observed
that during hurricanes and real-time recovery events, Twitter is the most used
platform. The significance of Twitter cannot be denied, as it alerts individuals and
helps them recover from such disasters. In crisis situations, through Twitter, the
voice of the common people reaches the higher authorities, who are then able
to react in the best possible manner [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
<p>
        However, tweets can be classified as valid or as rumors. Automatic rumor
detection depends heavily on an authentic source of information against which
claims can be verified. Other challenges include semantic information
processing, variational or piecewise information handling, bias in the information,
and the origin of the information. At times, rumors are created intentionally to
mislead the audience; in such situations, it is complicated to understand the
semantics completely. Rumors are of multiple types depending on style and
language; thus, different algorithms will be required [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Otherwise, algorithms
trained on limited data will fail due to training biases.
      </p>
<p>To detect valid tweets, hashtags/keywords are extracted from
the tweets and then compared with news sources, because news
sources are authentic and can readily validate the tweets. News items are
the most credible source against which the authenticity and validity of tweets
can be checked: only if a tweet is based on reality and is considered a fact
will it be discussed in the news. This approach is simple yet effective.</p>
<p>
        The Forum for Information Retrieval Evaluation (FIRE 2018) edition launched
a shared task on Information Retrieval from Microblogs during Disasters
(IRMiDis). The aim of subtask 1 is to identify the validity of tweets
so that rumors or unauthentic situational tweets can be filtered out in the
context of the monitoring and response activity of such challenging emergency
situations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Literature Review</title>
<p>
        A variety of communication platforms have evolved with the increasing demand for
internet-enabled communication. Conventional media outlets, such as
newspapers, radio, and TV, are no longer the primary source of news in the modern era.
Nowadays, Twitter in particular has been widely used during
catastrophic situations such as natural disasters, hurricanes, and earthquakes. Twitter
proved effective during many recent catastrophic events for the dissemination of
timely information, the monitoring of relief operations, and the response to them. Twitter also
has the potential to enhance survival during tornado-related disasters [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
<p>
        One of the simplest approaches to tweet validation is to check whether the
information is available from some authentic news source. Motivated by this,
we propose a two-fold string kernel based approach to the problem. The first
part of the approach finds the similarity between a tweet and a news item using the Aho-Corasick
algorithm. This algorithm is used to match keywords/hashtags
extracted from tweets against the news headlines. This string kernel
based algorithm is extremely beneficial, as its searching phase is straightforward
and finds each and every occurrence of a string [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The algorithm builds a finite
state machine, a keyword trie with failure links between its internal nodes. It is extremely fast
and accurate, as no backtracking is required [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
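The matching step above can be sketched in Python. This is a minimal, illustrative Aho-Corasick implementation (a keyword trie with breadth-first failure links), not the authors' code; the class and method names are our own.

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: a trie plus BFS-built failure links."""

    def __init__(self, keywords):
        # goto is the trie as a list of dicts; out[node] holds keywords ending here
        self.goto = [{}]
        self.out = [set()]
        self.fail = [0]
        for word in keywords:
            node = 0
            for ch in word:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.out.append(set())
                    self.fail.append(0)
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].add(word)
        # build failure links breadth-first, merging outputs along the way
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] |= self.out[self.fail[child]]

    def search(self, text):
        """Yield (end_index, keyword) for every occurrence in one left-to-right pass."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                yield (i, word)
```

Because the failure links remove all backtracking, a single scan of a headline reports every keyword occurrence, which is what makes the headline-matching step linear.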
<p>
        The text is transformed into the vector space model so that cosine
similarity can later be computed. The vector space model transforms the text (each distinct term)
into a vector: the basic idea is to treat each
individual term as its own dimension. If we have a document D of
length L, then wi is the ith word in D, where i ∈ [1...L]. Moreover, the
set of distinct words wi is called the vocabulary or term space, denoted by
V. With the vector space model (VSM), it is simple to measure the similarity between
two documents. It is also used for document encoding (tf-idf) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
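As a concrete illustration of the encoding above, the following sketch builds tf-idf vectors over the term space V using only the standard library. The tokenization and the exact weighting (raw term frequency times log inverse document frequency) are assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode tokenized documents as sparse tf-idf dicts over the term space V.

    docs: list of documents, each a list of tokens.
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {term: (count / len(doc)) * math.log(n / df[term])
               for term, count in tf.items()}
        vectors.append(vec)
    return vectors
```

Note that a term occurring in every document gets weight zero under this scheme, so only discriminative terms contribute to the similarity.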
<p>
        Later, to compute the similarity between the tweet and the news content, the
cosine similarity measure is used. Cosine similarity is based on the angle between two
associated vectors and is a commonly used metric in information retrieval
and related fields. In this metric, a text is represented as a vector of terms,
and the similarity between two texts is obtained from the cosine of the angle between their two
vectors of terms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
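The measure above can be written directly for the sparse term-vector encoding. This is a minimal sketch assuming vectors are Python dicts mapping terms to weights, as in the tf-idf encoding described earlier; it is not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors (dicts of term weights)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty text is dissimilar to everything
    return dot / (norm_u * norm_v)
```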
<p>It was a challenge to learn accurate values of α (the similarity between
tweet keywords/hashtags and the headlines) and β (the cosine similarity between
the tweet and the news content). We used a trial-and-error method to learn
these values. At α = 0.11 and β = 0.25, the results were satisfactory.</p>
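The hit-and-trial tuning described above amounts to a small grid sweep over candidate threshold pairs. The sketch below is hypothetical: evaluate is assumed to score an (alpha, beta) pair, for example by accuracy on a labelled sample of tweets.

```python
def tune_thresholds(evaluate, alphas, betas):
    """Trial-and-error sweep: try every (alpha, beta) pair, keep the best score.

    evaluate(alpha, beta) is assumed to return a quality score; higher is better.
    Returns (best_alpha, best_beta, best_score).
    """
    best = (None, None, float("-inf"))
    for a in alphas:
        for b in betas:
            score = evaluate(a, b)
            if score > best[2]:
                best = (a, b, score)
    return best
```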
    </sec>
    <sec id="sec-3">
      <title>Proposed Approach</title>
<p>The problem is formulated as an information retrieval problem. Given a tweet, we
look through the provided news items to see whether it has some supporting news
or not. A news item has two distinct parts: (i) the news title or headline
and (ii) the news content. Our approach uses a two-fold similarity measure for
identifying factual tweets. First, we compute a string based similarity between the
tweet and the news headline using the Aho-Corasick string-searching algorithm;
the complexity of this step is linear, and since news headlines are short texts, it
is quite quick. For all news items with a similarity value higher than α (a
fixed value we learn from the data), we compute the similarity of the news content
using simple vector based cosine similarity. If a tweet has supporting news
item(s) with a similarity score higher than β (another fixed parameter),
those news items are retrieved as supporting, and we arrange all supporting news in
decreasing order of similarity as rank scores.</p>
<p>The algorithm is given below. It takes as input T, the set of tweets t1, t2, ..., tn,
and N, the set of news items n1, n2, ..., nk, where each ni comprises a pair &lt;hi, ci&gt;
of headline and content. The algorithm classifies each tweet as factual or non-factual and yields
output in the form of tk, the tweet, and Nv, the supporting news items nv1, nv2, ..., nvp.</p>
      <preformat>
foreach ti in T do
    Preprocess(ti)
    foreach ni = &lt;hi, ci&gt; in N do
        Preprocess(ni)
        α = StringKernelMatch(ti, hi)
        if α &gt; threshold_α then
            β = CosineSimilarity(ti, ci)
            if β &gt; threshold_β then
                tk = ti
                Nv = Nv ∪ {ni}
            end
        end
    end
end
return (tk, Nv)
      </preformat>
      <p>The dataset contained 5006 tweets, and news items for validating the tweets
were also present in the dataset. To test our approach, we first ran
the algorithm on 100 tweets. The first submission is based on tweet similarity
with news headlines and news content. Aho-Corasick is used to compute the tweet-headline
similarity (α), and if the value of α is higher than 0.11,
the similarity of the content with the tweet is computed using cosine similarity (β).
If it is higher than 0.25, we retrieve all such news items as supporting news
items in decreasing order of similarity scores. In Run #2, hashtags
are extracted from the tweets, and hashtag
based similarity is computed between these hashtags and the news headlines. If these
values are higher than 0.11, we again compute the similarity of the content using
cosine similarity; if it is higher than 0.25, we retrieve all such news items as
supporting news items in decreasing order of similarity scores. On the other
hand, Run #3 combines the techniques of Run 1 and Run 2, but
the results were not effective.</p>
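The two-step retrieval described in this section can be sketched end to end. This is a hypothetical outline, not the authors' code: match_fn stands in for the Aho-Corasick headline match and sim_fn for the cosine content similarity, both assumed to return scores in [0, 1]; the thresholds default to the learned values α = 0.11 and β = 0.25.

```python
def validate_tweet(tweet, news_items, match_fn, sim_fn, alpha=0.11, beta=0.25):
    """Two-step retrieval: filter by headline match, then rank by content similarity.

    tweet:      the tweet text (the query).
    news_items: list of (headline, content) pairs.
    match_fn:   scores tweet vs. headline (the string-kernel step).
    sim_fn:     scores tweet vs. article content (the cosine step).
    Returns a label and the supporting news ranked by decreasing similarity.
    """
    supporting = []
    for headline, content in news_items:
        # step 1: cheap headline match acts as a filter
        if match_fn(tweet, headline) > alpha:
            # step 2: full-content similarity confirms support
            score = sim_fn(tweet, content)
            if score > beta:
                supporting.append((score, headline))
    supporting.sort(reverse=True)  # rank by decreasing similarity
    return ("factual" if supporting else "non-factual", supporting)
```

A toy scoring function such as token overlap is enough to exercise the pipeline; in the paper the two steps use Aho-Corasick matching and tf-idf cosine similarity instead.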
    </sec>
    <sec id="sec-4">
      <title>Results &amp; Discussion</title>
<p>The algorithm was initially tested on 100 tweets to evaluate the string
kernel based approach both automatically and manually. On executing Run 1, the accuracy was
27%. On manually evaluating the results, we observed that the keywords extracted from
the tweets are not very accurate, as no library has yet been designed to
extract keywords from short text and research on this is ongoing; we used the
RAKE library, which is intended for long text. The overall NDCG score was not
satisfactory. Run 2, which was based on the hashtag approach, showed relatively better scores,
with an accuracy of 70%. Many tweets did not contain hashtags; otherwise, the
accuracy would have been higher. Lastly, the accuracy of Run 3 was 28%;
however, its NDCG score was 0.3108, which is not efficient.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
<p>A great deal of research in the area of fact-checking of information has been done over
the last 10 years. One of the simplest approaches to information validation is to
verify it through authentic news sources. Moreover, identifying reliable news items
related to a tweet and computing a credibility score is an active area of research. We
have proposed a string kernel based approach for this task, and our proposed system
was ranked second in the competition. In future work, better keyword extraction from
tweets could be used in this approach. We foresee two further possible extensions:
1) semantic matching of tweet and news content, and 2) word sense disambiguation,
both of which would definitely increase the performance of the system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Basu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the FIRE 2018 track: InformationRetrieval from Microblogs during Disasters (IRMiDis)</article-title>
          .
          <source>In: Proceedings of FIRE 2018 - Forum for Information Retrieval Evaluation (December</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Pradhan</surname>
          </string-name>
          . N.,
          <string-name>
            <surname>Gyanchandani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wadhvani</surname>
          </string-name>
          ,R .:
          <article-title>Article: A Review on Text Similarity Technique used in IR and its Application</article-title>
          .
          <source>International Journal of Computer Applications</source>
          ,
          <volume>120</volume>
          (
          <issue>9</issue>
          ),
          <fpage>29</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Price</surname>
            . S, Flach,
            <given-names>P.A</given-names>
          </string-name>
          , Spiegler. .:
          <article-title>SubSift: a novel application of the vector space model to support the academic research process</article-title>
          . PMLR,(
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Niles</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Emery</surname>
            ,
            <given-names>B. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reagan</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danforth</surname>
            ,
            <given-names>C. M..:</given-names>
          </string-name>
          <article-title>Average individuals tweet more often during extreme events: An ideal mechanism for social contagion</article-title>
          . arXiv preprint arXiv:
          <year>1806</year>
          .
          <volume>07451</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>and</article-title>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Automatic Rumor Detection on Microblogs: A Survey</source>
          . arXiv preprint arXiv:
          <year>1807</year>
          .
          <volume>03505</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tandoc</surname>
            <given-names>Jr</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.C.</given-names>
            and
            <surname>Carmichael</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Communicating on Twitter during a disaster: An analysis of tweets during Typhoon Haiyan in the Philippines</article-title>
          .
          <source>Computers in Human Behavior 50</source>
<volume>392-398</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hasib</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Saxena</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Importance of aho-corasick string matching algorithm in real world applications</article-title>
          .
          <source>international journal of computer science and information technologies 4</source>
          (
          <issue>3</issue>
          ),
<volume>467-469</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lodhi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saunders</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Watkins</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
<article-title>Text classification using string kernels</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>419</fpage>
          <lpage>444</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Aho</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Corasick</surname>
            ,
            <given-names>M.J.:</given-names>
          </string-name>
<article-title>Efficient string matching: an aid to bibliographic search</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>18</volume>
          (
          <issue>6</issue>
          ),
<volume>333-340</volume>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Paul</surname>
            ,M.: Drugs and
            <given-names>Popular</given-names>
          </string-name>
          <string-name>
            <surname>Culture</surname>
          </string-name>
          . Willan, (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>