<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fully Automatic Approach to Identify Factual or Fact-checkable Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarthak Anand</string-name>
          <email>sarthaka.ic@nsit.net.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajat Gupta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajiv Ratn Shah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ponnurangam Kumaraguru</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indraprastha Institute of Information Technology</institution>
          ,
          <addr-line>Delhi 110020</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maharaja Agrasen Institute of Technology</institution>
          ,
          <addr-line>Delhi 110086</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Netaji Subhas Institute of Technology</institution>
          ,
          <addr-line>New Delhi 110078</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the solution of team MIDAS of IIIT Delhi for the IRMiDis track at FIRE 2018. We present our solution for the identification of factual or fact-checkable tweets from a dataset of about 50,000 tweets posted during the 2015 Nepal earthquake. We provide a rule-based approach for this task and compare it with a semi-supervised approach. After preprocessing steps including tokenization and cleaning, we calculate a factuality score based on the number of proper nouns and quantitative values within a tweet and finally rank the tweets according to this score. Experimental results show that this simple rule-based approach performs comparably to the semi-supervised approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Social media analysis</kwd>
        <kwd>Unsupervised learning</kwd>
        <kwd>Information retrieval</kwd>
        <kwd>Microblogs</kwd>
        <kwd>Disaster</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Social media usage has increased considerably over the last decade. People use
social media for various purposes and create a huge amount of
user-generated content. In addition to the reporting of news and events, social media
platforms are increasingly being used to aid relief operations during
mass emergencies, e.g., during the 2018 Kerala floods.</p>
      <p>However, messages posted on these sites often contain rumors and false
information. In such situations, identification of factual or fact-checkable tweets,
i.e., tweets that report some relevant and verifiable fact, is extremely important
for effective coordination of post-disaster relief operations. Additionally,
cross-verification of such critical information is a practical necessity to ensure its
trustworthiness. Considering the scale of these platforms, it is not feasible to
manually check and verify user-generated content in time. Since it is
very important to reach a person who is stuck in such an emergency in time,
automated IR techniques are needed to identify, process and verify the credibility
of information from multiple sources.</p>
      <p>
        In this paper we present one such approach, which achieved the best
performance in the FIRE 2018 challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on identifying factual tweets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Identifying factual and non-factual tweets can be treated as a supervised
classification problem. A lot of work has already been done on supervised
classification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. All of these works require large amounts of manually
labeled data.
      </p>
      <p>
        Although most works focus on supervised techniques, some have
employed unsupervised techniques. For instance, Bjorn Schuller et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
worked on a knowledge-based approach which does not demand labeled training
data. Moreover, Shailesh S. Deshpande et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a rule-based approach
for the classification of sentences, which they tested on distinguishing specific from
non-specific sentences. They computed several features for each sentence to
derive a specificity score. Similar to their approach, we extract
features from each tweet, such as the number of proper nouns (PROPN) and the
number of quantitative values (NUM), and compute a factuality score (a higher score
indicates more factual information). In our approach, we use the factuality score
to rank the tweets in order of factual information and take the top k tweets
as fact-checkable tweets.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Problem and Data Description</title>
      <p>The Information Retrieval from Microblogs during Disasters challenge had two
sub-tasks. Sub-task 1 was to identify factual or fact-checkable tweets related
to the Nepal disaster and rank them by their factuality scores.
Sub-task 2 was to map the fact-checkable tweets to appropriate news
articles. Submissions were categorized into three types based on the amount of
manual intervention: fully automatic, semi-automatic, and manual.</p>
      <p>
        Data Description. The dataset for sub-task 1 consists of about 50,000 tweets
posted during the 2015 Nepal earthquake. The dataset for sub-task 2 included around
6,000 news articles related to the 2015 Nepal earthquake. Refer to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for more
details.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Automatic Methodology</title>
      <p>The problem at hand is to rank tweets based on the information
they contain. The following sections describe the steps performed
to achieve the results and the intuition behind our approach.
1. Pre-processing of tweets, POS tagging, and finding proper nouns and
quantitative values are described in Section 4.2.
2. Computing a factuality score based on proper nouns and quantitative
values is described in Section 4.3.</p>
      <sec id="sec-4-1">
        <title>Intuition</title>
        <p>
          Similar to the findings of Shailesh S. Deshpande et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], in our study we find
that tweets containing factual information tend to include named entities
such as organizations (e.g., UN or NDRF), proper nouns such as PM Modi, and
quantitative information such as dates, times or numbers (e.g., 5 dead or 5 tonnes).
Based on this observation, we score a tweet by its number of proper
nouns and quantitative values; we call this the factuality score.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Data Preprocessing and POS tagging</title>
        <p>Since the data given to us is raw and noisy, it cannot be
used directly for analysis. Some preprocessing is necessary to make
the data suitable for POS tagging.
The following preprocessing steps were performed:
1. Tokenization: Tokenization refers to breaking the given text down
into individual words. We use spaCy’s word tokenizer to
tokenize the tweets.
2. Normalization: We perform the following step, specific to tweets, to
normalize our corpus:</p>
        <p>Stop-words and punctuation removal: Tweets usually contain
mentions, hashtags, URLs, punctuation marks and emojis. These are
not useful in determining the amount of information within a tweet and
hence are removed from our corpus.</p>
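        <p>The cleaning step can be sketched roughly as follows, using only Python's standard regular expression module. The function name clean_tweet and the patterns are illustrative assumptions and not necessarily the exact rules used in the submitted system. Decimal points and thousands separators inside numbers are kept so that values such as 7.8 or 5,000 are not split before POS tagging.</p>
        <preformat><![CDATA[
```python
import re

# Illustrative patterns (assumptions): URLs, @mentions and #hashtags.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Punctuation and emoji, except '.' or ',' that sit between two digits.
PUNCT_RE = re.compile(r"[^\w\s.,]|[.,](?!\d)|(?<!\d)[.,]")


def clean_tweet(text):
    """Remove URLs, mentions, hashtags, punctuation and emoji from a tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_tweet("Magnitude 7.8 quake, 5,000 dead."))
```
]]></preformat>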
        <p>POS tagging. In our approach, we use two major features for computing the
factuality score: the number of proper nouns and the number of quantitative values within
a tweet. We use spaCy’s POS tagger for this purpose.</p>
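        <p>A minimal sketch of the feature extraction is shown below. It counts PROPN and NUM tags in a list of (token, tag) pairs, so the counting logic can be read independently of the tagger. The helper names are hypothetical, and tag_with_spacy assumes an already loaded spaCy pipeline (for example one obtained from spacy.load with an English model of the reader's choice); the hand-tagged example is for illustration only and is not tagger output.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def count_propn_num(tagged_tokens):
    """Count proper nouns and numerals in a list of (token, pos) pairs.

    Tags follow the Universal POS scheme (PROPN, NUM), which spaCy exposes
    through the `pos_` attribute of each token.
    """
    counts = Counter(pos for _, pos in tagged_tokens)
    return counts["PROPN"], counts["NUM"]


def tag_with_spacy(text, nlp):
    """Tag a tweet with an already loaded spaCy pipeline `nlp` (assumption)."""
    return [(token.text, token.pos_) for token in nlp(text)]


# Hand-tagged example, written by hand for illustration.
example = [("NDRF", "PROPN"), ("rescues", "VERB"), ("5", "NUM"),
           ("people", "NOUN"), ("in", "ADP"), ("Kathmandu", "PROPN")]
n_propn, n_num = count_propn_num(example)
print(n_propn, n_num)
```
]]></preformat>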
      </sec>
      <sec id="sec-4-3">
        <title>Computing Factuality Score</title>
        <p>Submitted Approach 1. In this approach we count the proper
nouns and the quantitative values within each tweet. To map the
score to the range between 0 and 1, we divide the PROPN and NUM counts by the maximum
value observed for each feature. Finally, we take the average of the two
normalized values. Table 1 shows examples of calculating the factuality score; the
underlined words are proper nouns and the italicized words are numbers.
For these examples, the maximum values of PROPN and NUM were
17 and 13, respectively. (Shortcomings of and suggestions for this approach are
described in Section 7.)</p>
        <p>GitHub code available at: https://github.com/isarth/Fire_task_1</p>
        <p>To compare our automatic approach with a supervised approach, we manually
labeled around 1,500 tweets as factual or non-factual and treated sub-task
1 (see Section 3) as a binary classification problem. The confidence score of the
classifier is treated as the factuality score, which is finally used for ranking the
tweets. The following steps were
performed for the semi-automatic approach.</p>
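        <p>For reference, the Approach 1 score described at the start of this section can be sketched as below: each count is divided by the maximum observed for that feature and the two ratios are averaged. The function names and the toy counts are hypothetical and are not taken from the dataset or from Table 1.</p>
        <preformat><![CDATA[
```python
def factuality_score(n_propn, n_num, max_propn, max_num):
    """Average of the PROPN and NUM counts, each scaled by its maximum."""
    if max_propn <= 0 or max_num <= 0:
        raise ValueError("maximum counts must be positive")
    return 0.5 * (n_propn / max_propn + n_num / max_num)


def rank_tweets(counts):
    """Rank tweets by factuality score, highest first.

    `counts` maps a tweet id to its (PROPN count, NUM count) pair.
    """
    max_propn = max(p for p, _ in counts.values())
    max_num = max(n for _, n in counts.values())
    scored = {
        tid: factuality_score(p, n, max_propn, max_num)
        for tid, (p, n) in counts.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts for three tweets (not taken from the dataset).
toy_counts = {"t1": (4, 2), "t2": (0, 1), "t3": (3, 0)}
ranking = rank_tweets(toy_counts)
print(ranking)
```
]]></preformat>
        <p>With the maxima reported above (17 proper nouns and 13 numbers), a tweet reaching both maxima would score 1, and a tweet with neither feature would score 0.</p>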
      </sec>
      <sec id="sec-4-4">
        <p>1. Manually labeling a small set of tweets from the dataset.
2. Pre-processing steps, already described in Section 4.2.
3. Training a binary classifier and finally ranking tweets according to the confidence score (see Section 5.1 for details).</p>
      </sec>
      <sec id="sec-4-5">
        <title>Binary Classifier</title>
        <p>
          To classify tweets as factual or non-factual, we train both fastText [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] cbow
and bi-gram models. We split our labeled dataset into two parts, training and
validation. Table 2 shows the performance of both classifiers. Finally, to
rank tweets in order of factuality, we treat the confidence score of the bi-gram
classifier as our factuality score.
        </p>
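        <p>The ranking step can be sketched as follows, assuming the official fastText Python bindings. The label names (__label__factual and its counterpart), the training file layout and the helper names are assumptions made for illustration. The sketch ranks by the probability assigned to the factual label, which is one possible reading of the confidence score used above.</p>
        <preformat><![CDATA[
```python
def train_bigram_classifier(train_path):
    """Train a fastText supervised model with word bi-grams.

    `train_path` points to a text file with one tweet per line, prefixed by
    its label, e.g. "__label__factual NDRF rescues 5 people" (assumed layout).
    """
    import fasttext  # lazy import: only needed when actually training

    return fasttext.train_supervised(input=train_path, wordNgrams=2)


def factual_probability(labels, probs, positive="__label__factual"):
    """Return the probability assigned to the positive label (0.0 if absent)."""
    for label, prob in zip(labels, probs):
        if label == positive:
            return float(prob)
    return 0.0


def rank_by_confidence(model, tweets):
    """Rank tweets by the classifier's probability of the factual label."""
    scored = []
    for tweet in tweets:
        labels, probs = model.predict(tweet, k=-1)  # k=-1 returns all labels
        scored.append((tweet, factual_probability(labels, probs)))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```
]]></preformat>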
        <p>Table 2. Validation accuracy of the two classifiers: 0.756 and 0.796.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Result and Analysis</title>
      <p>
        Finally, Table 3 compares the results of the automatic and semi-automatic approaches
in the FIRE’18 challenge. Table 4 summarizes the final results of the teams
that participated in the FIRE’18 task with automatic submissions. We were ranked
first in the competition with an NDCG score of 0.6835. The lowest NDCG score
achieved in the competition was 0.1271. Table 5 summarizes the final results of
the teams that participated in the FIRE’18 task with semi-automatic
submissions. We were ranked second in that task. For detailed results, refer to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>We have presented our automatic approach for calculating the factuality score
based on the number of proper nouns and quantitative values within a tweet,
which provided results comparable to the semi-automatic approach in the FIRE’18
Information Retrieval from Micro-blogs during Disasters (IRMiDis) task. The
best automatic submission achieved an NDCG score of 0.6835, which placed our
team first globally in terms of NDCG score.</p>
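      <p>For readers unfamiliar with the metric, NDCG can be sketched as below. This is the common textbook form with a log2 position discount and linear gain; the exact variant and cut-off used by the track organizers is not stated here, so the details should be treated as assumptions.</p>
      <preformat><![CDATA[
```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """DCG of the submitted ranking divided by the DCG of the ideal ordering.

    `relevances` lists the gold relevance of each tweet in ranked order;
    `k` optionally truncates the evaluation to the top k positions.
    """
    ranked = list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```
]]></preformat>
      <p>A ranking that places every fact-checkable tweet ahead of every other tweet obtains an NDCG of 1, and the score decreases as relevant tweets appear lower in the list.</p>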
      <p>On further exploration, we find two minor issues in the automatic approach
described in Section 4.3:</p>
      <p>To overcome the above-mentioned issues, we suggest imposing an upper bound on
the PROPN and NUM counts. To compute each individual score,
we take the minimum of the count (PROPN or NUM) and the upper bound, and divide it
by the upper bound to map the score between 0 and 1; the final score is the average of the two. Further exploration is needed to
find a suitable value for the upper bound. These shortcomings remain to be addressed in future work.</p>
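      <p>The suggested bounded score can be sketched as follows. The parameter name cap stands for the upper bound, whose value is left open above, and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
def capped_factuality_score(n_propn, n_num, cap):
    """Average of the PROPN and NUM counts after clipping each at `cap`.

    Clipping means that a single tweet with an unusually large count no
    longer compresses the scores of all other tweets.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    propn_part = min(n_propn, cap) / cap
    num_part = min(n_num, cap) / cap
    return 0.5 * (propn_part + num_part)
```
]]></preformat>
      <p>With this form, any tweet whose two counts both reach the bound receives the maximum score of 1, regardless of how far the counts exceed it.</p>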
      <p>
        We also aim to extend the model and make it more efficient by using
techniques we did not explore, such as other features like TF-IDF [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] scores
of words, combined with the features we already use. Furthermore, knowledge-based
classification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] can also be explored.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Basu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the FIRE 2018 track: Information Retrieval from Microblogs during Disasters (IRMiDis)</article-title>
          .
          <source>In: Proceedings of FIRE 2018 - Forum for Information Retrieval Evaluation (December</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deshpande</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palshikar</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Athiappan</surname>
          </string-name>
          , G.:
          <article-title>Unsupervised approach to sentence classification (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>CoRR abs/1607</source>
          .01759 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Lei</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.Z.</surname>
          </string-name>
          :
          <article-title>Empirical evaluation of RNN architectures on sentence classification task</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ramos</surname>
          </string-name>
          , J.:
          <article-title>Using tf-idf to determine word relevance in document queries</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knaup</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning and knowledge-based sentiment analysis in movie review key excerpts</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Baselines and bigrams: Simple, good sentiment and topic classification (</article-title>
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          , LeCun, Y.:
          <article-title>Character-level convolutional networks for text classification</article-title>
          .
          <source>CoRR abs/1509</source>
          .01626 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>