<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DL-TXST FakeNews: Enhancing Tweet Content Classification with Adapted Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhieddine Shebaro</string-name>
          <email>m.shebaro@txstate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Oliver</string-name>
          <email>jasonoliver@txstate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomiwa Olarewaju</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jelena Tešić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Texas State University</institution>
          ,
          <addr-line>San Marcos TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The DL-TXST team’s runs submitted to this year’s MediaEval FakeNews task focused on improving the baseline benchmark’s pre-processing and modeling. We introduced features learned from large, adapted language models. The predictive power of our pipeline was strongest when we included a BERT model tuned to Tweet content, which achieved an MCC of 0.1 on the Subtask 1 test set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        With today’s technology, breaking news, from the latest celebrity gossip to updates on unprecedented events like the COVID-19 pandemic, is available with just a few taps on a smartphone. As the availability and volume of readily accessible information has grown, so has the spread of misinformation. Fake news is specifically designed to plant seeds of mistrust and to exacerbate existing social and cultural divides by misusing political, regional, and religious undercurrents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. “In 2019, 8 percent of engagement with the 100 top-performing news sources on social media was dubious. In 2020, that number more than doubled to 17 percent” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Twitter advertises itself to the public as a platform that “uniquely provides its users the opportunity to discover what's happening in the world” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. What is happening, unfortunately, includes what is fake, and the platform has become an easy target for the rapid dissemination of skewed facts, as seen with the attribution of the current COVID-19 pandemic to novel 5G technology. Topical automated classification systems with potent predictive power across innumerable conspiracy theories are urgently needed to curb the spread of inaccurate news. In this paper, we focus on content-based fake news detection strategies.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
      <p>
        Misinformation on social media affects every user of a social media platform. These users, as well as the private companies that run the platforms, have a vested interest in ensuring that the information on the platform is beneficial to the consumer (the users). For most users, this means information that is accurate and can be trusted as valid. For example, rumors have circulated in the past about McDonald’s use of worm filler in its food, triggering tremendous boycott threats [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 DATA MANAGEMENT</title>
      <p>
        The most recent data is collected from MediaEval’s FakeNews: Coronavirus and 5G Conspiracy benchmark project [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and is integrated with the data from our previous analysis and work, which was retrieved using the Twitter API. We applied several pre-processing methods to the data. First, we used the baseline pre-processing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which included converting text to lowercase; removing punctuation; preserving URLs; removing stop words; and normalizing terms (e.g., “u.k” to “UK”). Our pre-processing enhancements to the pipeline this year include removing usernames (Twitter handles); removing all special characters; removing hashtags; expanding contractions (e.g., converting “won’t” to “will” and “not”); removing non-English Tweets if present; removing links (not only “https://t.co/” but also “http” and “www” prefixes); and removing emojis. In the datasets for Subtasks 2 and 3, each Tweet was divided into several parts, each stored in a separate column; we merged them into a single data-frame column, separated by spaces. The validation size was set to 0.2 to partition the dataset for evaluating our model’s predictive power according to a set of predefined metrics.
      </p>
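      <p>The enhanced cleaning steps above can be sketched in a few lines (a minimal Python sketch; the clean_tweet helper, the small contraction table, and the exact regular expressions are illustrative assumptions, not the benchmark code):</p>

```python
import re

# Illustrative contraction table; a full pipeline would use a larger map.
CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not"}

def clean_tweet(text: str) -> str:
    """Sketch of the enhancements: lowercase, expand contractions, then
    strip usernames, hashtags, links, and remaining special characters."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"@\w+", " ", text)                     # usernames (handles)
    text = re.sub(r"#\w+", " ", text)                     # hashtags
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)  # links, incl. t.co
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # special chars, emojis
    return " ".join(text.split())                         # collapse whitespace
```

      <p>For instance, clean_tweet("I won't listen to www.fake.news") yields "i will not listen to".</p>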
    </sec>
    <sec id="sec-4">
      <title>4 EXPERIMENTS</title>
      <p>4.1 Subtask 1. The objective is to build a multi-class classifier that flags whether a Tweet promotes, supports, or discusses at least one (or more) of the conspiracy theories.
Pre-Processing. Links that contain or start with “https://t.co/” are removed, but links such as those beginning with “http” and “www” remain even after applying the baseline normalization. Username handles are also not filtered out.
Data Integration. Combining two datasets requires that they have the same dimensions as well as consistent, meaningful class labels. We observed discrepancies between the two datasets that would impede the integration process. For this reason, before integrating, we carefully selected class labels from the fine-grained classification that would make sense in the new dataset. We swapped class labels 1 and 3 (replacing 1 with 3 and 3 with 1). We also determined that label 2 is irrelevant in the new context, so we excluded all tuples with that label. To form a uniform dataset with a uniform number of dimensions, we extracted only the “Tweet” and “Label” dimensions from the old dataset, rendering it integrable with the new one (no missing Tweets were detected). Before fusion, the old dataset contained 5,946 rows; after integration and removal of rows labeled 2, the combined dataset contained 6,769 tuples.</p>
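      <p>The label harmonization described above amounts to a small transformation over (Tweet, Label) tuples (a plain-Python sketch; the integrate helper and the tuple layout are our illustrative assumptions):</p>

```python
def integrate(old_rows, new_rows):
    """Swap labels 1 and 3 in the old dataset, drop label-2 tuples,
    and append the result to the new dataset as (tweet, label) tuples."""
    swap = {1: 3, 3: 1}
    merged = list(new_rows)
    for tweet, label in old_rows:
        if label == 2:  # label 2 is irrelevant in the new context
            continue
        merged.append((tweet, swap.get(label, label)))
    return merged
```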
      <p>
        Modeling. We chose Logistic Regression as the baseline model because it performed best in the control experiment [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, we applied some hyperparameter tuning to adjust it to our new, fully integrated dataset. For example, we set the class-weight attribute to 1: 0.1, 2: 0.7, 3: 0.2. We also increased the maximum number of iterations from 2000 to 4000 because the model sometimes failed to converge. We kept the same feature-extraction technique (CountVectorizer) and used spaCy to tokenize the text. In addition, the test size was set to 0.2 to partition the dataset for evaluating the model’s predictive power according to a set of predefined metrics. We used a voting classifier to combine several models, selected on the basis of similar related work on Tweets [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: SVC, Multinomial NB, Logistic Regression, and Random Forest. The voting type was set to “hard.”
BERT for Tweets. “BERT-large was trained on 64 TPU chips for four days at an estimated cost of $7,000” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Selecting a pretrained model is a crucial step when fitting BERT. We initially used the pretrained model offered by Google (BERT-Large, Uncased) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and obtained dismal results. As it turns out, that model was trained on standard written English, whereas the structure and nature of Tweets differ markedly from other text. For this reason, we searched for a Tweet-specific pretrained model and found BERTweet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We based our code, with some modifications, on similar work already done on Kaggle for disaster Tweets [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which uses the BERTweet pretrained model from VinAI. For example, shifting our class labels (1 to 0, 2 to 1, and 3 to 2) was required for BERTweet to work. We kept the same hyperparameters (5 epochs and a batch size of 8) and changed only the num_classes parameter.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4.2 Subtask 2 &amp; 3</title>
      <p>Encoding &amp; Decoding. Since the data contains multiple binary target variables, inferring more than one dependent variable at once was beyond the models’ capabilities. We therefore encoded every occurring combination of binary target variables as a single target variable. For example, the combination 0,0,0,0,0,0,0,0,0 occurs 754 times and was encoded as 0. To simultaneously reduce the number of class labels and improve generalization, we applied a threshold to remove any rare combination occurring fewer times than the threshold. We found that a threshold of 20 captured the most frequent combinations; a total of 10 encodings (class labels) were produced after applying it. When the model predicts label 3, the decoding step translates it back into 0,0,0,0,0,0,1,0,0. The same experiments were applied to Subtasks 2 and 3, except for the data integration, since the class labels of the old dataset are irrelevant in this context.</p>
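      <p>The encoding scheme can be sketched as follows (a plain-Python sketch; the build_codebook helper and the row layout are illustrative assumptions):</p>

```python
from collections import Counter

def build_codebook(rows, threshold=20):
    """Map each combination of binary targets that occurs at least
    `threshold` times to a single integer class label, and back."""
    counts = Counter(tuple(r) for r in rows)
    frequent = [combo for combo, n in counts.most_common() if n >= threshold]
    encode = {combo: i for i, combo in enumerate(frequent)}
    decode = {i: combo for combo, i in encode.items()}
    return encode, decode
```

      <p>A predicted class label is then decoded back into its binary combination via the decode mapping.</p>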
    </sec>
    <sec id="sec-6">
      <title>5 RESULTS</title>
      <p>BERTweet outperforms all other models on multiple metrics on the validation set for Subtask 1, as illustrated in Figure 2.</p>
    </sec>
    <sec id="sec-7">
      <title>6 CONCLUSION</title>
      <p>Tweet content normalization techniques improve the predictive power of the pipeline. BERTweet was significantly better at predicting the Subtask 1 data, with an MCC of 0.106. The new normalizations combined with Logistic Regression performed best in both Subtasks 2 and 3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Claire Wardle and Hossein Derakhshan. Information Disorder: Toward an Interdisciplinary Framework for Research and Policy Making. Council of Europe report DGI(2017)09. Strasbourg, France: Council of Europe, 2017.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Kate Taylor. “A viral rumor that McDonald's uses ground worm filler in burgers has been debunked.” Business Insider. https://www.businessinsider.com/debunked-mcdonalds-usesworm-filler-2016-1 (accessed September 19, 2021).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Emily Stewart. “America's growing fake news problem, in one chart.” Vox. https://www.vox.com/policy-and-politics/2020/12/22/22195488/fake-news-social-media-2020 (accessed September 19, 2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Cartier Stennis. “Defining what makes Twitter's audience unique.” Twitter Blog. https://blog.twitter.com/en_us/topics/insights/2018/defining-what-makes-twitters-audience-unique (accessed September 19, 2021).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Andrew Magill, Lia Nogueira De Moura, Maria Tomasso, Mirna Elizondo, and Jelena Tešić. “Enriching Content Analysis of Tweets Using Community Discovery Graph Analysis.” Proc. of the MediaEval 2020 Workshop.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Konstantin Pogorelov, Daniel Thilo Schroeder, Stefan Brenner, and Johannes Langguth. FakeNews: Corona Virus and Conspiracies Multimedia Analysis Subtask at MediaEval 2021. Proc. of the MediaEval 2021 Workshop, Online, 13-15 December 2021.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Ankit and Nabizath Saleena. 2018. An Ensemble Classification System for Twitter Sentiment Analysis. Procedia Computer Science 132, 937-946. doi:10.1016/j.procs.2018.05.109.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. “Green AI.” https://dl.acm.org/doi/fullHtml/10.1145/3381831 (accessed October 20, 2021).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Jacob Devlin. “Google Research / BERT.” GitHub. https://github.com/google-research/bert (accessed October 20, 2021).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. “BERTweet: A pre-trained language model for English Tweets.” ACL Anthology. https://aclanthology.org/2020.emnlp-demos.2.pdf (accessed October 20, 2021).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Matthias Bachfischer. “Disaster Tweets - BERTweet.” Kaggle. https://www.kaggle.com/matthiasbachfischer/disastertweets-bertweet (accessed October 20, 2021).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner, and Johannes Langguth. WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets. Proc. of the 2021 Workshop on Open Challenges in Online Social Networks, pp. 21-25, 2021.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>