<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enriching Content Analysis of Tweets Using Community Discovery Graph Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrew Magill</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lia Nogueira De Moura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Tomasso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirna Elizondo</string-name>
          <email>mirnaelizondo@txstate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jelena Tešić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Texas State University</institution>
          ,
          <addr-line>San Marcos TX</addr-line>
          <country country="US">U.S.A</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the solutions proposed by the Texas State University Data Lab team for the MediaEval 2020 FakeNews benchmark. We responded to the text-based and structure-based tasks with lexical, graph, and community-labeling approaches. Our lexical analysis approach using logistic regression produces the best results at 0.43 MCC; we also describe a promising community-labeling model and discuss our attempts to find predictive structural features in retweet graphs of conspiracy-promoting tweets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>Our analysis assumes that people in the same social network
community who agree on fake news also write in a similar style, discuss
similar topics, produce similar content, and share similar values.
We relate the content of tweets using lexical analysis, discover
communities by building a network of re-tweets, and apply
network analysis to the structural data provided.</p>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
      <p>
        Analysis of tweet content spans from the use of Bag-of-Words
features in classification models to capture the terms most likely associated
with fake news [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to lexical analysis that characterises the writing
style of fake news articles [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Community-based modeling of social
networks that leverages the spread of information in social media
through re-tweets and comments has been shown to improve
NLP-based modeling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and whole-graph clustering has shown promise
for community identification at large scale [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Structural modeling of social networks is still in its
infancy; a promising direction applies deep neural network classification
of URLs as fake or trusted based on their propagation patterns [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 APPROACH</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 Text-Based Misinformation Detection</title>
      <p>Twitter restricts tweet content to 280 characters, a constraint that
encourages a writing style that differs from that found in most
corpora. To achieve brevity, users employ a lexicon that includes
abbreviations, colloquialisms, hashtags, and emoticons. Tweets may
also contain frequent misspellings. The context of a tweet is richer
still, as it resides in a network of retweets and replies. To this
end, we employ lexical analysis and community analysis to capture
tweet content and context, respectively.</p>
      <p>
        The Lexical Analysis Pipeline implements the transformation of
Twitter content, feature extraction, and modeling to make predictions
for the NLP-based task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We analysed the efficacy of common
preprocessing techniques and tokenization patterns to extract the most
useful features for prediction. Effective preprocessing techniques
included transforming text to lowercase, removing terms common
to all classes (stopwords), removing punctuation, preserving
URLs, and normalizing several specific terms ('u.k.' to 'uk').
Tokenization patterns that preserved and uniquely encoded emoticons
and punctuation did not improve predictive performance.
Normalizing the text with stemming and lemmatization, methods that map
similar but distinct terms to the same encoding, produced mixed
results. We judged it more beneficial to preserve the
distinctive characteristics of the text, and we left these
processes out of the final pipeline.
      </p>
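      <p>A minimal sketch of the preprocessing step described above, in plain Python; the stopword list and normalization map shown here are illustrative assumptions, not the full lists used in the pipeline:</p>

```python
import re

# Illustrative stopword list and normalization map (assumptions, not the paper's full lists).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "at", "is"}
NORMALIZE = {"u.k.": "uk"}

URL_RE = re.compile(r"https?://\S+")

def preprocess(tweet):
    """Lowercase, normalize specific terms, preserve URLs, strip punctuation, drop stopwords."""
    text = tweet.lower()
    for src, dst in NORMALIZE.items():
        text = text.replace(src, dst)
    urls = URL_RE.findall(text)             # preserve URLs verbatim
    text = URL_RE.sub(" ", text)
    text = re.sub(r"[^\w\s#@]", " ", text)  # drop punctuation, keep hashtags and mentions
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return tokens + urls

print(preprocess("Check the U.K. report at https://example.com/report today"))
# → ['check', 'uk', 'report', 'today', 'https://example.com/report']
```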
      <p>Feature extraction in text can be accomplished by encoding
terms as vectors representing either the occurrence of terms in the text
(Bag-of-Words) or the importance of terms to a document within a corpus
(TF-IDF). We employed a TF-IDF vectorizer to reduce the impact of
terms repeated within individual tweets, but did not see
improvements over the Bag-of-Words model in our pipeline. We trained a
set of classifiers (Naive Bayes, SVC, Random Forest, and Logistic
Regression) on the extracted feature vectors, and analysed
performance on both the fine-grained four-class and coarse-grained two-class
classification sub-tasks. To account for class imbalance, we
experimented with data augmentation, generating fake tweets
using the most predictive or most common terms for each class.
This approach led to overfitting in most classifiers, and we
dropped it early on. We extended the feature set
using Optical Character Recognition (OCR) of images embedded in tweets.
We also adjusted class weights to account for imbalanced data
where possible. Logistic regression showed superior performance
in all our test runs, and our submission includes the runs
labeled LR (Logistic Regression) and LR-OCR (Logistic Regression
with OCR text augmentation) in Table 1.</p>
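      <p>The TF-IDF weighting can be sketched in a few lines of plain Python. This is a toy version for illustration; a production vectorizer such as scikit-learn's adds smoothing and normalization:</p>

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: term frequency scaled by log inverse document frequency."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["5g", "towers", "cause", "virus"],
        ["virus", "spreads", "naturally"]]
vecs = tfidf(docs)
# "virus" appears in every document, so its weight is zero; rarer terms score higher
assert vecs[0]["virus"] == 0.0 and vecs[0]["5g"] > 0
```

Terms shared by every tweet carry no discriminative weight under this scheme, which is exactly why it can dampen the very common terms that dominate raw Bag-of-Words counts.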
      <sec id="sec-4-1">
        <title>Community Analysis Pipeline</title>
        <p>
          The Community Analysis Pipeline applies community finding work [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to assign labels from discovered communities in large social networks.
We extend the provided dataset with an auxiliary dataset that
contains tweets related to the hashtags #Coronavirus, #Covid19, and
#Covid-19, collected from March to September 2020, covering over 3.2
million users and 8 million tweets [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. First, we create three different
networks from the raw data: User Connections, from the provided data,
where each vertex is a user and each edge is a connection between two users
by a retweet, quote, reply, or mention; Hashtag Connections,
from the provided data, where each vertex is a hashtag and an edge
exists between two hashtags if they were used together in the same
tweet; and User Connections 8M, a network created from the provided
data and the auxiliary dataset of over 8M tweets, with vertices
and edges created the same way as in the User
Connections network. Next, we extract the degree of connectivity for
each of the provided conspiracy labels (5G, non, and other), driven
by the observation that well-connected vertices tend to have
similar content. We employ the Louvain community discovery method to
discover communities in all three networks, and apply the
information from each analyzed network to specific tweets [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We labeled
each community with one of the three conspiracy categories (5G,
non, other), based on the majority of tweet labels in that community.
For example, if a community contains more 5G labels
than non or other labels, the 5G label is assigned
to the unlabeled tweets in that community. These assignments
were based on the combination of communities found in all
three networks. Tweets that did not belong to any community, or that
belonged to a community with tweets originating strictly from the
test dataset, were assigned based on their degree of connectivity,
and the remaining tweets were assigned the label Unknown. Many unknowns
occurred because a large number of tweets had no
connections to other users in the given datasets (no retweets, replies,
quotes, mentions, or hashtags). The community discovery approach
can be useful for datasets where the users are well connected to
each other, as shown in the runs labeled CL in Table 1. The Fusion Run
combining both methods is labeled LR-CL in Table 1: it implements a simple
fusion algorithm in which the CL label is used for all tweets whose LR
confidence is under a certain threshold. Details are described in Sec. 4.
        </p>
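        <p>The majority-vote labeling step can be sketched as follows, assuming communities have already been discovered (e.g., via Louvain); the tweet ids and labels below are purely illustrative:</p>

```python
from collections import Counter

def label_communities(communities, known_labels):
    """Assign each unlabeled tweet the majority conspiracy label of its
    community; communities with no labeled tweets fall back to 'Unknown'."""
    assigned = {}
    for community in communities:
        votes = Counter(known_labels[t] for t in community if t in known_labels)
        majority = votes.most_common(1)[0][0] if votes else "Unknown"
        for t in community:
            assigned[t] = known_labels.get(t, majority)  # keep known labels as-is
    return assigned

# toy example: two communities, with labels known for some tweets only
communities = [{1, 2, 3, 6}, {4, 5}]
known = {1: "5G", 2: "5G", 3: "other"}
labels = label_communities(communities, known)
assert labels[6] == "5G"       # unlabeled tweet inherits its community's majority
assert labels[4] == "Unknown"  # community with no labeled tweets stays Unknown
```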
      </sec>
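      <p>The LR-CL fusion rule reduces to a single comparison; the threshold value here is an illustrative assumption, not the one used in the submitted run:</p>

```python
def fuse(lr_label, lr_confidence, cl_label, threshold=0.5):
    """LR-CL fusion: keep the logistic-regression label unless its
    confidence falls below the threshold, then use the community label."""
    return lr_label if lr_confidence >= threshold else cl_label

assert fuse("5G", 0.9, "other") == "5G"     # confident LR prediction kept
assert fuse("5G", 0.2, "other") == "other"  # low confidence: community label used
```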
    </sec>
    <sec id="sec-5">
      <title>3.2 Structure-Based Misinformation Detection</title>
      <p>Standard summary statistics for graphs are used here as feature
vectors. We extract a 19-dimensional feature vector from each
provided graph adjacency matrix using the 'igraph' R package, where the
dimensions represent: number of nodes, number of edges,
graph diameter, mean distance, edge density, reciprocity, global
transitivity, local transitivity, number of triangles, mean in-degree,
maximum in-degree, minimum in-degree, mean out-degree,
maximum out-degree, minimum out-degree, mean total degree, maximum
total degree, and minimum total degree. Feature vectors are
normalized and fed into a series of Python scikit-learn classifiers: (1)
a decision tree with no maximum depth and the Gini impurity or
entropy criterion; (2) linear discriminant analysis (LDA) with the SVD,
LSQR, and LSQR-plus-shrinkage solvers; and (3) Naive Bayes. An 80/20
train/test split of the development data was used to train the models.
Each classifier was trained for both coarse classification and four-way fine
classification.</p>
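      <p>The paper computes these statistics with the R 'igraph' package; as an illustration, a few of the degree-based dimensions can be read directly off a directed adjacency matrix (a partial sketch, not the full 19-dimensional vector):</p>

```python
def degree_features(adj):
    """A subset of the summary statistics, computed from a directed
    0/1 adjacency matrix: node/edge counts, density, and degree extremes."""
    n = len(adj)
    out_deg = [sum(row) for row in adj]                        # edges leaving each node
    in_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]  # edges entering each node
    m = sum(out_deg)
    return {
        "nodes": n,
        "edges": m,
        "edge_density": m / (n * (n - 1)) if n > 1 else 0.0,
        "mean_in_degree": sum(in_deg) / n,
        "max_in_degree": max(in_deg),
        "min_out_degree": min(out_deg),
    }

# toy directed retweet graph on 3 nodes
adj = [[0, 1, 1],
       [0, 0, 1],
       [0, 0, 0]]
feats = degree_features(adj)
assert feats["edges"] == 3 and feats["edge_density"] == 0.5
```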
    </sec>
    <sec id="sec-6">
      <title>4 RESULTS AND ANALYSIS</title>
      <p>Text-Based Misinformation Detection: for each of the
coarse-grained and fine-grained classifications, the team submitted
one run of predictions from our community labeling model, two
runs from our lexical analysis pipeline, and one run combining
the two approaches, for a total of eight sets of predictions. Table 1
summarizes our returned test set results and our own evaluations
on the development set using the provided ground truth labels.
The Lexical Analysis Pipeline using logistic regression produces the
highest MCC for both classification sub-tasks. Note that OCR text
augmentation does not improve the MCC on the test set, even though it
showed improvements on the multi-class sub-task for the development set.</p>
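      <p>MCC, the benchmark's evaluation metric, can be computed from a binary confusion matrix as follows; the counts in the example are illustrative, not the submission's:</p>

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for binary classification;
    ranges from -1 to 1, with 0 for chance-level performance."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# illustrative confusion counts, not the paper's results
assert round(mcc(tp=40, tn=35, fp=15, fn=10), 2) == 0.5
assert mcc(tp=25, tn=25, fp=25, fn=25) == 0.0  # random-chance predictions score 0
```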
      <sec id="sec-6-1">
        <title>Community Analysis Pipeline</title>
        <p>The community-labeling-only run does not achieve a high MCC on
the test set, but it does contribute to a comparable MCC on the test
set (0.363), higher precision, and comparable recall and accuracy on
the development set. We ran out of time to implement a meaningful
fusion of the lexical and community runs. Internal analysis
showed that the number of tweets isolated from the network
degrades the performance of the community-based approach.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Structure-Based Misinformation Detection</title>
        <p>Structure-based runs are reported
in Table 2, where the highest overall MCC on the test set was 0.0327,
for the LDA-based coarse classifier. There is not enough information
to mine in the provided structure data to capture meaningful
relations, as some of the runs produce an MCC lower than random chance. Note
that the proposed community-based network alone produces higher MCC
scores (Table 1), as it takes into account hashtags, retweets,
and community discovery over a larger corpus.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5 DISCUSSION AND OUTLOOK</title>
      <p>
        Lexical-based analysis produced the highest MCC score on the test
set. Our community discovery method showed some promise in
the fusion approach, with increased precision. Community-based
and structure-based methods will likely contribute more if we
consider conspiracy vs. non-conspiracy classification, as recent work
has shown different dispersion patterns regardless of the
conspiracy topic [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Next steps are a refined fusion approach and the use of
community-based scores as features in the structure-based approach.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Indra</surname>
          </string-name>
          et al.
          <year>2016</year>
          .
          <article-title>Using logistic regression method to classify tweets into the selected topics</article-title>
          .
          <source>In Intl. Conf. on Advanced Computer Science and Information Systems (ICACSIS)</source>
          . IEEE, NY,
          <fpage>385</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Monti</surname>
          </string-name>
          et al.
          <year>2019</year>
          .
          <article-title>Fake News Detection on Social Media using Geometric Deep Learning</article-title>
          . (
          <year>2019</year>
          ). arXiv:1902.06673 [cs.SI]
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Vosoughi</surname>
          </string-name>
          et al.
          <year>2018</year>
          .
          <article-title>The spread of true and false news online</article-title>
          .
          <source>Science</source>
          <volume>359</volume>
          ,
          <issue>6380</issue>
          (
          <year>2018</year>
          ),
          <fpage>1146</fpage>
          -
          <lpage>1151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Magill</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maria</given-names>
            <surname>Tomasso</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Fake News Twitter Data Analysis</article-title>
          . https://github.com/DataLab12/fakenews. (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Lia</given-names>
            <surname>Nogueira</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Social network analysis at scale: Graph-based analysis of Twitter trends and communities</article-title>
          .
          <source>Master's thesis</source>
          . Texas State University. https://digital.library.txstate.edu/handle/10877/12933
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zhou</surname>
          </string-name>
          and
          <string-name>
            <surname>Zafarani</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Fake News Detection: An Interdisciplinary Research</article-title>
          .
          <source>In WWW Proceedings. ACM, NY</source>
          ,
          <fpage>1292</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>