<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Conspiracy Tweets Using Support Vector Machines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manfred Moosleitner</string-name>
          <email>manfred.moosleitner@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Murauer</string-name>
          <email>b.murauer@posteo.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Günther Specht</string-name>
          <email>guenther.specht@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universität Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper summarizes the contribution of our team UIBK-DBISFAKENEWS to the task “FakeNews: Corona virus and 5G conspiracy” as part of MediaEval 2020. The goal for this task is to classify tweets as “5G corona virus conspiracy”, “other conspiracy”, or “non conspiracy”, based on text analysis and based on the retweet graphs. We achieved our best results using a calibrated linear SVM with word and character n-grams for the text classification task and a non-calibrated linear SVM with graph statistics for the graph classification task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The main objective of the task is to classify tweets
as either (1) contributing to a conspiracy suggesting that the
5G network technology caused the SARS-CoV-2 virus epidemic,
(2) contributing to a different conspiracy, or (3) not contributing to
a conspiracy. For the first subtask, this classification is based on
the text content of the tweets. The second subtask focuses on the
retweet and follower graph of the tweets. A detailed description
and the results of the challenge can be found in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; the collection
of the data is described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>In the remainder of this overview, we present our solutions for
the two subtasks in Section 2 and discuss the results
in Section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY</title>
      <p>In both subtasks, participants may submit five different
solutions. The first two solutions of each subtask are restricted
to using only part of the available information; the remaining three
submissions may also use external data.</p>
    </sec>
    <sec id="sec-3">
      <title>Subtask 1: Twitter Messages</title>
      <p>
        We extract character- and word-based n-grams from the text of
the tweets and use them as features for our classification models.
This approach has been shown to be effective and versatile in different text
classification tasks, ranging from stance detection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to classifying
hacked Twitter accounts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We tested different parameters in a grid
search; the tested values are listed in Table 1.
      </p>
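      <p>As a minimal sketch of the feature extraction described above, the following standard-library Python code counts word and character n-grams. The function names, the whitespace tokenization, and the lower-casing are illustrative assumptions and are not taken from the original implementation; the default sizes mirror the best configuration reported in Section 3.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (assumed: whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, word_sizes=(1,), char_sizes=(3, 4)):
    """Count word ("w") and character ("c") n-grams in a single bag of features."""
    counts = Counter()
    for n in word_sizes:
        counts.update(("w", g) for g in word_ngrams(text, n))
    for n in char_sizes:
        counts.update(("c", g) for g in char_ngrams(text.lower(), n))
    return counts


features = ngram_counts("5G towers spread corona")
```
]]></preformat>
      <p>Keeping word and character n-grams in one bag, tagged by type, lets a single linear model weigh both kinds of evidence.</p>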
      <p>Submission 2 may include additional information, so we added
all features that were included in the JSON structure, which
correspond to the fields available from Twitter’s API. We transformed
all textual features to tf-idf-normalized frequencies of n-grams,
as listed in Table 1, left the numeric features as-is, and
mapped all categorical features to one-hot vectors.</p>
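      <p>The two transformations can be sketched as follows. This is an illustration under stated assumptions rather than the original code: the smoothed idf variant, the L2 normalization, and the function names are choices made here for the example.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one L2-normalized {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed idf variant (smoothed): ln((1 + N) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def one_hot(values):
    """Map categorical values to one-hot lists over the sorted set of categories."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


vecs = tfidf([["5g", "corona"], ["5g", "tower"]])
langs = one_hot(["en", "de", "en"])
```
]]></preformat>
      <p>Terms that occur in fewer documents, such as "corona" in the toy input, receive a higher weight than terms shared by all documents.</p>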
      <p>We included two additional features that were not in the JSON
files directly. Firstly, we crawled all URLs that were included in
the messages and extracted the content of each site’s &lt;title&gt; tag,
hoping that it would contain a distinctive vocabulary. Secondly, we
used the free OCR software Tesseract<sup>2</sup> to find any text within the
images that are included in the messages.</p>
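      <p>Extracting the title from an already downloaded page could look like the sketch below, which uses Python's standard html.parser module. The class and function names are hypothetical, and the crawling step itself (fetching the URL, handling redirects and errors) is deliberately left out.</p>
      <preformat><![CDATA[
```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or "" if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.title_parts).split())


page = "<html><head><title> 5G  and health </title></head><body>text</body></html>"
title = extract_title(page)
```
]]></preformat>
      <p>The extracted title can then be treated like any other textual feature, for example by passing it through the same n-gram extraction as the tweet text.</p>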
      <p>
        We tested linear support vector machines (SVMs) and extremely randomized
trees (extra trees) as classifiers, and also added the option of calibrating the SVM
using Platt’s method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These classifiers have been well-studied
and perform well in diverse text classification tasks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and can
compete with neural-network-based approaches in many fields like
spam detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
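      <p>Platt's method maps the raw SVM decision score to a probability through a fitted sigmoid. The sketch below is a simplified illustration, not the procedure used for the submissions: it fits the two sigmoid parameters by plain gradient descent on the log loss, omits the regularized targets of the original method, and writes the sigmoid with the opposite sign convention to Platt's 1 / (1 + exp(A f + B)). All names and the toy data are assumptions.</p>
      <preformat><![CDATA[
```python
import math


def platt_probability(score, a, b):
    """Sigmoid mapping from a decision score to p(y = 1 | score)."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit a and b by gradient descent on the mean log loss (simplified Platt scaling)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = platt_probability(s, a, b)
            grad_a += (p - y) * s  # derivative of the log loss w.r.t. a
            grad_b += (p - y)      # derivative of the log loss w.r.t. b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy decision scores with binary labels (not separable, so the fit stays finite).
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
```
]]></preformat>
      <p>In a multi-class setting such as this task, one sigmoid would typically be fitted per class on held-out scores and the resulting probabilities normalized.</p>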
    </sec>
    <sec id="sec-4">
      <title>Subtask 2: Retweet-Follower-Graphs</title>
      <p>
        Standard graph statistics like the number of nodes or the node
degrees are known to capture characteristics of the retweet graph
that help in classification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Also, algorithms like HITS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
PageRank [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] could produce discriminating features, as they were
used on retweet graphs by Yang et al. in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to distinguish between
tweets that are interesting only to a small group of people and those that interest a
broader audience. Thus, we used the Python network analysis
package NetworkX<sup>3</sup> to extract statistical figures describing the
retweet-follower graphs. For the first run of the second subtask, we
calculated order, size, degree, indegree, outdegree, number of connected
components, density, transitivity, PageRank, HITS (hubs, authorities),
number of partitions, planarity, and number of cycles, and combined
them into a single feature vector.
      </p>
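      <p>To make a few of these statistics concrete, the following sketch computes order, size, density, and the in- and out-degree lists of a directed graph given as an edge list. It is a standard-library stand-in written for illustration; the submissions relied on NetworkX, and the function name and edge-list input format are assumptions.</p>
      <preformat><![CDATA[
```python
def graph_statistics(edges):
    """Basic statistics of a directed graph given as (source, target) pairs."""
    nodes = set()
    in_deg, out_deg = {}, {}
    unique_edges = set(edges)  # ignore duplicate edges
    for u, v in unique_edges:
        nodes.update((u, v))
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    order = len(nodes)        # number of nodes
    size = len(unique_edges)  # number of edges
    # Density of a directed graph without self-loops: m / (n * (n - 1)).
    density = size / (order * (order - 1)) if order > 1 else 0.0
    ordered = sorted(nodes)
    return {
        "order": order,
        "size": size,
        "density": density,
        "indegree": [in_deg.get(n, 0) for n in ordered],
        "outdegree": [out_deg.get(n, 0) for n in ordered],
    }


stats = graph_statistics([("a", "b"), ("a", "c"), ("b", "c")])
```
]]></preformat>
      <p>Scalar statistics such as order and density can enter the feature vector directly, whereas the per-node degree lists vary in length from graph to graph.</p>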
      <p>Some of the NetworkX functions that calculate graph
statistics return lists of variable length, as the number of values depends on
the number of nodes and edges. To create fixed-length feature
vectors, we computed the arithmetic mean, the standard deviation, and the
five-number summary of the values in the individual lists, and used
these as features. For the second run in subtask 2, we additionally
used the data from the nodes files, from which we calculated min,
max, mean, and standard deviation of the number of friends and
followers, and added these to the feature vectors calculated for the
first run.
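</p>
      <p><sup>2</sup> https://tesseract-ocr.github.io/</p>
      <p><sup>3</sup> https://networkx.org/</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Parameters tested in the grid search.</p></caption>
        <table>
          <thead>
            <tr><th>Parameter</th><th>Tested values</th></tr>
          </thead>
          <tbody>
            <tr><td>Word &amp; character n-gram size</td><td></td></tr>
            <tr><td>SVM: C</td><td></td></tr>
            <tr><td>Extra Trees: number of trees</td><td></td></tr>
            <tr><td>Poly. degree</td><td></td></tr>
            <tr><td>Poly. include bias</td><td></td></tr>
            <tr><td>KNN: number of neighbors</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <fig id="fig1">
        <label>Figure 1</label>
        <caption><p>Terms with the three highest and lowest coefficients for each output class (classes include “5G corona conspiracy”; terms include “better” and “burning”).</p></caption>
      </fig>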
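      <p>The reduction of a variable-length list to a fixed-length summary, as described for the graph statistics above, can be sketched with Python's statistics module. The choice of the population standard deviation, the inclusive quartile method, and the handling of empty lists are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import statistics


def summarize(values):
    """Mean, standard deviation, and five-number summary as a list of 7 numbers."""
    if not values:
        return [0.0] * 7  # assumed convention for graphs without values
    xs = sorted(values)
    if len(xs) >= 2:
        q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    else:
        q1 = median = q3 = xs[0]
    return [statistics.fmean(xs), statistics.pstdev(xs), xs[0], q1, median, q3, xs[-1]]


summary = summarize([1, 2, 3, 4, 5])
```
]]></preformat>
      <p>Applying such a function to every list-valued statistic and concatenating the results yields feature vectors of equal length for graphs of any size.</p>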
      <p>Since we extracted significantly fewer features in the second
subtask, we added polynomial feature generation, and added a
Gaussian naïve Bayes classifier and a k-nearest neighbors classifier to the
models from the first subtask. Both are well-studied algorithms, and
we were interested in how well they would perform for this task.
We tested several parameters in a grid search, which are displayed
in Table 1.</p>
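      <p>Polynomial feature generation expands a feature vector with all products of its entries up to a given degree. The following sketch illustrates the idea together with the two parameters from the grid search (degree and include bias); the function name and the ordering of the generated terms are assumptions and need not match the library used for the submissions.</p>
      <preformat><![CDATA[
```python
from itertools import combinations_with_replacement


def polynomial_features(x, degree=2, include_bias=True):
    """All monomials of the entries of x up to `degree`, optionally with a constant 1."""
    features = [1.0] if include_bias else []
    for d in range(1, degree + 1):
        # Each index tuple selects the factors of one monomial, e.g. (0, 1) is x0 * x1.
        for idx in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in idx:
                term *= x[i]
            features.append(term)
    return features


expanded = polynomial_features([2.0, 3.0], degree=2)
```
]]></preformat>
      <p>For two input features and degree 2, the expansion contains the bias, both original features, both squares, and the single interaction term, which allows a linear model to capture simple non-linear relationships.</p>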
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND DISCUSSION</title>
      <p>After preliminary experiments for both subtasks, we selected the
configuration with the highest Matthews correlation coefficient (MCC) in a 10-fold cross-validation
as the model that produces our submission results for each
subtask.</p>
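      <p>For reference, the multi-class MCC can be computed from the counts of true and predicted labels, as in the sketch below. The function name and the convention of returning 0 for a zero denominator are assumptions; the cross-validation loop itself is omitted.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient."""
    s = len(y_true)                                        # number of samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)   # correct predictions
    t_k = Counter(y_true)                                  # true count per class
    p_k = Counter(y_pred)                                  # predicted count per class
    classes = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    denominator = math.sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    return numerator / denominator if denominator else 0.0


perfect = mcc(["5g", "other", "non", "other"], ["5g", "other", "non", "other"])
```
]]></preformat>
      <p>A score of 1 indicates perfect agreement, 0 corresponds to a prediction no better than chance, and negative values indicate systematic disagreement, which makes the measure suitable for imbalanced classes.</p>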
    </sec>
    <sec id="sec-6">
      <title>Subtask 1</title>
      <p>The scores displayed in Table 2a show that the SVM model clearly
outperforms the extra trees approach in the first subtask.
Calibrating the SVM increased the performance slightly.</p>
      <p>Interestingly, the performance of the classifiers dropped when
taking more features into account for the second submission. This
indicates that either too many features are extracted from the text,
or that the additional meta-information was not informative for the
problem. Nevertheless, we submitted the two results in this state,
being aware that we could possibly have increased the performance
of the second submission by ignoring the meta-features. The
evaluation results, on the other hand, do not show a performance decrease
between the two submissions: the two runs scored
0.440 and 0.441, respectively. As shown in Table 3, the best results
were obtained by combining word unigrams with character 3- and
4-grams and a strong regularization parameter of C=0.1.</p>
      <p>
        Using a linear SVM as a model allows an easy interpretation of
the importance of words by looking at the respective coefficients.
For each output class, Figure 1 shows the terms with the three
highest and lowest coefficients. The high value for the term 5g
suggests that not many topics within the other conspiracies are
      </p>
    </sec>
    <sec id="sec-6b">
      <title>Subtask 2</title>
      <p>
        Similar to subtask 1, we used grid search to find the best-performing
classifier and parameters. The scores of the classifiers were rather
similar, with the linear SVM producing the best score with the
parameter C=10. Using polynomial features at all increased
the result in both submissions by 0.05, whereas the specific parameters
(degree=[2, 3], include bias=[true, false]) did not have a great
influence (&lt; 0.01 MCC), as shown in Table 3. The training and
evaluation results for subtask 2 were quite low, as displayed
in Table 2b. Interestingly, our MCC validation scores for subtask
2 were lower than the training scores, which is in contrast to
subtask 1, where the validation scores were slightly better
than our training scores.
      </p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION</title>
      <p>Our simple text-based approaches were able to classify the tweets
reliably, and the coefficients of the model give insights into the
most important terms. We suggest that more preprocessing might
further improve these results.</p>
      <p>The simple graph statistics, on the other hand, were not
expressive enough for this task. Here, incorporating more metadata like
the time between the retweets might improve the classification
results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>David R</given-names>
            <surname>Bild</surname>
          </string-name>
          , Yue Liu, Robert P Dick, Z Morley Mao, and Dan S Wallach.
          <article-title>Aggregate characterization of user behavior in twitter and analysis of the retweet graph</article-title>
          .
          <source>ACM Transactions on Internet Technology (TOIT)</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Bourgonje</surname>
          </string-name>
          , Julian Moreno Schneider, and
          <string-name>
            <given-names>Georg</given-names>
            <surname>Rehm</surname>
          </string-name>
          .
          <article-title>From clickbait to fake news detection: an approach based on detecting the stance of headlines to articles</article-title>
          .
          <source>In Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism</source>
          , pages
          <fpage>84</fpage>
          -
          <lpage>89</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jon M</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Hubs, authorities, and communities</article-title>
          .
          <source>ACM computing surveys (CSUR)</source>
          ,
          <volume>31</volume>
          (4es):
          <fpage>5</fpage>
          -
          <lpage>es</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Murauer</surname>
          </string-name>
          , Eva Zangerle, and
          <string-name>
            <given-names>Günther</given-names>
            <surname>Specht</surname>
          </string-name>
          .
          <article-title>A peer-based approach on analyzing hacked twitter accounts</article-title>
          .
          <source>In Proceedings of the 50th Hawaii International Conference on System Sciences</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Octaviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Hari</given-names>
            <surname>Rachmawanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Sari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. Rosal Ignatius Moses</given-names>
            <surname>Setiadi</surname>
          </string-name>
          .
          <article-title>Comparison of multinomial naïve bayes classifier, support vector machine, and recurrent neural network to classify email spams</article-title>
          .
          <source>In 2020 International Seminar on Application for Technology of Information and Communication (iSemantic)</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Page</surname>
          </string-name>
          , Sergey Brin, Rajeev Motwani, and
          <string-name>
            <given-names>Terry</given-names>
            <surname>Winograd</surname>
          </string-name>
          .
          <article-title>The pagerank citation ranking: Bringing order to the web</article-title>
          .
          <source>Technical report</source>
          , Stanford InfoLab,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>John</given-names>
            <surname>Platt</surname>
          </string-name>
          .
          <article-title>Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods</article-title>
          .
          <source>Advanced Large Margin Classifiers</source>
          ,
          <volume>10</volume>
          ,
          <year>June 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          , Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Langguth</surname>
          </string-name>
          .
          <article-title>FakeNews: Corona virus and 5G conspiracy task at MediaEval 2020</article-title>
          .
          <source>In MediaEval 2020 Workshop</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Daniel Thilo</given-names>
            <surname>Schroeder</surname>
          </string-name>
          , Konstantin Pogorelov, and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Langguth</surname>
          </string-name>
          .
          <article-title>Fact: a framework for analysis and capture of twitter graphs</article-title>
          .
          <source>In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS)</source>
          , pages
          <fpage>134</fpage>
          -
          <lpage>141</lpage>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Simon</given-names>
            <surname>Tong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daphne</given-names>
            <surname>Koller</surname>
          </string-name>
          .
          <article-title>Support vector machine active learning with applications to text classification</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>2</volume>
          (Nov):
          <fpage>45</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Min-Chul</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jung-Tae</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Seung-Wook</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hae-Chang</given-names>
            <surname>Rim</surname>
          </string-name>
          .
          <article-title>Finding interesting posts in twitter based on retweet graph analysis</article-title>
          .
          <source>In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>1073</fpage>
          -
          <lpage>1074</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>