<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling Spreaders of Disinformation on Twitter: IKMLab and SoftBank Submission</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timothy Niven</string-name>
          <email>tim.niven.public@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hung-Yu Kao</string-name>
          <email>hykao@mail.ncku.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hsin-Yang Wang</string-name>
          <email>hsinyang.wang@g.softbank.co.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Knowledge Management Lab, National Cheng Kung University</institution>
          ,
          <addr-line>Tainan, Taiwan and AI Strategy Office, SoftBank Corp., Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>The problem we address is classifying whether a Twitter user has spread confirmed disinformation or not. We used two types of features that had validity in the training set: features that indicate thoughtfulness, and features reflecting emotional states. We attempted to capture thoughtfulness via the rate of function word usage and constituency tree features reflecting sentence complexity. We added features for sentiment in general and negative sentiment in particular to measure emotional arousal. We also experimented with custom lexicons for anger and distrust. Our classifier was an ensemble of Support Vector Classifier, Random Forest, and Naive Bayes models. We only considered the English data. Our cross-validated training set accuracy was 89.3%, but the classifier significantly overfit the training data, achieving only 61.0% test set accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The “Profiling spreaders of fake news” task at PAN 2020 [15] supplies 100 tweets for each of
300 users, where half of the users are confirmed to have spread deliberate
disinformation.1 The tweets are almost entirely retweets and shared news headlines. Task
participants are required to develop techniques to separate the spreaders of disinformation
from the controls. The task authors provide English and Spanish datasets to facilitate
multilingual techniques. This is an exciting contribution, as solutions that work across
languages are likely to reflect learning about the task, as opposed to “solving datasets”
(i.e. overfitting the distribution of biases shared across the training and held-out test sets, e.g.
as a result of the data collection process) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <p>
        Regrettably, for our submission we only had a chance to work on the English data.
Our approach was to restrict ourselves to considering how the tweets discussed their
topics, and not what they were talking about. Although the dataset comprises 30,000
tweets, there are only 300 labels.2 We were concerned that a solution such as BERT
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] might focus on keywords that have statistical validity across both train and test, but
nevertheless overfit the dataset as a whole [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We were also interested in finding features
that help us understand the phenomenon. We therefore consider predictive accuracy
important but not all-encompassing, and favour solutions that are interpretable.3
      </p>
      <p>The features we chose are designed to measure thoughtfulness and emotional arousal.
Measurements were made from lexicons and constituency tree parses. For lexical
categories, we counted the frequency of words for each tweet, and then took the mean over
tweets per user. Formally, let the set of words in category c be L<sub>c</sub>. For each user u,
tokenize all 100 tweets, yielding token vectors t<sub>i</sub>(u), i ∈ {1, …, 100}, and compute for each tweet the
proportion of tokens belonging to category c. Then average over all of the user's tweets to get the final
frequency of usage per user:
        <disp-formula><tex-math>f_c(u) = \frac{1}{100} \sum_{i=1}^{100} \frac{\sum_{t \in t_i(u)} \mathbb{1}[t \in L_c]}{|t_i(u)|}.</tex-math></disp-formula>
      </p>
      <p>
Constituency tree features were obtained with the Berkeley Neural Parser [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and will
be detailed below.
      </p>
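      <p>The per-user lexical frequency defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the submission: the function name category_rate, the toy lexicon, the lowercasing of tokens, and the choice to let an empty tweet contribute a rate of zero are all assumptions made for the example.</p>
      <preformat>
```python
from typing import List, Set


def category_rate(tweets: List[List[str]], lexicon: Set[str]) -> float:
    """Mean over tweets of the share of tokens that belong to the lexicon."""
    rates = []
    for tokens in tweets:
        if not tokens:
            # Assumption: an empty tweet contributes a rate of zero.
            rates.append(0.0)
            continue
        # Assumption: matching is case-insensitive.
        hits = sum(1 for tok in tokens if tok.lower() in lexicon)
        rates.append(hits / len(tokens))
    return sum(rates) / len(rates) if rates else 0.0


# Toy data: a hypothetical three-word lexicon and two tokenized tweets.
function_words = {"the", "of", "and"}
user_tweets = [
    ["The", "cost", "of", "living"],  # 2 of 4 tokens match
    ["Breaking", "news", "today"],    # 0 of 3 tokens match
]
rate = category_rate(user_tweets, function_words)
print(rate)  # 0.25
```
      </preformat>
      <p>In the setting described here, each user would contribute 100 tokenized tweets and the lexicon would be one of the word categories, giving one feature value per user and category.</p>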
      <sec id="sec-2-1">
        <title>Thoughtfulness</title>
        <p>We reasoned that people who share more thoughtful content are less likely to spread
disinformation. We developed two sets of features to measure and investigate this
hypothesis: (1) the frequency of function word usage; (2) constituency tree features.</p>
        <p>Function Word Usage Rates. Higher usage rates of function words indicate more
complex sentences, and therefore may be seen to express thinking effort. The training data
shows some support for the existence of a threshold of function word usage rate above
which a user is unlikely to have spread disinformation.</p>
        <p>We use the open-source English function word lists provided by [12], with the
categories personal pronouns, adverbs, and all function words. Distributions of f<sub>c</sub> for
these categories are given in Figure 1. We included all three features in our final
classifier as cross-validation indicated they were useful; however, the adverbs and personal
pronouns are questionable features. The average number of tokens in a tweet after our
tokenization method is just over 13, yet the baseline usage rate is 0.06 for
adverbs and 0.04 for personal pronouns, i.e. fewer than one expected occurrence per tweet.
Therefore, the tweets are not long enough to
yield reliable statistics for these word categories. The complete function words category
is more reliable, with a baseline frequency of 0.33. There appears to be a threshold for
function word usage rates above which the probability of spreading disinformation is
low in this dataset. This threshold appears to be quite high compared to the rest of the
distribution, suggesting this is not a feature with great power to solve this task on its
own.</p>
        <p>[Figure 1: distributions of f<sub>c</sub>; panels: Adverbs, Personal Pronouns, Function Words.]</p>
        <p>2 The labels apply to the 300 authors, each of whom contributes 100 tweets to the dataset.</p>
        <p>3 In this respect, our choice to ensemble with XGBoost was questionable.</p>
        <p>Constituency Tree Features Following the same intuition that sentence complexity
indicates thoughtfulness, we investigated related features from constituency tree parses.
Specifically, we calculated the average branching factor, and highest noun and verb
phrases in each constituency tree. We consider a tweet as a single sentence, which is
reasonable given that most tweets are headlines of news articles. From the constituency
trees we extract the average branching factor, and average maximum noun and verb
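phrases as follows. Let N<sub>i</sub>(u) be the set of internal (non-leaf) nodes in the constituency
tree for tweet i from user u, and let bf(·) count the number of children of a node. Then,
        <disp-formula><tex-math>\overline{\mathrm{bf}}(u) = \frac{1}{100} \sum_{i=1}^{100} \left( \frac{1}{|N_i(u)|} \sum_{n \in N_i(u)} \mathrm{bf}(n) \right).</tex-math></disp-formula>
For noun and verb phrases, let P<sub>i</sub>(u) be the set of sub-trees representing either type of
phrase, and let h<sub>P</sub>(·) select the height of the tallest sub-tree in P<sub>i</sub>(u) of phrase type P.
Then,
        <disp-formula><tex-math>\overline{h}_{\mathrm{NP}}(u) = \frac{1}{100} \sum_{i=1}^{100} h_{\mathrm{NP}}(P_i(u)), \qquad \overline{h}_{\mathrm{VP}}(u) = \frac{1}{100} \sum_{i=1}^{100} h_{\mathrm{VP}}(P_i(u)).</tex-math></disp-formula>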
[Figure 2: distributions over both labels; panel titles include Average Branching Factor and Max NP Height.]
        </p>
        <p>The distributions over both labels are given in Figure 2. Once again the distributions are
not very well separated, but they indicate some weak support for our hypothesis: above a
(fairly high) threshold, a higher average branching factor and taller noun phrases are more
associated with the control group. Interestingly, the pattern is opposite to
what we expected for verb phrases, which requires further investigation.</p>
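        <p>As a rough illustration of the constituency tree features, the sketch below computes the average branching factor and the tallest NP and VP heights for a single bracketed parse. It is not the submission's code: the bracketed input stands in for parser output, the helper names are hypothetical, and two conventions are assumptions, namely that every labelled node (including part-of-speech nodes) counts as an internal node and that a bare word has height 0.</p>
        <preformat>
```python
def parse_bracketed(text):
    """Parse a Penn-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def internal_nodes(tree):
    """Yield every labelled node; words (plain strings) are the leaves."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from internal_nodes(child)


def height(tree):
    """Height of a sub-tree, counting a bare word as height 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max((height(child) for child in tree[1]), default=0)


def avg_branching_factor(tree):
    """Mean number of children over all labelled nodes of one tree."""
    counts = [len(node[1]) for node in internal_nodes(tree)]
    return sum(counts) / len(counts)


def max_phrase_height(tree, label):
    """Height of the tallest sub-tree with the given label, or 0 if none."""
    heights = [height(n) for n in internal_nodes(tree) if n[0] == label]
    return max(heights, default=0)


tree = parse_bracketed(
    "(S (NP (DT The) (NN senator)) (VP (VBD denied) (NP (DT the) (NN report))))"
)
print(avg_branching_factor(tree))     # 13 children over 9 nodes
print(max_phrase_height(tree, "NP"))  # 2
print(max_phrase_height(tree, "VP"))  # 3
```
        </preformat>
        <p>The per-user features would then be the means of these per-tweet values over the user's 100 tweets.</p>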
      </sec>
      <sec id="sec-2-2">
        <title>Emotional Arousal</title>
        <p>We initially investigated emotional arousal using the Linguistic Inquiry and Word Count
(LIWC) dictionary [14]. However, because it is proprietary, we could not package it
into our final software solution. We are also advocates of open science and would like
to investigate alternatives that drive scientific progress and provide resources equally
available to less privileged researchers. Nevertheless, the value of the LIWC dictionary
is beyond doubt, and it still appears stronger than the alternatives we tried.4</p>
        <p>
          Our LIWC analysis indicated the usefulness of: anger, negative emotion, and
anxiety. The finding that anger is useful is consistent with previous work [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Using these
three features, a support vector classifier could achieve 70% cross-validated accuracy
on the training set. We therefore set ourselves the task of finding an equally good open
source lexicon. We did not find one for anxiety, and so skipped that category.
        </p>
        <p>
SentiWordNet. For negative emotions, we settled on SentiWordNet [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Figure 3 shows
the distributions for sentiment (of both kinds) and for negative sentiment. An unexpected
result is that more sentiment is associated with the control group; however, negative
sentiment shows the expected pattern. Again, these features appear to have relatively
weak discriminative power, especially compared to the LIWC lexicon.
        </p>
        <p>[Figure 3: panels: Sentiment, Negative Sentiment.]</p>
        <p>4 One exception is the function word categories, for which the lexicon from [12] appears to
yield very similar distributions. This is consistent with their findings, obtained on a different
task and dataset. Our findings in this respect therefore give extra weight to the claim that this
is a good open source alternative for the LIWC function word lexicon.</p>
        <p>Anger. Of all the LIWC features, anger provided the clearest separation between the
two classes, consistent with our understanding of how much disinformation works.
Anger is known as an emotion that leads to action (REF). Therefore, if
disinformation can bring someone to feel anger and outrage, it can promote its own spread.5 We did
not find an open-source offering that could separate the two classes like LIWC does.6
Given the importance of this feature, our broader goal of finding alternatives to LIWC,
and general interest, we decided to experiment with making one. Our anger dictionary
is far from a thorough or complete work. Our only criterion for success was a distribution
that approximated that of the LIWC anger lexicon on this data. Since we used the training data
to tune the lexicon, it also very likely overfits.</p>
        <p>
          Inspired by previous work, we decided to exploit emoji on Twitter [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We used the
twint library7 to obtain tweets containing exclusively anger or joy emoticons in multiple
6-month time bins, going back from the middle of 2020 to the end of 2014. We collected 20,000
tweets per emotion per time bin. We collected joy as a control, since on average it
is unlikely that anger words will be present in joyous posts. The point of collecting over
different time periods was to factor out the content words associated with anger at
particular times. We found “Trump” to be the most popular correlate of anger going back
to before 2016, requiring a large number of time bins to push him down our ranked list
of anger words.
        </p>
        <p>
          Utilizing a simple technique from social scientific research into media bias [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we
ranked words from this dataset by their χ² statistic, comparing the frequency of each
word’s appearance in anger and joy tweets. To account for variation over time, we
scaled the χ² statistic by its entropy over time periods, promoting stable anger words
and demoting words whose correlation with each emotion varies over time. The
final results appeared quite reasonable, with the top five words being: angry, bloody,
crap, disgusted, and unfair. We tuned the lexicon to the dataset by taking the top-50
stems that were predictive of the “Spreader” class, as determined by the average weight
given to each word by a linear support vector classifier using 5-fold cross-validation.
The distribution that results from applying this lexicon is given in Figure 4. It does a much
better job of distinguishing the two classes.</p>
        <p>[Figure 4: panels: Anger, Distrust.]</p>
        <p>5 Taiwan’s strategy for combating disinformation, “Humor over Rumor,” is based on this
understanding: humor being the best antidote to anger [16].</p>
        <p>6 Our search is unlikely to have been exhaustive.</p>
        <p>7 https://github.com/twintproject/twint.</p>
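        <p>The χ² ranking with entropy scaling described in this section might look roughly like the sketch below. The text does not specify the exact computation, so several choices here are assumptions: the standard Pearson χ² for a 2×2 table is used, only bins in which a word leans towards anger contribute, the per-bin χ² values are normalized into a distribution whose Shannon entropy (divided by the log of the number of bins) is the scaling factor, and all function names and toy counts are hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def normalized_entropy(values):
    """Entropy of values rescaled to sum to 1, divided by log(len(values))."""
    total = sum(values)
    if total == 0 or len(values) == 1:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(values))


def rank_anger_words(bins):
    """bins: one (anger_counts, joy_counts) pair of Counters per time bin."""
    vocab = set()
    for anger, _joy in bins:
        vocab.update(anger)
    totals = [(sum(anger.values()), sum(joy.values())) for anger, joy in bins]
    scores = {}
    for word in vocab:
        per_bin = []
        for (anger, joy), (n_anger, n_joy) in zip(bins, totals):
            a, b = anger[word], joy[word]  # word counts in anger / joy
            c, d = n_anger - a, n_joy - b  # all other words
            # Assumption: only evidence pointing towards anger is kept.
            leans_anger = a * d > b * c
            per_bin.append(chi2_2x2(a, b, c, d) if leans_anger else 0.0)
        # Words with evenly spread evidence keep their score; bursty ones lose it.
        scores[word] = sum(per_bin) * normalized_entropy(per_bin)
    return sorted(scores, key=scores.get, reverse=True)


# Toy counts for two time bins: "angry" is stable, "trump" spikes in one bin.
bins = [
    (Counter({"angry": 6, "trump": 10, "today": 4}),
     Counter({"happy": 12, "today": 8})),
    (Counter({"angry": 6, "today": 14}),
     Counter({"happy": 12, "today": 7, "trump": 1})),
]
ranking = rank_anger_words(bins)
print(ranking[0])  # angry
```
        </preformat>
        <p>With only two toy bins, a word that is associated with anger in a single bin receives an entropy of zero and drops to the bottom of the list, which mirrors the intent of demoting topical words whose association with anger is not stable over time.</p>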
        <p>Distrust. One recurrent theme in contemporary disinformation has been distrust in
institutions. We therefore experimented with creating a “distrust” lexicon. We used the
χ² statistics of unigrams, comparing their frequencies in the two class labels, and
manually inspected the word list, pulling out words pertaining to distrust. The final lexicon
includes 36 words such as: fake, suspect, unbelievable, doubt, question, lie, and scam.
The resulting distributions are shown in Figure 4. We generally see higher distrust scores
for spreaders of disinformation, suggesting this could be a promising feature.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We use NLTK [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to tokenize the tweets, removing the special tokens introduced by
the dataset authors, such as “#URL#.” The features described above are then calculated
for each tweet and averaged over each user's tweets, as given in the equations above.
      </p>
      <p>
        We separately trained support vector, random forest, and naive Bayes classifiers
with scikit-learn [13], using 5-fold cross-validation to find optimal hyperparameters. The
support vector classifier used an RBF kernel and C = 1.0, achieving 69.7% accuracy on
the training set. The random forest used 40 estimators with a max depth of 5, achieving
68.3% accuracy. The naive Bayes classifier achieved 64.3% accuracy. We then ensembled these
classifiers with XGBoost [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which achieved a 5-fold cross-validated accuracy of 89.3%
on the training set.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>It is very possible that the minor distributional differences in many of our features are an
artifact of random sampling and fail to generalize, particularly those relating to
thoughtfulness. However, cross-validation indicated their usefulness for this task, and the
intuitions behind many of the features seem reasonable. It is likely that our use of XGBoost
led to overfitting, given the small number of labeled data points. We may also have
overfit the training set when building our anger and distrust lexicons.</p>
      <p>It would be much more interesting to compare these features across languages.
Perhaps due in part to our unfamiliarity with Spanish, we were unable to easily find
open-source resources to match what we have obtained for English.</p>
      <p>12. Noble, B., Fernández, R.: Centre stage: How social network position shapes linguistic
coordination. In: Proceedings of the 6th Workshop on Cognitive Modeling and
Computational Linguistics. pp. 29–38. Association for Computational Linguistics, Denver,
Colorado (Jun 2015). https://doi.org/10.3115/v1/W15-1104,
https://www.aclweb.org/anthology/W15-1104</p>
      <p>13. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,
Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal
of Machine Learning Research 12, 2825–2830 (2011)</p>
      <p>14. Pennebaker, J., Boyd, R., Jordan, K., Blackburn, K.: The development and psychometric
properties of LIWC2015 (09 2015). https://doi.org/10.15781/T29G6Z</p>
      <p>15. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling
Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff,
C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR
Workshop Proceedings (Sep 2020), CEUR-WS.org</p>
      <p>16. Silva, S.: Coronavirus: How map hacks and buttocks helped Taiwan fight Covid-19. BBC
(2020), https://www.bbc.com/news/technology-52883838</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining</article-title>
          .
          <source>In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)</source>
          .
          <source>European Language Resources Association (ELRA)</source>
          , Valletta, Malta (May
          <year>2010</year>
          ), http://www.lrec-conf.org/proceedings/lrec2010/pdf/769_Paper.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bandhakavi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiratunga</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>P</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Generating a word-emotion lexicon from #emotional tweets</article-title>
          .
          <source>In: Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)</source>
          . pp.
          <fpage>12</fpage>
          -
          <lpage>21</lpage>
          . Association for Computational Linguistics and Dublin City University, Dublin, Ireland (Aug
          <year>2014</year>
          ). https://doi.org/10.3115/v1/S14-1002, https://www.aclweb.org/anthology/S14-1002
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>Natural Language Processing with Python</source>
          . O'Reilly Media Inc. (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Q&amp;A: Taiwan's digital minister on combatting disinformation without censorship</article-title>
          . Committee to Protect Journalists (
          <year>2019</year>
          ), https://cpj.org/2019/05/qa-taiwans-digital-minister-on-combatting-disinfor/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>XGBoost</article-title>
          .
          <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          (Aug
          <year>2016</year>
          ). https://doi.org/10.1145/2939672.2939785
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          . CoRR abs/1810.04805 (
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gentzkow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapiro</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>What drives media slant? Evidence from U.S. daily newspapers</article-title>
          .
          <source>Econometrica</source>
          <volume>78</volume>
          ,
          <fpage>35</fpage>
          -
          <lpage>71</lpage>
          (12
          <year>2006</year>
          ). https://doi.org/10.2139/ssrn.947640
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ghanem</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>An Emotional Analysis of False Information in Social Media and News Articles</article-title>
          .
          <source>ACM Transactions on Internet Technology (TOIT) 20(2)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Unlearn dataset bias in natural language inference by fitting the residual (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kitaev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Constituency parsing with a self-attentive encoder</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . Association for Computational Linguistics, Melbourne, Australia (July
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Niven</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kao</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          :
          <article-title>Probing neural network comprehension of natural language arguments</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>4658</fpage>
          -
          <lpage>4664</lpage>
          . Association for Computational Linguistics, Florence, Italy (Jul
          <year>2019</year>
          ). https://doi.org/10.18653/v1/P19-1459, https://www.aclweb.org/anthology/P19-1459
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>