A Focused Analysis of Twitter-based Disinformation from Foreign Influence Operations

Julio Amador Díaz López (j.amador@imperial.ac.uk), Pranava Madhyastha (pranava@imperial.ac.uk)
Imperial College London, London, UK

ABSTRACT
Detection of foreign political influence operations is an important problem in the current era of high-volume information exchange. In this paper, we present a focused study of disinformation from a foreign influence campaign on Twitter during the 2016 US presidential election. We introduce a new dataset of political disinformation related to a foreign influence operation on Twitter during the 2016 presidential campaign in the United States. We further analyze the differences in word usage between information pushed by foreign agents and legitimate information. We also investigate the utility of subword-level information for classification. Contrary to popular belief, we observe that considering only subword-level information may lead to sub-optimal results.

KEYWORDS
Disinformation, Twitter

KnOD'21 Workshop, April 14, 2021, Virtual Event. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
As the spread and diffusion of fake news has reached the mainstream, the detection of all kinds of disinformation, understood as pieces of information purposefully crafted to deceive, has attracted significant interest from the NLP community. The detection of an important type of disinformation campaign – foreign influence operations – occupies academics and practitioners alike, particularly in times of an election. However, this is a very challenging task, as the detection of any type of disinformation is difficult even for humans. There is an urgent need to build automated systems that detect disinformation and stem its spread.

Research in automated deception detection has made extensive use of textual features to detect disinformation. This research is grounded in psychological and social-scientific insights showing that deceivers' usage of language is often flawed. That is, when deceivers try to craft their messages to imitate non-deceivers, the messages frequently contain language "leakages". Such information is extremely relevant for the detection of disinformation [Feng and Hirst 2013; Rubin 2017]. A well-known example of such leakages can be found in the AIDS disinformation campaign, where deceivers used syntax that would not otherwise be used by native speakers; e.g., "virus flu" instead of "flu virus" [Ellick and Westbrook 2018]. More recently, misspellings in political disinformation were found to be particularly useful in flagging specific social media posts as originating from a bad actor spreading disinformation [Alba 2020].

Different research strands have focused on studying diverse aspects of disinformation; e.g., [Barrón-Cedeño et al. 2019; Monti et al. 2019; Vlachos and Riedel 2014; Wang 2017; Zubiaga et al. 2016]. Specific to the study of morphologies is the work of [Kapusta and Obonya 2020] and [Zervopoulos et al. 2020]. [Kapusta and Obonya 2020] use a corpus in Slovak and conclude that morphological pre-processing helps classification performance. [Zervopoulos et al. 2020] study disinformation content related to the protests in Hong Kong and find significant differences in morphological variance between disinformation and other types of information, and that such differences can be exploited to improve classifier performance.

In this paper, we focus on the 2016 US presidential election. We are particularly interested in understanding word usage and the relevance of subword information for detection. Towards this end, we present a new dataset of political disinformation on Twitter [1]. We analyze distributional representations to uncover the patterns associated with disinformation. We also compare the contribution of word-level and character-level information in the context of more complex machine learning models for detection. Our primary contributions in this paper are: a) we release a curated dataset aimed at detecting disinformation (Section 2); b) we present an analysis of word usage in the context of disinformation during the 2016 US elections (Section 3); c) our analysis reveals the potential limitations of sub-word units for deception detection (Section 3).

[1] While Twitter refers to the accounts used by this research as spreading misinformation, we follow [Linvill and Warren 2020] and refer to these accounts as spreading disinformation.
2 DATA
Our dataset is made up of two parts. Set (1) was collected between November 9th 2016 and March 31st 2017 using the following keywords to retrieve tweets related to the election campaign: #MyVote2016, #ElectionDay, #electionnight, @realDonaldTrump, @HillaryClinton. This collection yielded a total of 57,379,672 tweets. Set (2) was retrieved from [Linvill and Warren 2020] and consists of 2,946,220 tweets ranging from June 19th, 2015 to December 31st, 2017. To ensure tweets corresponded only to the presidential campaign in the United States, we restricted tweets in set (2) to those before March 31st 2017, yielding a total of 1,244,480. Of these, we only retain original tweets (i.e., we purge 'retweets' and duplicate mentions). It is important to note that set (2) corresponds to accounts identified by the FBI as belonging to a foreign influence campaign. For more details see [Linvill and Warren 2020].

For the negative samples (samples which are not disinformation), we remove all tweets with author-level content that corresponds to the influence-campaign accounts in set (2). We also use only tweets in English. To ensure tweets in the sample are relevant, we restrict them to those whose geographical location in the metadata is within the US. Specifically, we restricted our sample to tweets whose geolocation coordinates fall within the US. We used Twitter's API to ensure that the tweets we considered came from users whose accounts had not been suspended by Twitter four years after the events, and we consider this a proxy for valid accounts. Specifically, we called Twitter's user API [2] and eliminated accounts that returned error 50 (user not found) and errors 63 and 64 (suspended accounts). This yielded 3,324 tweets. Finally, we manually checked these tweets to make sure their content was related to the 2016 presidential election. Next, we used random undersampling to balance the dataset.

[2] See: https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-lookup

We thus present a new dataset that consists of 6,808 unique tweets (i.e., 'retweets' and duplicate mentions are purged) in English that relate to the 2016 presidential election in the United States. The complete dataset has 16,193 tokens. Concerning categories, the dataset contains 3,324 tweets with 8,871 unique tokens labelled as legitimate information and 3,484 tweets with 10,434 unique tokens labelled as disinformation. Finally, we removed strings beginning with the following characters: #, @, .@, and https://, and removed emojis. This made the average length of the strings 10.8736 tokens. For our analyses, we normalized the text by converting all strings to lowercase.

Specific to the analyses in Section 3, we partitioned the dataset into training (60%), development (30%) and test (10%) sets. The training set has 2,083 tweets labelled disinformation and 2,001 labelled legitimate. The development set has 1,046 tweets labelled disinformation and 997 labelled legitimate. The test set contains 355 tweets labelled disinformation and 326 labelled legitimate.

The dataset is openly available here: https://zenodo.org/record/4639608#.YF3wxi2ZPOQ.
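The balancing and partitioning steps above can be summarized in a short sketch. This is a minimal illustration only, assuming the curated tweets are available in a pandas DataFrame with hypothetical `text` and `label` columns; the 60/30/10 proportions follow the description above, and the exact undersampling procedure used for the released dataset may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: one row per curated tweet with a binary label
# ("disinformation" vs "legitimate"); file and column names are assumptions.
df = pd.read_csv("curated_tweets.csv")  # columns: text, label

# Random undersampling: downsample the majority class to the size of the
# minority class so that the two categories are roughly balanced.
n_min = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=n_min, random_state=0))
)

# 60/30/10 partition into training, development and test sets, stratified by label.
train, rest = train_test_split(
    balanced, test_size=0.4, stratify=balanced["label"], random_state=0)
dev, test = train_test_split(
    rest, test_size=0.25, stratify=rest["label"], random_state=0)

print(len(train), len(dev), len(test))
```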
3 ANALYSIS
In this section, we first present our analysis of word usage and then expand on the utility of sub-word information. We further present an analysis of word-based and character-based CNN models on our dataset.

Word usage. Within the context of our research (i.e., foreign disinformation in the 2016 presidential election in the United States), we aim to understand whether tweets containing disinformation have different word-usage patterns from those containing legitimate information. To investigate this, we begin by exploring the word co-occurrence space spanned by these tweets.

We use point-wise mutual information (PMI) to capture collocations and associations. We obtain two co-occurrence matrices with PMI: one for disinformation and the other for legitimate information. Each of the matrices is of size 5000×5000. We further reduce the dimensionality of the matrices with Latent Semantic Analysis (LSA) to 300 dimensions, resulting in matrices of size 5000×300. We finally measure the cosine distance between all 5000 words across the two classes.

We further use sub-word level representations following [Bojanowski et al. 2017]. For each of the 5000 most frequent words, we obtain the representation of a word as a combination of character bi-grams; in this way, any word can be represented by the sum of its character bi-grams. This is one of the predominant ways of increasing coverage and decreasing out-of-vocabulary words in the literature [Liu et al. 2019; Sennrich et al. 2013].
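The following sketch illustrates one possible reading of this pipeline; it is not the authors' exact implementation. It assumes `disinfo_tweets` and `legit_tweets` (lists of token lists) and `vocab` (the 5,000 most frequent words) are already prepared, uses the 5-token window described in Appendix A.1, and clips the PMI matrix to positive values as a simplifying assumption.

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.decomposition import TruncatedSVD

def pmi_lsa_vectors(tweets, vocab, window=5, dim=300):
    """PMI co-occurrence matrix over `vocab`, reduced to `dim` dimensions with LSA."""
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for tokens in tweets:
        for i, w in enumerate(tokens):
            if w not in idx:
                continue
            # Co-occurrence within a window of 5 tokens to the left and right.
            for c in tokens[max(0, i - window): i + window + 1]:
                if c in idx and c != w:
                    counts[idx[w], idx[c]] += 1
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total   # P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0        # zero out log(0) entries
    pmi = np.maximum(pmi, 0.0)          # positive PMI (our simplifying assumption)
    return TruncatedSVD(n_components=dim, random_state=0).fit_transform(pmi)

def subword_vector(word, bigram_vecs):
    """Sub-word representation of a word as the sum of its character bi-gram vectors.
    `bigram_vecs` is an (assumed) mapping from character bi-gram to vector."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    return np.sum([bigram_vecs[b] for b in bigrams if b in bigram_vecs], axis=0)

# disinfo_tweets, legit_tweets: lists of token lists; vocab: 5,000 most frequent words.
vecs_disinfo = pmi_lsa_vectors(disinfo_tweets, vocab)
vecs_legit = pmi_lsa_vectors(legit_tweets, vocab)

# Cosine distance between the class-specific vectors of each word (cf. Table 1).
distances = {w: cosine(vecs_disinfo[i], vecs_legit[i]) for i, w in enumerate(vocab)}
print(np.mean(list(distances.values())))
```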
We present our analysis of the dataset in Table 1. Our premise here is that the distributional information captures word usage: if two words are used in similar ways, they should be very similar across the two classes, i.e., the cosine distance between them should tend to 0. In Table 1, we notice that the mean distance indicates that word usage differs considerably between the two classes. This is especially true when we consider word-level representations. We further notice that the sub-word level representations are much closer to each other than the word-level ones. We hypothesize that sub-word level information controls for morphological and typological variation and thereby does not capture the diversity as well as the word-level representations. We also present a few example words to illustrate the difference. We observe that the word-level distances for 'realdonaldtrump' and 'clinton' are large, reflecting the diversity of contexts in which they appear across the two classes, whereas the corresponding sub-word level representations are generally closer to each other.

Token            Word      Sub-word
realdonaldtrump  0.9902    0.6234
clinton          0.9301    0.2890
obama            0.8745    0.0996
fake             0.8428    0.2675
media            0.9011    0.1946
Mean diff        0.879673  0.143676

Table 1: Cosine distance between different tokens. Columns indicate word or sub-word distances. The last row reports the mean distance between tokens for word/sub-word representations.

Word or sub-word representations? The above findings suggest that models which control for sub-word differences are sub-optimal. We examine this further by using the partitioned dataset (i.e., the train, development and test sets) and building classifiers on top of concatenated LSA-based representations. In this case, the representations for words are obtained directly from LSA, while for the sub-word level we sum, for each word, the representations of its character bi-grams. We specifically make use of Naive Bayes, Logistic Regression and SVM-based classifiers. We present our results in Table 2.

                 Word level            Sub-word level
Classifier       Accuracy   F1 score   Accuracy   F1 score
Logistic         0.6823     0.6824     0.6334     0.6329
Naive Bayes      0.5862     0.5858     0.6164     0.6136
SVM              0.7275     0.7238     0.6236     0.6226

Table 2: Accuracy and macro F1 scores for the Logistic, Naive Bayes and SVM classifiers. Classification was done using the word and sub-word level representations built in Section 3.

We observe that, compared to sub-word level representations, word-level representations obtain better accuracy and F1 scores. We note that, while these results are not conclusive, they support the overarching theme that relying on sub-word level information alone is sub-optimal.
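A minimal sketch of the classification setup behind Table 2, under our own assumptions about details left open above: tweets are represented by flattening (concatenating) their per-word LSA vectors, padded or truncated to roughly 12 tokens (the average length mentioned in the appendix); `word_vecs`, `X_train`, `y_train`, `X_dev` and `y_dev` are assumed to be prepared from the partitioned dataset, and GaussianNB stands in for the unspecified Naive Bayes variant.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def tweet_features(tokens, word_vecs, max_len=12, dim=300):
    """Concatenate (flatten) the per-word vectors of a tweet, padding or
    truncating to `max_len` words; yields a max_len * dim feature vector."""
    vecs = [word_vecs[w] for w in tokens[:max_len] if w in word_vecs]
    mat = np.zeros((max_len, dim))
    if vecs:
        mat[:len(vecs)] = np.vstack(vecs)
    return mat.ravel()

# X_train, y_train, X_dev, y_dev: feature matrices and labels built by applying
# tweet_features to the train and development splits (assumed available).
classifiers = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),   # the paper does not state which NB variant was used
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_dev)
    print(name,
          accuracy_score(y_dev, predictions),
          f1_score(y_dev, predictions, average="macro"))
```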
We further take the best-performing SVM classifier, which uses word-level representations, and compare it to a frequently used CNN-based classifier [Kim et al. 2016] for the task of detecting misinformation. We use the classifier in its standard setup for our experiments [3].

[3] Further details regarding the setup are provided in the appendix.

Classifier        Accuracy   F1 score
SVM word-level    0.7275     0.7238
CNN word-level    0.6986     0.6970
CNN sub-word      0.7095     0.7088

Table 3: Accuracy and macro F1 scores for the SVM and Character CNN classifiers. The SVM used the representations from Section 3. The Character CNN used fasttext word and sub-word representations.

We note that the CNN classifier was trained using fasttext embeddings, both at the word and sub-word levels. We present our results in Table 3. We observe that the SVM classifier that uses word-level representations outperforms the more complex CNN-based models. We further perform an in-depth analysis of the CNN-based model and provide details in the appendix.

Bag-of-words/characters representations. The results presented above may be driven by the particular representations built for the preliminary analysis. In order to test the robustness of these findings, we train the Naive Bayes, Logistic and SVM classifiers using bag-of-words and bag-of-characters representations. The former is trained using word-level uni-grams, whereas the latter is trained using character-level uni-grams, bi-grams and tri-grams in order to consider different sub-word representations. Both use TF-IDF weighting. To underscore the robustness of our results, we use 10-fold cross-validation and report average accuracy scores.

                 Word level            Sub-word level
Classifier       Accuracy   Std Dev    Accuracy   Std Dev
Logistic         0.8735     0.0185     0.8501     0.0110
Naive Bayes      0.8499     0.0068     0.7980     0.0105
SVM              0.9173     0.00487    0.8341     0.0071
SVM stemmed      0.9071     0.0054     0.8165     0.0099

Table 4: Mean accuracy and standard deviations for the Logistic, Naive Bayes and SVM classifiers using bag-of-words and bag-of-characters representations.

The first three rows in Table 4 present the results. We note that changing the representations from dense to sparse does not change our results.

Morphologies. To understand whether morphologies are the contributing factor for classification performance, we further perform experiments with an SVM-based classifier trained on bag-of-stemmed-words, i.e., we first stem all words to remove morphological inflections and then train the bag-of-words classifier.

The last two rows of Table 4 show that, by removing prefixes and suffixes from the corpus, the classifier using sub-word level representations is affected the most. This suggests that merely controlling for morphologies, rather than integrating morphological information, may result in sub-optimal classification performance.
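A minimal sketch of the Table 4 setup under stated assumptions: the balanced corpus is available as `texts` (raw tweet strings) and `labels`, a linear SVM and the Porter stemmer stand in for choices the paper does not specify, and scikit-learn pipelines with 10-fold cross-validation approximate the evaluation protocol.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

stemmer = PorterStemmer()  # the stemmer choice is our assumption

representations = {
    # Word-level uni-grams (bag-of-words), TF-IDF weighted.
    "bag-of-words": TfidfVectorizer(analyzer="word", ngram_range=(1, 1)),
    # Character-level uni-, bi- and tri-grams (bag-of-characters), TF-IDF weighted.
    "bag-of-characters": TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    # Stemmed words: strip inflections before building the uni-gram model.
    "bag-of-stemmed-words": TfidfVectorizer(
        analyzer="word",
        preprocessor=lambda t: " ".join(stemmer.stem(w) for w in t.split())),
}

# texts, labels: the balanced corpus from Section 2 (assumed available).
for name, vectorizer in representations.items():
    pipeline = make_pipeline(vectorizer, SVC(kernel="linear"))
    scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="accuracy")
    print(f"{name}: mean={scores.mean():.4f} std={scores.std():.4f}")
```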
4 CONCLUSION
This paper is a focused study of a disinformation campaign from a foreign influence operation on Twitter during the 2016 US presidential election. We introduce a new dataset of political disinformation to explore differences between disinformation and legitimate information. Our analysis of the dataset indicates divergent word-usage patterns between disinformation and legitimate information. We also study sub-word patterns and their utility for classification. Our results indicate that classifiers that rely only on sub-word information may have better coverage, but they also control for morphological features, which may result in sub-optimal performance. We hope that our dataset can help inform novel insights relating to disinformation and propaganda and lead to the development of better detection algorithms.

REFERENCES
Davey Alba. 2020. How Russia's Troll Farm Is Changing Tactics Before the Fall Election. https://www.nytimes.com/2020/03/29/technology/russia-troll-farm-election.html
Alberto Barrón-Cedeño, Israa Jaradat, Giovanni Da San Martino, and Preslav Nakov. 2019. Proppy: Organizing the news based on their propagandistic content. Information Processing and Management 56, 5 (Sep 2019), 1849–1864. https://doi.org/10.1016/j.ipm.2019.03.005
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
Adam B. Ellick and Adam Westbrook. 2018. Opinion | Operation Infektion: A three-part video series on Russian disinformation. https://www.nytimes.com/2018/11/12/opinion/russia-meddling-disinformation-fake-news-elections.html
Vanessa Wei Feng and Graeme Hirst. 2013. Detecting deceptive opinions with profile compatibility. Technical Report. 14–18 pages.
Jozef Kapusta and Juraj Obonya. 2020. Improvement of misleading and fake news classification for flective languages by morphological group analysis. Informatics 7, 1 (Feb 2020), 4. https://doi.org/10.3390/informatics7010004
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-Aware Neural Language Models. In Thirtieth AAAI Conference on Artificial Intelligence.
Darren L. Linvill and Patrick L. Warren. 2020. Troll Factories: Manufacturing Specialized Disinformation on Twitter. Political Communication (2020). https://doi.org/10.1080/10584609.2020.1718257
Zihan Liu, Yan Xu, Genta Indra Winata, and Pascale Fung. 2019. Incorporating word and subword units in unsupervised machine translation using language model rescoring. arXiv preprint arXiv:1908.05925 (2019).
Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, and Michael M. Bronstein. 2019. Fake News Detection on Social Media using Geometric Deep Learning. arXiv:1902.06673. http://arxiv.org/abs/1902.06673
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
Victoria Rubin. 2017. Deception Detection and Rumor Debunking for Social Media. FIMS Publications (Jan 2017). https://ir.lib.uwo.ca/fimspub/92
Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013. INCOMA Ltd., Hissar, Bulgaria, 601–609. https://www.aclweb.org/anthology/R13-1079
Andreas Vlachos and Sebastian Riedel. 2014. Fact Checking: Task definition and dataset construction. (2014), 18–22.
William Yang Wang. 2017. Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Stroudsburg, PA, USA, 422–426. https://doi.org/10.18653/v1/P17-2067
Alexandros Zervopoulos, Aikaterini Georgia Alvanou, Konstantinos Bezas, Asterios Papamichail, Manolis Maragoudakis, and Katia Kermanidis. 2020. Hong Kong Protests: Using Natural Language Processing for Fake News Detection on Twitter. In IFIP Advances in Information and Communication Technology, Vol. 584. Springer, 408–419. https://doi.org/10.1007/978-3-030-49186-4_34
Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, and Michal Lukasik. 2016. Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. In COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers.

A APPENDIX

A.1 Word usage
We detail here how the representations used in the Word usage part of Section 3 are built. We first consider the 5,000 most frequent words in the corpus (out of 16,193 words) and then calculate the co-occurrence of these words within each of the two categories separately (i.e., we calculate the co-occurrence of the 5,000 most frequent words within tweets labelled disinformation and then within tweets labelled legitimate information) [4]. We consider a word w_j to co-occur with word w_i if w_j is within a window of 5 tokens to the left or right of w_i. We consider a wide window to capture the differences in word usage. Figure 1 shows the vector spaces for disinformation, legitimate information and the complete dataset.

[4] Further explorations were done with the 1,000, 2,000, and 8,000 most frequent tokens. We note that this did not lead to any significant differences.

Figure 1: Visual representation of word vector spaces. The upper panel contains the visual representation of every token in the corpus. The middle and lower panels contain the visual representation of the tokens in the legitimate and disinformation subsets of the dataset, respectively.

A.2 Classifiers
To perform experiments with the Naive Bayes, Logistic and SVM classifiers we made use of the sklearn package [Pedregosa et al. 2011]. Hyperparameters of all classifiers were chosen according to the development set. In particular, the parameters C, penalty, and fit_intercept were tuned for the logistic classifier; the parameters C, gamma, and kernel were tuned for the SVM classifier; and the parameters alpha and fit_prior were tuned for the Naive Bayes classifier.
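As a minimal sketch of this tuning procedure, the loop below selects SVM hyperparameters by accuracy on the development set; the candidate values are illustrative only (the paper names the tuned parameters but not the grids that were searched), `X_train`, `y_train`, `X_dev` and `y_dev` are assumed to be available, and the same loop applies to the Logistic and Naive Bayes parameter sets.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

# Candidate values are illustrative assumptions, not the grids used in the paper.
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001], "kernel": ["linear", "rbf"]}

best_score, best_params = -1.0, None
for params in ParameterGrid(grid):
    model = SVC(**params).fit(X_train, y_train)
    score = accuracy_score(y_dev, model.predict(X_dev))  # model selection on the dev set
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```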
A.3 Character CNN
Here we detail the Character CNN used in Section 3:
• Character CNN inputs: embeddings initialised with fasttext + 1D convolution with 3 filters of size 2 and a tanh activation + 1D convolution with 4 filters of size 3 and a tanh activation + 1D convolution with 5 filters of size 5 and a tanh activation + max-pooling over time layer + dense layer using a sigmoid activation.

The number of hidden units, dropout, and the learning rate were tuned using uniform random sampling, and the CharCNN was trained for 100 epochs. Details related to the hyperparameters can be found in the code. Figures 2 and 3 plot the loss and the training and validation accuracy for the word-level and sub-word-level inputs, respectively. It is important to underscore that the CharCNN is far more complex, with just over 11K parameters, whereas the SVM used only 3.6K parameters (the average tweet length of 12 times the 300-dimensional representations, flattened).

Figure 2: Average loss, train and validation accuracy for the Character CNN using word-level embeddings as inputs.

Figure 3: Average loss, train and validation accuracy for the Character CNN using subword-level embeddings as inputs.
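A minimal Keras sketch of the architecture described in A.3, under stated assumptions: the description does not say whether the three convolutions are stacked or applied in parallel, so we follow the common parallel arrangement for text CNNs; `vocab_size`, the pretrained fasttext matrix `embedding_matrix`, and the input length of 12 tokens are placeholders the reader must supply.

```python
import tensorflow as tf
from tensorflow.keras import layers

max_len, emb_dim = 12, 300   # input length and fasttext dimensionality (assumptions)

# vocab_size and embedding_matrix (shape: vocab_size x emb_dim, fasttext-initialised)
# are assumed to be prepared beforehand.
inputs = layers.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(
    vocab_size, emb_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))(inputs)

# Three convolutions as listed in A.3: 3 filters of width 2, 4 filters of width 3,
# and 5 filters of width 5, each with a tanh activation, followed by
# max-pooling over time.
pooled = []
for n_filters, width in [(3, 2), (4, 3), (5, 5)]:
    conv = layers.Conv1D(n_filters, width, activation="tanh")(x)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Concatenate()(pooled)
outputs = layers.Dense(1, activation="sigmoid")(merged)   # disinformation vs legitimate

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```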