Multilingual Detection of Fake News Spreaders via Sparse Matrix Factorization
Notebook for PAN at CLEF 2020

Boško Koloski1,2, Senja Pollak1, and Blaž Škrlj1
1 Jožef Stefan Institute, Ljubljana, Slovenia
2 Faculty of Information Science, University of Ljubljana, Slovenia
blaz.skrlj@ijs.si

Abstract. Fake news is an emerging problem in online news and social media. Efficient detection of fake news spreaders and spurious accounts across multiple languages is becoming an interesting research problem, and is the key focus of this paper. Our proposed solution to the PAN 2020 fake news spreaders challenge models the accounts responsible for spreading fake news by accounting for different types of textual features, decomposed via sparse matrix factorization, to obtain compact, easy-to-learn-from representations that include information from multiple languages. The key contribution of this work is the exploration of how powerful and scalable matrix factorization-based classification can be in a multilingual setting, where the learner is presented with data from multiple languages simultaneously. Finally, we explore the joint latent space, in which patterns from individual languages are maintained. The proposed approach scored second in the 2020 PAN shared task on identification of fake news spreaders.

1 Introduction

The notion of fake news refers to distortions of news with the intention to affect the political landscape and to create confusion and divisions in society. Even if the phenomenon of fake news is not new, its scale and impact have never been as significant as they are today, which can be attributed to the digital transformation of the news industry, and especially to the rise of social media as a news distribution channel [6]. One of the crucial problems is the recognition of fake news spreaders. For example, Twitter bots (fake accounts) are capable of generating fake information and propagating it through their follower networks, which can impact real-life entities such as stock markets and possibly even elections [4]. Automatic detection of such spreaders is thus becoming one of the key approaches to minimizing the manual annotation costs incurred by social media owners. This work fits under the framework of the PAN author profiling tasks [21,19], and describes our approach submitted to the PAN 2020 shared task on Profiling Fake News Spreaders on Twitter [22].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

This paper is structured as follows. Section 2 presents related work, and Section 3 describes the problem addressed in this work. In Section 4 we present the proposed method, followed by the empirical evaluation and a discussion.

2 Related work

A critical mass of fake news can have serious, real-life consequences, and can, for example, impact the election process [3]. Distinguishing between real and fake news content has been addressed by linguistic approaches focusing on text properties such as writing style and content [18], and by network approaches, where network properties and behavior complement content-based approaches that rely on deceptive language and leakage cues to predict deception [1]. A very relevant subtopic of fake news research is the detection of fake news spreaders.
Commonly, fake news spreaders are implemented as bots [23], and are able to carry out the spreading process in a completely automated manner. It is still an open research question whether active prevention of fake news spreading is a viable tactic, and to what extent it can be implemented in real-life online systems [15]. Further, previous PAN submissions on the topic of bot prediction (e.g., [11]) indicate that the best models perform well when different types of textual features, capturing semantic as well as morphological information, are used. Twitter fake news spreaders can be captured in their own social bubbles, which was shown to be an efficient defense tactic [10]; here, simple tweet frequency distributions were already indicative of spurious behavior. Classification based on features such as account age was also shown to work well [7]. In a recent survey [24], the authors emphasize that fact-checking is an important step in maintaining the quality of online social media. By employing automated systems capable of prioritizing potentially interesting users, less time is spent on manual curation, which can be an expensive and time-consuming process. Traditional classifiers with extensive feature engineering seem to be pervasive in the literature on distinguishing between bots and humans, but there have also been attempts to tackle the task with neural networks. In recent work, [5] proposed a behavior-enhanced deep model (BeDM) that regards user content as temporal text data rather than plain text, and fuses content and behavior information using a deep learning method. They report an F1-score of 87.32% on a Twitter-related dataset. Finally, low-dimensional representations have recently been shown to perform well for social media-based profiling [20].

3 Problem description

Given a timeline of selected tweets with ground-truth labels distinguishing fake news spreaders from non-spreaders, the goal is to decide whether a new author is a spreader of fake news or not. Formally, we are given the following decision problem: for an author $A$ who tweets in language $L \in \{\text{English}, \text{Spanish}\}$, and a subset of tweets $C_A = \{t_1, t_2, \dots, t_n\}$ from the tweet collection $C$, where $t_i$ represents the content of a single tweet, find a decision function $f : C_A \mapsto \text{author reliability}$, hence

$$f(C_A) = \begin{cases} 0 & \text{a non fake-news spreader;} \\ 1 & \text{a fake-news spreader.} \end{cases}$$

This decision problem is a specialization of the problem of author profiling. It requires learning a representation from $C_A$ suitable for approximating $f$. The provided data consists of tweets by 300 English and 300 Spanish authors. For each author, 100 tweets are provided, making a total of 30,000 English and 30,000 Spanish tweets. The class balance is consistent for both languages, each having 150 negative and 150 positive samples, as shown in Table 1.

Table 1. Dataset distribution
Language | Spreaders | Non-spreaders
English  | 150       | 150
Spanish  | 150       | 150

4 Method description

The following section describes the proposed method with its intermediate steps.

4.1 Pre-processing

First, the tweets of each author are concatenated, and only printable characters are kept. Data pre-processing for both English and Spanish then includes the following steps (a minimal sketch in code is given after the list):
1. punctuation is removed from the original data;
2. URLs and hashtags are removed from the result of step (1);
3. stop-words are removed from the output of step (2).
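A minimal sketch of this pipeline, assuming Python with NLTK stop-word lists and the standard library for punctuation handling (the exact tokenization and stop-word resources of the submission are not specified here, so these are illustrative choices). Note that URLs and hashtags are stripped before punctuation so that their marker characters are still intact:

```python
import re
import string

from nltk.corpus import stopwords  # requires nltk.download('stopwords')

def preprocess(tweets, language="english"):
    """Concatenate an author's tweets and apply the three cleaning steps."""
    text = " ".join(tweets)
    # Keep printable characters only.
    text = "".join(ch for ch in text if ch.isprintable())
    # Step 2: remove URLs and hashtags (applied before punctuation removal,
    # otherwise the '#' and '://' markers would already be gone).
    text = re.sub(r"https?://\S+|#\w+", " ", text)
    # Step 1: remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 3: remove stop-words.
    stops = set(stopwords.words(language))
    return " ".join(tok for tok in text.lower().split() if tok not in stops)
```

For Spanish authors, stopwords.words("spanish") would be used analogously.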
4.2 Automatic feature construction

For each author's collection of tweets, we initially define a collection of $n$ candidate features from the pre-processed data, which are iteratively selected and weighted, similarly to Martinc et al. [12]. The features generated during construction are based on the following feature types:
– character based: each text is tagged with character n-grams of size 2 and 3, generating a predetermined maximum allowed number of features ranging from $n/2$ up to 15,000 features;
– word based: each text is tagged with word n-grams of size 1 and 2, generating a predetermined maximum allowed number of features ranging from $n/2$ up to 15,000 features.
At this point, we have prepared word and character features from each author's collection of tweets, ready to be used in the feature selection step.

4.3 Dimensionality reduction via matrix factorization

Next, we perform sparse singular value decomposition (SVD) [8], implemented as TruncatedSVD in scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), which can be summarized via the following expression:

$$M = U \Sigma V^T.$$

The final representation (embedding) $E$ is obtained by multiplying $U$ with only a portion of the diagonal matrix $\Sigma$, giving a low-dimensional, compact representation of the initial high-dimensional matrix. Note that $E \in \mathbb{R}^{|D| \times d}$, where $d$ is the number of diagonal entries considered. The obtained $E$ is suitable for a given downstream learning task, such as classification (considered in this work). Note that performing SVD in the text mining domain is also commonly associated with the notion of latent semantic analysis.

4.4 Classifier selection

For this task, we aimed for a classification model that is robust yet highly flexible, one that scores well on the prepared data without requiring many features or extensive processing power. Following this goal, we conducted a series of experiments, trying different representations with corresponding linear models, as presented in Section 5. The classifiers used were the following (from scikit-learn [17]): Random Forest, Logistic Regression, and Support Vector Machines [9].
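A minimal sketch of Sections 4.2-4.3 under the assumption that the n-gram features are produced with scikit-learn's TfidfVectorizer and factorized with TruncatedSVD; the iterative selection and weighting of [12] is simplified away here, and build_representation is a hypothetical helper name:

```python
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def build_representation(docs, n_features=5000, k=512):
    """docs: one pre-processed string per author (output of preprocess)."""
    # Character 2- and 3-grams and word 1- and 2-grams,
    # with a cap on the number of features per type.
    char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3),
                               max_features=n_features)
    word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                               max_features=n_features)
    M = hstack([char_vec.fit_transform(docs), word_vec.fit_transform(docs)])
    # Sparse SVD: E = U_k * Sigma_k; k must not exceed the rank of M.
    svd = TruncatedSVD(n_components=k)
    E = svd.fit_transform(M)  # shape: (len(docs), k)
    return E, (char_vec, word_vec, svd)
```

The resulting dense matrix $E$ is then passed to one of the linear classifiers from Section 4.4.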
5 Conducted experiments

Considering the size of the dataset and the distribution of the data within it, we performed a series of experiments, all of which aimed to test the pipeline described in Section 4. The experiments can be divided into two main categories, based on the language considered by a given model:
1. Multilingual: the data of both languages is fused together and subjected to the same feature construction and representation creation steps.
2. Monolingual: for each language in the dataset, English and Spanish, we create a separate pipeline, executed exclusively on the data of the given language.

For both approaches, we performed an extensive grid search over the parameter space to find the best hyper-parameter configuration, with the help of scikit-learn's GridSearchCV function (see the sketch after this section). Using 10-fold cross-validation, the grid consisted of the number of reduced dimensions $k \in \{128, 256, 512, 640, 768, 1024\}$ and the number of generated features $n \in \{2500, 5000, 10000, 20000, 30000\}$.

The monolingual variant split the data of each language separately into a 90% training set and a 10% validation set, obtaining 270 training examples ($C_{training}$) and 30 validation examples ($C_{validation}$) per language. Only the training data was used for feature construction and dimensionality reduction.

The multilingual variant merged the data of both languages, after which the same approach was applied. Merging the data of both languages potentially reduces the computational load required to train two separate models. The data was split into a 90% training set and a 10% validation set, obtaining 540 training examples ($C_{training}$) and 60 validation examples ($C_{validation}$). In each iteration, we generated $n$ features, giving a matrix in $\mathbb{R}^{540 \times n}$, and reduced it to dimension $k$, obtaining a matrix in $\mathbb{R}^{540 \times k}$:

$$g(C_{training}, n) : \mathbb{R}^{N \times n} \xrightarrow{\text{SVD}} \mathbb{R}^{N \times k},$$

where $g$ denotes the procedure from Section 4.3. Once constructed, the feature space was subject to learning. We experimented with both logistic regression and linear SVMs; initially, some experiments were also conducted with a Random Forest model. The hyper-parameters were optimized via 5-fold cross-validation, considering the size of the dataset. Finally, we tested the performance on the $C_{validation}$ set.
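A sketch of the described grid search, assuming the feature construction and SVD steps are composed into a scikit-learn Pipeline; only the word-level branch is shown for brevity, while the full model also stacks character n-grams:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("svd", TruncatedSVD()),
    ("clf", LogisticRegression()),
])

param_grid = {
    "tfidf__max_features": [2500, 5000, 10000, 20000, 30000],  # n
    "svd__n_components": [128, 256, 512, 640, 768, 1024],      # k
}

search = GridSearchCV(pipeline, param_grid, cv=10,
                      scoring="accuracy", n_jobs=-1)
# search.fit(train_docs, train_labels)   # train_docs: one string per author
# print(search.best_params_, search.best_score_)
```

In practice, the number of retained components is additionally bounded by the rank of the training-fold matrix.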
[Figure 4. Visualization of the latent spaces used to train the final models (panels: Figure 1, English distribution; Figure 2, Spanish distribution; Figure 3, merged distribution). The orange color corresponds to spreaders and the blue to non-spreaders. The plots indicate that the cluster structure is maintained in the latent space.]

We visualize the distribution of the dataset reduced to 2 dimensions using UMAP [13] dimensionality reduction in Figure 4. Figures 1 and 2 show the visualization of the best monolingual model described in Section 6, while Figure 3 shows the joint latent space generated by the multilingual model described in the same section.

6 Results

We constructed two baselines: one based on TF-IDF with a Logistic Regression (LR) classifier using L1 regularization, and one based on doc2vec representations with a Random Forest (RF) classifier. The array of experiments conducted yielded the results presented in Table 2, and the outcomes of our final submission are shown in Table 3. As discussed in Section 5, all training was conducted on the $C_{training}$ data and validation on the $C_{validation}$ set. Table 2 shows the model results as measured in the TIRA training evaluation on the whole $C_{training} \cup C_{validation}$ data.

Table 2. Results on the TIRA training data
name                  | type  | #features | #dimensions | model   | EN ACC | ES ACC
tfidf_large           | multi | 5000      | 768         | LR      | 0.9633 | 0.9867
tfidf_tweet_tokenizer | multi | 5000      | 768         | LR      | 0.9633 | 0.9533
tfidf_small           | mono  | 5000      | 512         | SVM,SVM | 0.9700 | 0.4900
tfidf_cv              | mono  | 10000     | 768         | SVM,SVM | 0.9100 | 0.9367
tfidf_no_hash         | multi | 10000     | 768         | LR      | 0.9300 | 0.9067
doc2vec_baseline      | mono  | 100       | #           | RF,SVM  | 0.6428 | 0.6971
tfidf_tpot_baseline   | mono  | 30000     | #           | LR,SVM  | 0.7500 | 0.7400
tfidf_baseline        | mono  | 10000     | #           | LR,LR   | 0.5567 | 0.7033

The final, unofficial evaluation as reported on TIRA's page is presented in Table 3.

Table 3. Unofficial evaluation on the TIRA test data
name        | type  | #features | #dimensions | model   | EN ACC | ES ACC
tfidf_large | multi | 5000      | 768         | LR      | 0.7150 | 0.7950
tfidf_cv    | mono  | 10000     | 768         | SVM,SVM | 0.7000 | 0.7950

The model column in Table 2 lists the classifiers used: if two classifiers are present, the model is monolingual (the first classifier is for English and the second for Spanish); if the model is multilingual, a single classifier is used. The type column indicates whether the model is mono- or multilingual. The name column consists of the vectorizer used, followed by a size qualifier or the tokenizer used. The #dimensions column denotes the number of dimensions SVD reduces to.

As can be seen, the highest evaluation score on our training data was obtained by the multilingual model tfidf_large, with the following hyper-parameters: $k = 768$ dimensions, $n = 5000$ features, and a Logistic Regression classifier with $\lambda_2 = 0.002$ and fit_intercept = False. The monolingual model that performed best is tfidf_cv, which for English is parametrized as an SVM with hyper-parameters $\alpha = 0.001$ and $\lambda_1 = 0.8$ under an elastic-net penalty, hinge loss, and power_t = 0.5, and for Spanish as an SVM with hyper-parameters $\alpha = 0.0005$ and $\lambda_1 = 0.25$ under an elastic-net penalty, hinge loss, and power_t = 0.9 (a sketch of how these map onto a concrete classifier is given below).

A more detailed insight into the performance of the best-performing models with respect to the number of word and character n-grams, together with the 5-fold cross-validation accuracy, is given in Figures 5 and 6. The figures show the performance of the best mono- and multilingual models; the confidence intervals indicate the variability obtained when repeating the experiments.

[Figure 5. Best monolingual model on evaluation data.] [Figure 6. Best multilingual model on evaluation data.]
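The reported SVM hyper-parameters (hinge loss, elastic-net penalty, $\alpha$, $\lambda_1$, and power_t) match those of scikit-learn's SGDClassifier; a minimal sketch under that assumption, where the reading of $\lambda_1$ as the l1_ratio argument is ours and not confirmed by the source:

```python
from sklearn.linear_model import SGDClassifier

# Best English monolingual model (assumed SGDClassifier parametrization).
svm_en = SGDClassifier(loss="hinge", penalty="elasticnet",
                       alpha=0.001, l1_ratio=0.8, power_t=0.5)
# Best Spanish monolingual model.
svm_es = SGDClassifier(loss="hinge", penalty="elasticnet",
                       alpha=0.0005, l1_ratio=0.25, power_t=0.9)

# Note: power_t only affects training when learning_rate="invscaling" is
# selected; under the default "optimal" schedule it is ignored.
# svm_en.fit(E_train_en, y_train_en)  # E: the SVD representation of Section 4.3
```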
7 Availability

The code and the pilot experiments are freely available at https://gitlab.com/skblaz/pan2020.

8 Discussion and conclusions

The series of experiments conducted as part of this work indicates that n-grams are still a sufficient method for the task of author profiling, compared to more complex methods such as transformers and word2vec [14], which can easily overfit when considering only hundreds of instances. As part of the initial experiments, we also attempted to include semantic features [25]; however, the results were not significantly better (nor worse) and only added to the computational time, hence such features were omitted from the final solution. We also tried to change the feature space by using different NLTK [2] tokenizers (TweetTokenizer) and by using TPOT [16] automatic model generation and selection; however, the results obtained were similar to those obtained by manual construction. The joint vector space, obtained by merging the data of both languages, maintains the patterns observed when projecting the individual language datasets, indicating that merging the data is a suitable tactic that does not result in complete loss of information.

In future work, we can focus on exploring the possibility of detecting fake news profiles across different languages, by first considering Latent Semantic Analysis across different language settings and further segmenting the semantic space prior to learning.

9 Acknowledgements

The work of the last author was funded by the Slovenian Research Agency through a young researcher grant. The work of the other authors was supported by the Slovenian Research Agency (ARRS) core research programme Knowledge Technologies (P2-0103), the ARRS-funded research project Semantic Data Mining for Linked Open Data (financed under the ERC Complementary Scheme, N2-0078), and the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).

References

1. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology (2016)
2. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media (2009)
3. Bovet, A., Makse, H.A.: Influence of fake news in Twitter during the 2016 US presidential election. Nature Communications 10(1), 1-14 (2019)
4. Brigida, M., Pratt, W.R.: Fake news. The North American Journal of Economics and Finance 42, 564-573 (2017)
5. Cai, C., Li, L., Zengi, D.: Behavior enhanced deep bot detection in social media. In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). pp. 128-130. IEEE (2017)
6. Ghanem, B., Rosso, P., Rangel, F.: An emotional analysis of false information in social media and news articles. ACM Transactions on Internet Technology (TOIT) 20(2), 1-18 (2020)
7. Gilani, Z., Kochmar, E., Crowcroft, J.: Classification of Twitter accounts into automated agents and human users. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. pp. 489-496. ACM (2017)
8. Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions (2009)
9. Hearst, M.A.: Support vector machines. IEEE Intelligent Systems 13(4), 18-28 (Jul 1998). https://doi.org/10.1109/5254.708428
10. Lee, K., Caverlee, J., Webb, S.: Uncovering social spammers: social honeypots + machine learning. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 435-442. ACM (2010)
11. Martinc, M., Škrlj, B., Pollak, S.: Fake or not: Distinguishing between bots, males and females. CLEF 2019 Evaluation Labs and Workshop - Working Notes Papers (2019)
12. Martinc, M., Škrlj, B., Pollak, S.: Multilingual gender classification with multi-view deep learning: Notebook for PAN at CLEF 2018. In: Cappellato, L., Ferro, N., Nie, J., Soulier, L. (eds.) Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR Workshop Proceedings, vol. 2125. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2125/paper_156.pdf
13. McInnes, L., Healy, J., Saul, N., Großberger, L.: UMAP: Uniform manifold approximation and projection. Journal of Open Source Software 3(29), 861 (2018). https://doi.org/10.21105/joss.00861
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111-3119. Curran Associates, Inc. (2013), http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
15. Mustafaraj, E., Metaxas, P.T.: The fake news spreading plague: was it preventable? In: Proceedings of the 2017 ACM on Web Science Conference. pp. 235-239 (2017)
16. Olson, R.S., Urbanowicz, R.J., Andrews, P.C., Lavender, N.A., Kidd, L.C., Moore, J.H.: Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 - April 1, 2016, Proceedings, Part I, chap. Automating Biomedical Data Science Through Tree-Based Pipeline Optimization, pp. 123-137. Springer International Publishing (2016)
17. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
18. Pérez-Rosas, V., Kleinberg, B., Lefevre, A., Mihalcea, R.: Automatic detection of fake news. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 3391-3401. Association for Computational Linguistics, Santa Fe, New Mexico, USA (Aug 2018), https://www.aclweb.org/anthology/C18-1287
19. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA integrated research architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019)
20. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: International Conference on Intelligent Text Processing and Computational Linguistics. pp. 156-169. Springer (2016)
21. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling fake news spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
22. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling fake news spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CEUR-WS.org (Sep 2020)
23. Shao, C., Ciampaglia, G.L., Varol, O., Flammini, A., Menczer, F.: The spread of fake news by social bots. arXiv preprint arXiv:1707.07592 (2017)
24. Zhou, X., Zafarani, R.: Fake news: A survey of research, detection methods, and opportunities. arXiv preprint arXiv:1812.00315 (2018)
25. Škrlj, B., Martinc, M., Kralj, J., Lavrač, N., Pollak, S.: tax2vec: Constructing interpretable features from taxonomies for short text classification. Computer Speech & Language, 101104 (2020)