    KU-CST at the Profiling Fake News spreaders Shared
                           Task
                       Notebook for PAN at CLEF 2020


                                    Manex Agirrezabal

                          Centre for Language Technology (CST)
                       Department of Nordic Studies and Linguistics
                     University of Copenhagen / Københavns Universitet
                                2300 Copenhagen (Denmark)
                         manex.aguirrezabal@hum.ku.dk




       Abstract In this document we present our approach for profiling fake news
       spreaders. The model relies on semantic features, part-of-speech tag-related
       features and other simple features. We have reached an accuracy of 0.697 and 0.810
       for English and Spanish, respectively, on validation data. Test accuracies using
       these same models reach 0.690 and 0.725 for English and Spanish data. We believe
       that this is a simple and robust model that could potentially be used as a
       baseline for this task.



1    Introduction

In this paper, we present our method for the Shared Task on Profiling Fake News Spreaders [12]. The method is a relatively simple model that could serve as a baseline, relying on semantics, word classes and some other simple features. All the code is available in this repository.1
     We expect that the topics (or meaning) that a fake news spreader covers will differ
from those covered by other users. In addition, we expect the part-of-speech (POS) tags
a user employs to be good predictors, as they are a common feature source in author
profiling. We also include the average tweet length in characters and the
uppercase/lowercase letter ratio, expecting these to help distinguish fake news spreaders.
     This document is structured as follows. First, we describe the resources that we have
employed. Then, we explain how the representation of each user is built. Next, we present
the classifiers that we have tested. Finally, we discuss the results and provide some
insights into possible future directions.

   Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
   License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020,
   Thessaloniki, Greece.
 1
   https://github.com/manexagirrezabal/PAN-PFNS2020
2    Resources

We have trained our models on the data published by the organizers of the Shared Task
on Profiling Fake News Spreaders2 [13]. This data set contains a feed of 100 tweets
for each of 300 different users, 150 of whom are fake news spreaders.
    In order to build the required user representations, we employed a number of re-
sources that are presented below.
    Semantic representations are built using word embeddings, for which we employed a
collection trained on Twitter [4]. The authors provide embeddings for several
languages3, English and Spanish among them.
    For part-of-speech (POS) tagging, we decided to build our own tagger, as commonly
used POS taggers may not work well on Twitter language because of shortened words,
platform-specific slang, and so on. We built a Hidden Markov Model POS tagger [2,10]
trained on Twitter data [3,14].4,5
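    As an illustration, a supervised HMM tagger of this kind could be trained with NLTK [2]
roughly as follows. This is only a minimal sketch: tagged_sents is assumed to hold the
tagged Twitter sentences from the corpora above as lists of (token, tag) pairs, and the
exact training setup we used may differ in its details.

    # Minimal sketch: supervised HMM POS tagger trained on tagged Twitter data.
    # `tagged_sents` is assumed to be a list of sentences, each a list of
    # (token, tag) pairs taken from the Twitter corpora cited above.
    from nltk.tag.hmm import HiddenMarkovModelTrainer

    trainer = HiddenMarkovModelTrainer()
    hmm_tagger = trainer.train_supervised(tagged_sents)

    # Tag a tokenized tweet.
    print(hmm_tagger.tag(["this", "is", "sooo", "cool"]))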


3    Representation of each user

Following the expectations mentioned in the introduction, we assume that averaging the
word embedding representations of all words that a user has written gives an
approximation of the semantic content that they published. Hence, we represent a user
as an average embedding (200 dimensions), together with the standard deviation of each
dimension. We do not apply any lemmatization, stemming or other preprocessing to the
tweets.
    We also include a bag-of-POS, which encodes the frequency of each part-of-speech tag,
normalized by the frequency of the most frequent tag so that all values lie in the range
0...1. While the English bag contains 53 different tags, the Spanish tagger captures 18
different tags.
    We then add some commonly used simple features: the average length of tweets in
characters and the uppercase letter ratio, which we compute by counting the uppercase
letters and dividing by the total number of uppercase and lowercase letters.
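    The sketch below illustrates how such a user vector could be assembled. It is only an
approximation of our pipeline: embeddings is assumed to map tokens to 200-dimensional
vectors, hmm_tagger is the tagger from Section 2, TAGSET is the tag inventory of the
corresponding language, and whitespace tokenization is a simplification.

    # Illustrative sketch of the user representation (assumed helper names:
    # `embeddings`, `hmm_tagger`, `TAGSET`; whitespace tokenization is a simplification).
    import numpy as np

    def user_vector(tweets):
        tokens = [tok for tweet in tweets for tok in tweet.split()]

        # Average word embedding and per-dimension standard deviation (2 x 200 dims).
        vecs = np.array([embeddings[t] for t in tokens if t in embeddings])
        semantic = np.concatenate([vecs.mean(axis=0), vecs.std(axis=0)])

        # Bag-of-POS, normalized by the count of the most frequent tag (range 0..1).
        tags = [tag for tweet in tweets for _, tag in hmm_tagger.tag(tweet.split())]
        counts = np.array([tags.count(t) for t in TAGSET], dtype=float)
        bag_of_pos = counts / counts.max()

        # Simple surface features: average tweet length and uppercase letter ratio.
        avg_length = np.mean([len(tweet) for tweet in tweets])
        upper = sum(ch.isupper() for tweet in tweets for ch in tweet)
        lower = sum(ch.islower() for tweet in tweets for ch in tweet)
        uppercase_ratio = upper / (upper + lower)

        return np.concatenate([semantic, bag_of_pos, [avg_length, uppercase_ratio]])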


4    Classifiers

We decided to use two linear classifiers, namely Logistic Regression and Linear SVM, as
they can be trained very fast and give good insight into how a set of features works. We
further included two non-linear models, the Multilayer Perceptron (MLP) and the Random
Forest, because of their popularity in text classification tasks. The MLP classifier was
trained with three hidden layers of size 50. All other classifiers were trained using
the default parameters of the Scikit-learn package [9].
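    A minimal sketch of this setup is shown below, assuming X and y hold the user vectors
and labels built as described in Section 3, and using the Stratified 5-fold
cross-validation reported in the next section.

    # Sketch of the classifier comparison; X are user vectors, y the labels.
    # All models use scikit-learn defaults except the MLP (three hidden layers of 50).
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    classifiers = {
        "Most frequent": DummyClassifier(strategy="most_frequent"),
        "Logistic Regression": LogisticRegression(),
        "Linear SVM": LinearSVC(),
        "Multilayer Perceptron": MLPClassifier(hidden_layer_sizes=(50, 50, 50)),
        "Random Forest": RandomForestClassifier(),
    }

    cv = StratifiedKFold(n_splits=5)
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f}")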
 2
   https://zenodo.org/record/3692319
 3
   https://www.spinningbytes.com/resources/wordembeddings/
 4
   https://gate.ac.uk/wiki/twitter-postagger.html
 5
   https://www.clarin.si/repository/xmlui/handle/11356/1078
5     Results on development and test data

In the table below we can see the results of the different classifiers. We validated the
models using Stratified K-Fold Cross-Validation with K = 5.

                        Classifier             Accuracy
                                             English Spanish
                        Most frequent         0.500 0.500
                        Logistic Regression   0.677 0.720
                        Linear SVM            0.503 0.550
                        Multilayer Perceptron 0.677 0.703
                        Random Forest         0.697 0.810

    Considering these results, we decided to use the Random Forest model as our final
model for testing. We obtained a test accuracy of 0.690 for English data and 0.725 for
Spanish.


Further experiments

We further experimented with including information about spelling errors. We used the
Python package pyspellchecker6 to detect misspelled words, and afterwards we recorded
which letter was replaced by which letter. Hence, if the alphabet has 27 letters, we
create a vector of 27² numbers in which we store how often each substitution happens.
The goal of this representation was to capture systematic errors that a user may make,
hoping that these would be characteristic of the user.
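    A hedged sketch of this feature is given below. The helper names are illustrative, and
the position-by-position alignment of equal-length words is only one possible way of
deciding which letter became which.

    # Illustrative sketch of the letter-substitution feature. The alignment here
    # (position-by-position comparison of equal-length words) is an assumption.
    import numpy as np
    from spellchecker import SpellChecker

    ALPHABET = "abcdefghijklmnopqrstuvwxyzñ"   # assumed 27-letter (Spanish) alphabet
    IDX = {ch: i for i, ch in enumerate(ALPHABET)}
    spell = SpellChecker(language="es")

    def substitution_vector(tokens):
        # 27 x 27 counts of how often one letter was written instead of another.
        matrix = np.zeros((len(ALPHABET), len(ALPHABET)))
        for word in spell.unknown(tokens):
            correction = spell.correction(word)
            if not correction or len(correction) != len(word):
                continue  # this sketch only aligns words of equal length
            for written, intended in zip(word.lower(), correction.lower()):
                if written != intended and written in IDX and intended in IDX:
                    matrix[IDX[intended], IDX[written]] += 1
        return matrix.flatten()  # 27 * 27 = 729 values per user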
    The results of this experiment can be seen below, using the same classifiers as be-
fore, and also validated under the same conditions (Stratified K-Fold Cross-Validation
with K = 5).

                        Classifier             Accuracy
                                             English Spanish
                        Logistic Regression   0.577 0.720
                        Linear SVM            0.570 0.677
                        Multilayer Perceptron 0.600 0.693
                        Random Forest         0.720 0.773

   Unfortunately, this last experiment was carried out after the competition deadline, and
therefore we could not evaluate this model's performance on the test data.


6     Discussion and Future work

In this paper, we presented a model that could potentially be used for capturing fake
news spreaders. Considering average accuracy, our model ranked 31st out of 66
participants in the competition. The ranking also includes six different baselines: a model
 6
     https://pypi.org/project/pyspellchecker/
that was used for language variety identification [11], an SVM trained on character
n-grams, a neural network trained on word n-grams, an emotion-based model (Emotionally
Infused Network) [5], an LSTM-based implementation and a random classifier. Our model
performs better than the last three baselines, but it is still worse than [11] and the
character-based SVM.
    In the box plot below, we illustrate how our model performs compared to the other
participants. Note that outliers, such as authors that did not participate in specific
language configurations, have been discarded.

[Box plot comparing the accuracy of our model with that of the other participants.]

    The presented model is relatively simple and efficient, but we believe that results
can still be improved. We mention here some possible future directions.
    As the character-based SVM baseline performs better than our model, we believe that
adding character-aware representations could boost our performance. This could be done
either by using character n-grams or by using a character-based Recurrent Neural Network
to build representations.
    Apart from that, we have not done any preprocessing in this work. Considering the
language used on Twitter [1,6], we believe that adding a normalization step could improve
our results. We could also perform lemmatization or stemming. By doing this, we would
expect to retrieve embeddings for a much larger share of the words.
    In the current work, we trained a very simple Hidden Markov Model for POS tagging.
This model may fall short because of the high number of misspellings in social media
language. This effect could be reduced by using a character-level tagger, such as a
BiLSTM+CRF model [7,8].


Acknowledgements
The author would like to acknowledge all participants of the course Language Processing 2
(spring semester of 2019/2020) in the M.S. program in IT & Cognition at the University of
Copenhagen, as the majority of the ideas presented here emerged from the discussions
during class.
References
 1. Alegria, I., Aranberri, N., Comas, P.R., Fresno, V., Gamallo, P., Padró, L., San Vicente, I.,
    Turmo, J., Zubiaga, A.: TweetNorm: a benchmark for lexical normalization of Spanish
    tweets. Language Resources and Evaluation 49(4), 883–905 (2015)
 2. Bird, S., Klein, E., Loper, E.: Natural language processing with Python: analyzing text with
    the natural language toolkit. O’Reilly Media, Inc. (2009)
 3. Derczynski, L., Ritter, A., Clark, S., Bontcheva, K.: Twitter part-of-speech tagging for all:
    Overcoming sparse and noisy data. In: Proceedings of the International Conference on
    Recent Advances in Natural Language Processing. Association for Computational
    Linguistics (2013)
 4. Deriu, J., Lucchi, A., De Luca, V., Severyn, A., Müller, S., Cieliebak, M., Hofmann, T.,
    Jaggi, M.: Leveraging large amounts of weakly supervised data for multi-language
    sentiment classification. In: Proceedings of the 26th international conference on world wide
    web. pp. 1045–1052 (2017)
 5. Ghanem, B., Rosso, P., Rangel, F.: An Emotional Analysis of False Information in Social
    Media and News Articles. ACM Transactions on Internet Technology (TOIT) 20(2), 1–18
    (2020)
 6. Gupta, I., Joshi, N.: Tweet normalization: A knowledge based approach. In: 2017
    International Conference on Infocom Technologies and Unmanned Systems (Trends and
    Future Directions)(ICTUS). pp. 157–162. IEEE (2017)
 7. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural
    architectures for named entity recognition. In: Proceedings of the 2016 Conference of the
    North American Chapter of the Association for Computational Linguistics: Human
    Language Technologies. pp. 260–270 (2016)
 8. Ling, W., Dyer, C., Black, A.W., Trancoso, I., Fermandez, R., Amir, S., Marujo, L., Luís,
    T.: Finding function in form: Compositional character models for open vocabulary word
    representation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural
    Language Processing. pp. 1520–1530 (2015)
 9. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
    Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,
    Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal
    of Machine Learning Research 12, 2825–2830 (2011)
10. Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP Magazine
    3(1), 4–16 (1986)
11. Rangel, F., Franco-Salvador, M., Rosso, P.: A Low Dimensionality Representation for
    Language Variety Identification. In: International Conference on Intelligent Text Processing
    and Computational Linguistics. pp. 156–169. Springer (2016)
12. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling
    Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff,
    C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR
    Workshop Proceedings (Sep 2020), CEUR-WS.org
13. Rangel, F., Rosso, P., Ghanem, B., Giachanou, A.: Profiling fake news spreaders on
    Twitter (Feb 2020), https://doi.org/10.5281/zenodo.3692319
14. Rei, L., Mladenic, D., Krek, S.: A multilingual social media linguistic corpus. In:
    Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities,
    Ljubljana, Slovenia (2016)