=Paper=
{{Paper
|id=Vol-2696/paper_242
|storemode=property
|title=Fusing Stylistic Features with Deep-learning Methods for Profiling Fake News Spreader
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_242.pdf
|volume=Vol-2696
|authors=Roberto Labadie,Daniel Castro Castro,Reynier Ortega Bueno
|dblpUrl=https://dblp.org/rec/conf/clef/LabadieCB20
}}
==Fusing Stylistic Features with Deep-learning Methods for Profiling Fake News Spreaders==
Fusing Stylistic Features with Deep-learning Methods for Profiling Fake News Spreaders
Notebook for PAN at CLEF 2020

Roberto Labadie-Tamayo, Daniel Castro-Castro, and Reynier Ortega-Bueno
Universidad de Oriente, Santiago de Cuba, Cuba
roberto.labadie@estudiantes.uo.edu.cu, danielbaldauf@gmail.com, reynier.ortega@uo.edu.cu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. False or unverified information spreads on social media platforms just like accurate information, so it can go viral and influence public opinion and decisions. Fake news is one of the most popular forms of false and unverified information and should be identified as soon as possible to minimize its harmful effects. To face this challenge, in this paper we describe the system we developed for the Author Profiling task "Profiling Fake News Spreaders on Twitter" proposed at the PAN 2020 forum. Our proposal learns two representations for each tweet in an account's profile. The first one is based on CNN and LSTM networks that analyze the tweets at word level; the second one is learned with the same architecture, without sharing weights, but analyzing the tweets at character level. These representations are used to model the accounts' profiles. In addition, a general representation of the whole profile is built from stylistic features, which capture information about the writing style. Finally, the profile representations are fed into our classification model, which is based on LSTM neural networks with an attention mechanism. Experimental results show that our model achieves encouraging results.

1 Introduction

In today's society, a large part of the communication between people takes place through digital information: videos, photos, texts, among others. Every day the amount of digital information generated by each person grows very fast, and analyzing these data to satisfy users' needs is a challenging task. The analysis of textual data in particular introduces several challenges, because texts are unstructured information written in Natural Language (NL), and there is a wide variety of textual genres and languages.

Computational applications for e-marketing and forensic analysis based on textual information deal with this diversity of challenges, but it is also important for them to know the socio-demographic characteristics of the texts' authors in order to direct the extraction of valuable information from textual sources. The Author Profiling (AP) task aims at discovering patterns (linguistic or otherwise) in texts that allow identifying characteristics of their authors, e.g. age, gender, personality, etc.

One of the most important evaluation forums for sharing recent research and ideas on tasks such as Author Profiling and Authorship Detection (AD) is the PAN platform (https://pan.webis.de/). In its last editions, the AP task has focused on the micro-blogging genre, because of the importance of knowing what information is spread among users and communities, how it spreads, and what the profile characteristics of such users are. The PAN 2016 profiling task [24] addressed the prediction of age and gender from a cross-genre perspective, using tweets as training samples and a different textual genre for evaluation.
The main goal of the PAN 2017 profiling task [23] was to identify the gender and language variety of the authors of Twitter messages. In the PAN 2018 profiling task [22] the goal was to address gender identification from a multi-modal perspective, considering text and images. The last edition, the PAN 2019 profiling task [19], focused on determining whether the author of a Twitter feed is a bot or a human, and on profiling the gender of human authors. There are other efforts in the research community on AP, such as profiling authors of Arabic [21][20][28], Russian [14] and the Mexican variant of the Spanish language [2]. Overall, the most studied languages have been English and Spanish, whereas gender and age are the most explored socio-demographic characteristics. In addition, Twitter has risen as the most salient textual genre, due to its popularity as a social platform.

A harmful phenomenon that is not new, but has gained special significance in recent years, is the spreading of fake news, due to its dramatic growth in social media and news sources. That is why many recent works try to determine whether a text is fake or at least contains false or unverified information. However, as important as knowing whether a text is fake is verifying whether a source generates and spreads fake information or misinformation; for example, a company that promotes a product without the advertised quality, or false data published in an election campaign. These scenarios can be analyzed with Author Profiling techniques. In this direction, this year's PAN profiling task, "Profiling Fake News Spreaders on Twitter", focuses on the following problem: "Given a Twitter feed, determine whether its author is keen to be a spreader of fake news" [18].

This working note is organized as follows: the next section gives a brief description of the main aspects of the related work presented in the last three profiling evaluation forums. Next, we present our proposal; specifically, we describe the data preprocessing as well as the deep-learning method used as classification model. Finally, we describe the experimental setting, the experiments conducted and the results achieved.

2 Related Works

Different aspects should be considered as part of the AP algorithms presented by the research community and highlighted in the overviews published by the organizers of the different PAN profiling editions. An important issue is the preprocessing step applied to the texts, considering that genre-dependent characteristics can be eliminated or homogenized, such as, in tweets, the use of hashtags, mentions, URLs, etc. Generally, previous approaches have used content- and style-based features and, in recent years, word and character embeddings [23][22][19].

Regarding the machine learning algorithms, most of the studied works have employed traditional methods like Logistic Regression (LR) [31], Support Vector Machines (SVM) [26] and Random Forest (RF) [11], among others. The textual information has been represented through lexical and character features, using n-gram models and bag-of-words (BoW) representations [26][31]. Several approaches have employed style features such as the analysis of capital letters, function words, punctuation marks, etc. [6][8]. In recent PAN profiling tasks, word and document embeddings were introduced [23][22][19][15][12]. Shallow machine learning methods are the most employed ones, and among these, the most used are SVM, RF, LR and distance-based approaches.
Deep-learning techniques have been introduced since PAN 2017. In particular, these methods have focused on the use of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [25][5]. The goal of our approach is to fuse general stylistic features with deep learning-based representations. The stylistic feature vector is used as an input representation that takes part in the learning process of our deep-learning architecture.

3 Our Proposal

The motivation of our approach is twofold. Firstly, deep-learning methods are able to learn feature representations that are omitted by hand-crafted feature engineering, and they can model levels of abstraction beyond human reach. Secondly, author profiling based on stylistic and linguistic features combined with shallow supervised learning methods has been well studied in previous research; these features have proved to be adequate descriptors for determining author characteristics such as age, gender, language variety, personality, etc. Keeping this in mind, our proposal learns two representations for each tweet in an account's profile, which are combined with a representation of the whole profile based on stylistic features, as shown in Figure 1.

The first representation is based on Convolutional and Long Short-Term Memory networks that analyze the tweets at word level (word-level encoder, R1); the second representation is learned with the same architecture, without sharing the weights of the network, but analyzing the tweets at character level (character-level encoder, R2). Finally, the last representation is based on stylistic features (R3), which capture relationships in the use of grammatical structures such as nouns and adjectives, word lengths, function words, etc.

Our overall classification model is based on LSTM neural networks with an attention mechanism (LSTM-Att). It receives as input the sequence of tweets in a Twitter feed. Firstly, each tweet in the sequence is encoded as a dense vector composed of the representations obtained by the word-level and character-level encoders, respectively. Later, the output vector learned by our LSTM-Att is combined with the stylistic features and fed into a dense layer to classify the account as a fake news spreader or not.

Figure 1: Overall Fake News Spreaders classification model.

Details about our proposal are presented in the next subsections. In particular, Section 3.1 introduces the preprocessing phase carried out on the tweets. Then, Section 3.2 describes the stylistic features used in this work for modeling the Twitter feeds. Later, Section 3.3 presents the architecture of the encoder models at word and character level used to obtain the deep learning-based representations of the messages. Finally, Section 3.4 describes our overall classification model (LSTM-Att) and explains how the representations are fused to classify the Twitter feed as belonging to a fake news spreader or not.

3.1 Preprocessing

In the preprocessing step, the tweets are cleaned. Firstly, numbers and dates are recognized and replaced by a corresponding wildcard that encodes the meaning of these special tokens. Afterwards, tweets are tokenized and morphologically analyzed by means of FreeLing [3][17]. We also deal with the problem of character flooding: all repetitions of three or more contiguous characters are normalized to only two characters. Notice that this normalization is applied only when the messages are processed at word level; at character level we do not apply the normalization step.
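The following is a minimal sketch of this normalization step using Python regular expressions. The wildcard tokens (__NUM__, __DATE__) and the simple date pattern are illustrative assumptions, not the exact tokens used by the system, and the sketch does not cover the FreeLing tokenization and morphological analysis.

```python
import re

# Illustrative wildcards; the system's actual wildcard tokens are not specified here.
NUM_WILDCARD = "__NUM__"
DATE_WILDCARD = "__DATE__"

# Very simple date pattern (e.g. 22/09/2020 or 22-09-2020); an assumption for illustration.
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")
# Any character repeated three or more times in a row (character flooding).
FLOOD_RE = re.compile(r"(.)\1{2,}")

def preprocess(tweet: str, word_level: bool = True) -> str:
    """Replace dates/numbers with wildcards and, at word level, collapse character flooding."""
    text = DATE_RE.sub(DATE_WILDCARD, tweet)   # dates first, so they are not consumed by NUM_RE
    text = NUM_RE.sub(NUM_WILDCARD, text)
    if word_level:
        # Normalize runs of 3+ identical characters down to exactly two.
        text = FLOOD_RE.sub(r"\1\1", text)
    return text

print(preprocess("Sooooo cool!!! 100 retweets on 22/09/2020"))
# -> "Soo cool!! __NUM__ retweets on __DATE__"
```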
3.2 Stylistic Features

A key aspect of our proposal is an input representation based on statistical style features that capture information from distinct lexical and syntactic linguistic layers. To represent a text, we defined 177 features that capture relevant characteristics of the author's writing style. The features are structured in six subsets corresponding to different textual layers: boolean, character, sentence, paragraph, syntactic and document. Examples of features for each layer:

1. Boolean layer: whether the same word finishes a sentence and begins the next sentence.
2. Character layer: average length of words.
3. Sentence layer: average number of words; average number of distinct prepositions.
4. Paragraph layer: average number of sentences; average number of words.
5. Syntactic layer: proportion of nouns over adjectives.
6. Text layer: average length of sentences.

Our stylistic feature-based representation is completely independent of the textual genre, and in our proposal it is built for each profile, see Figure 2. For that reason several features can appear with duplicated values, because each tweet is considered a document that contains only one paragraph, and this paragraph is formed by one sentence.

Figure 2: Tweets feed representation.

Notice that the representation involves values of different data types, since boolean stylistic features are computed alongside float-valued features.

3.3 CNN+LSTM Encoder Architecture

This section introduces the architecture of our CNN+LSTM encoder model, shown in Figure 3, which can be divided into two parallel sub-architectures that do not share weights but have the same structure. One of the encoder's sub-architectures processes the tweets at word level, whereas the other works at character level. This model aims at encoding syntactic and semantic properties of the tweets which could be correlated with the usage of fake and unverified content in the messages.

Figure 3: The architecture of the CNN+LSTM encoder.

CNN Layers. Our encoder model receives as input a tweet as a zero-padded sequence, so that all tweets have the same length l_s. This sequence is passed into an embedding layer. Notice that at word level this layer is set up with fixed weights from Google's pre-trained Word2Vec embeddings [16], whereas at character level its weights are initialized randomly and set as trainable. As output of the embedding layer, a matrix of real values Ê of dimensions l_s × d is obtained, where each row is a dense vector that represents an element in the tweet sequence. Later, convolutional operations (conv-op) are applied over the matrix Ê through a CNN layer composed of filters F_k ∈
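To make the encoder description more concrete, here is a minimal sketch of one such branch in Keras. The filter size, number of filters, LSTM units and the exact way the convolutional and recurrent layers are stacked are illustrative assumptions (the excerpt above stops before specifying them), and the random matrix merely stands in for the pre-trained Word2Vec weights.

```python
import numpy as np
from tensorflow.keras import initializers, layers, models

# Illustrative hyper-parameters; the paper's exact values are not given in this excerpt.
SEQ_LEN = 64        # l_s: length of the zero-padded tweet sequence
VOCAB_SIZE = 20000  # vocabulary (or character-set) size
EMBED_DIM = 300     # d: embedding dimension
NUM_FILTERS = 64
KERNEL_SIZE = 3
LSTM_UNITS = 64

def build_encoder(trainable_embeddings: bool, embedding_matrix=None) -> models.Model:
    """One branch of the CNN+LSTM encoder: embedding -> 1D convolution -> LSTM.

    At word level the embedding layer is frozen and initialized from pre-trained
    Word2Vec vectors; at character level it is randomly initialized and trainable.
    """
    inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
    embedded = layers.Embedding(
        input_dim=VOCAB_SIZE,
        output_dim=EMBED_DIM,
        embeddings_initializer=(initializers.Constant(embedding_matrix)
                                if embedding_matrix is not None else "uniform"),
        trainable=trainable_embeddings,
    )(inputs)                                            # matrix Ê of shape (l_s, d)
    conv = layers.Conv1D(NUM_FILTERS, KERNEL_SIZE,
                         padding="same", activation="relu")(embedded)
    encoded = layers.LSTM(LSTM_UNITS)(conv)              # dense vector encoding the tweet
    return models.Model(inputs, encoded)

# Word-level branch (frozen embeddings; the random matrix stands in for Word2Vec here)
word_encoder = build_encoder(False, np.random.rand(VOCAB_SIZE, EMBED_DIM))
# Character-level branch (trainable, randomly initialized embeddings); no shared weights.
char_encoder = build_encoder(True)
```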
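Returning to the stylistic representation of Section 3.2, the sketch below illustrates, in plain Python, how a handful of the listed features could be computed for a tweet treated as a one-sentence, one-paragraph document. The actual 177 features rely on FreeLing's morphological analysis (e.g. for prepositions, nouns and adjectives) and may be defined differently; these are only toy approximations.

```python
import re

def stylistic_features(tweet: str) -> dict:
    """Toy examples of features from several of the layers described in Section 3.2."""
    sentences = [s for s in re.split(r"[.!?]+", tweet) if s.strip()]
    words = re.findall(r"\w+", tweet)
    return {
        # Boolean layer: last word of one sentence equals first word of the next.
        "repeats_word_across_sentences": any(
            a.strip().split()[-1].lower() == b.strip().split()[0].lower()
            for a, b in zip(sentences, sentences[1:])
        ),
        # Character layer: average word length.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Sentence layer: average number of words per sentence.
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # Text/document layer: average sentence length in characters.
        "avg_sentence_length": (sum(len(s.strip()) for s in sentences) / len(sentences)
                                if sentences else 0.0),
    }

print(stylistic_features("Fake news spreads fast. Fast reactions are needed!"))
```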