=Paper=
{{Paper
|id=Vol-2696/paper_242
|storemode=property
|title=Fusing Stylistic Features with Deep-learning Methods for Profiling Fake News Spreader
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_242.pdf
|volume=Vol-2696
|authors=Roberto Labadie,Daniel Castro Castro,Reynier Ortega Bueno
|dblpUrl=https://dblp.org/rec/conf/clef/LabadieCB20
}}
==Fusing Stylistic Features with Deep-learning Methods for Profiling Fake News Spreaders==
Fusing Stylistic Features with Deep-learning Methods for Profiling Fake News Spreaders
Notebook for PAN at CLEF 2020

Roberto Labadie-Tamayo, Daniel Castro-Castro, and Reynier Ortega-Bueno
Universidad de Oriente, Santiago de Cuba, Cuba
roberto.labadie@estudiantes.uo.edu.cu, danielbaldauf@gmail.com, reynier.ortega@uo.edu.cu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. False or unverified information spreads on social media platforms just like accurate information, so it can go viral and influence public opinion and decisions. Fake news is one of the most popular forms of false and unverified information and should be identified as soon as possible to minimize its harmful effects. To face this challenge, in this paper we describe the system we developed for the Author Profiling task "Profiling Fake News Spreaders on Twitter" proposed at the PAN 2020 forum. Our proposal learns two representations for each tweet in an account's profile. The first one is based on CNN and LSTM networks that analyze the tweets at word level; the second one is learned with the same architecture, without sharing weights, but analyzing the tweets at character level. These representations are used to model the accounts' profiles. In addition, a general representation of the whole profile is built from stylistic features, which capture information about the writing style. Finally, the profile representations are fed into our classification model, which is based on LSTM neural networks with an attention mechanism. Experimental results show that our model achieves encouraging results.

1 Introduction

In today's society, a large part of the communication between people takes place through digital information: videos, photos, texts, among others. Every day the amount of digital information generated by each person grows very fast, and analyzing these data to satisfy users' needs is a challenging task. The analysis of textual data in particular introduces several challenges, because texts are unstructured information written in Natural Language (NL), and there is a wide variety of textual genres and languages.

Computational applications for e-marketing and forensic analysis based on textual information deal with this diversity of challenges, but it is also important for them to know the socio-demographic characteristics of the texts' authors in order to direct the extraction of valuable information from textual sources. The Author Profiling (AP) task aims at discovering patterns (linguistic or otherwise) in texts that allow identifying characteristics of their authors, e.g. age, gender, personality, etc.

One of the most important evaluation forums for sharing recent research and ideas on tasks such as Author Profiling and Authorship Detection (AD) is the PAN platform (https://pan.webis.de/). In its last editions, the AP task has focused on the micro-blogging genre, because of the importance of knowing what information is spread among users and communities, how it spreads, and what the profile characteristics of such users are. The PAN 2016 profiling task [24] addressed the prediction of age and gender from a cross-genre perspective, using tweets as training samples and a different textual genre for evaluation.
The main goal of the PAN 2017 profiling task [23] was to identify the gender and language variety of the authors of Twitter messages. In the PAN 2018 profiling task [22] the goal was to address gender identification from a multi-modal perspective, considering text and images. The last edition, the PAN 2019 profiling task [19], focused on determining whether the author of a Twitter feed is a bot or a human, and on profiling the gender of human authors. There are other efforts in the research community on AP, such as profiling authors of Arabic [21][20][28], Russian [14] and the Mexican variant of the Spanish language [2]. Overall, the most studied languages have been English and Spanish, whereas gender and age are the most explored socio-demographic characteristics. In addition, Twitter has risen as the most salient textual genre, due to its popularity as a social platform.

A harmful phenomenon that is not new, but has gained special significance in recent years, is the spreading of fake news, due to its dramatic growth in social media and news sources. That is why many recent works try to determine whether a text is fake or at least contains false or unverified information. However, as important as knowing whether a text is fake is verifying whether a source generates and spreads fake information or misinformation; for example, a company that promotes a product without the advertised quality, or false data published in an election campaign. These scenarios can be analyzed with Author Profiling techniques. In this direction, this year's PAN profiling task, "Profiling Fake News Spreaders on Twitter", focuses on the following problem: "Given a Twitter feed, determine whether its author is keen to be a spreader of fake news" [18].

This working note is organized as follows: the next section gives a brief description of the main aspects of the related work presented in the last three profiling evaluation forums. Next, we present our proposal; specifically, we describe the data preprocessing as well as the deep-learning method used as classification model. Finally, we describe the experimental setting, the experiments conducted and the results achieved.

2 Related Works

Different aspects should be considered as part of the AP algorithms presented by the research community and highlighted in the overviews published by the organizers of the different PAN profiling editions. An important issue is the preprocessing step applied to the texts, considering that genre-dependent characteristics can be eliminated or homogenized, such as, in tweets, the use of hashtags, mentions, URLs, etc. Generally, previous approaches have used content- and style-based features and, in recent years, word and character embeddings [23][22][19].

Regarding the machine learning algorithms, most of the studied works have employed traditional methods like Logistic Regression (LR) [31], Support Vector Machines (SVM) [26] and Random Forest (RF) [11], among others. The textual information has been represented through lexical and character features, using n-gram models and bag-of-words (BoW) representations [26][31]. Several approaches have employed style features such as the analysis of capital letters, function words, punctuation marks, etc. [6][8]. In recent PAN profiling tasks, word and document embeddings were introduced [23][22][19][15][12]. Shallow machine learning methods are the most employed ones, and among these, the most used are SVM, RF, LR and distance-based approaches.
Deep-learning techniques have been introduced since PAN 2017. In particular, these methods have focused on the use of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [25][5]. The goal of our approach is to fuse general stylistic features with deep learning-based representations. The stylistic feature vector is used as an input representation that takes part in the learning process of our deep-learning architecture.

3 Our Proposal

The motivation of our approach is twofold. Firstly, deep-learning methods are able to learn feature representations that are omitted by hand-crafted feature engineering, and they can model levels of abstraction beyond human reach. Secondly, author profiling based on stylistic and linguistic features combined with shallow supervised learning methods has been well studied in previous research; these features have proved to be adequate descriptors for determining author characteristics such as age, gender, language variety, personality, etc. Keeping this in mind, our proposal learns two representations for each tweet in an account's profile, which are combined with a representation of the whole profile based on stylistic features, as shown in Figure 1.

The first representation is based on Convolutional and Long Short-Term Memory networks that analyze the tweets at word level (word-level encoder, R1); the second representation is learned with the same architecture, without sharing the weights of the network, but analyzing the tweets at character level (character-level encoder, R2). Finally, the last representation is based on stylistic features (R3), which capture relationships in the use of grammatical structures such as nouns and adjectives, word lengths, function words, etc.

Our overall classification model is based on LSTM neural networks with an attention mechanism (LSTM-Att). It receives as input the sequence of tweets in a Twitter feed. Firstly, each tweet in the sequence is encoded as a dense vector composed of the representations obtained by the word-level and character-level encoders, respectively. Later, the output vector learned by our LSTM-Att is combined with the stylistic features and fed into a dense layer to classify the account as a fake news spreader or not.

Figure 1: Overall Fake News Spreaders classification model.

Details about our proposal are presented in the next subsections. In particular, Section 3.1 introduces the preprocessing phase carried out on the tweets. Then, Section 3.2 describes the stylistic features used in this work for modeling the Twitter feeds. Later, Section 3.3 presents the architecture of the encoder models at word and character level used to obtain the deep learning-based representations of the messages. Finally, Section 3.4 describes our overall classification model (LSTM-Att) and explains how the representations are fused to classify the Twitter feed as belonging to a fake news spreader or not.

3.1 Preprocessing

In the preprocessing step, the tweets are cleaned. Firstly, numbers and dates are recognized and replaced by a corresponding wildcard that encodes the meaning of these special tokens. Afterwards, tweets are tokenized and morphologically analyzed by means of FreeLing [3][17]. We also deal with the problem of character flooding: all repetitions of three or more contiguous characters are normalized to only two characters. Notice that this normalization is applied only when the messages are processed at word level; at character level we do not apply the normalization step.
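The following is a minimal sketch of this normalization step using Python regular expressions. The wildcard tokens (__NUM__, __DATE__) and the simple date pattern are illustrative assumptions, not the exact tokens used by the system, and the sketch does not cover the FreeLing tokenization and morphological analysis.

```python
import re

# Illustrative wildcards; the system's actual wildcard tokens are not specified here.
NUM_WILDCARD = "__NUM__"
DATE_WILDCARD = "__DATE__"

# Very simple date pattern (e.g. 22/09/2020 or 22-09-2020); an assumption for illustration.
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")
# Any character repeated three or more times in a row (character flooding).
FLOOD_RE = re.compile(r"(.)\1{2,}")

def preprocess(tweet: str, word_level: bool = True) -> str:
    """Replace dates/numbers with wildcards and, at word level, collapse character flooding."""
    text = DATE_RE.sub(DATE_WILDCARD, tweet)   # dates first, so they are not consumed by NUM_RE
    text = NUM_RE.sub(NUM_WILDCARD, text)
    if word_level:
        # Normalize runs of 3+ identical characters down to exactly two.
        text = FLOOD_RE.sub(r"\1\1", text)
    return text

print(preprocess("Sooooo cool!!! 100 retweets on 22/09/2020"))
# -> "Soo cool!! __NUM__ retweets on __DATE__"
```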
3.2 Stylistic Features

A key aspect of our proposal is an input representation based on statistical style features that capture information from distinct lexical and syntactic linguistic layers. To represent a text, we defined 177 features that capture relevant characteristics of the author's writing style. The features are structured in six subsets corresponding to different textual layers: boolean, character, sentence, paragraph, syntactic and document. Examples of features for each layer:

1. Boolean layer: whether the same word finishes a sentence and begins the next sentence.
2. Character layer: average length of words.
3. Sentence layer: average number of words; average number of distinct prepositions.
4. Paragraph layer: average number of sentences; average number of words.
5. Syntactic layer: proportion of nouns over adjectives.
6. Text layer: average length of sentences.

Our stylistic feature-based representation is completely independent of the textual genre, and in our proposal it is built for each profile, see Figure 2. For that reason several features can appear with duplicated values, because each tweet is considered a document that contains only one paragraph, and this paragraph is formed by one sentence.

Figure 2: Tweets feed representation.

Notice that the representation involves values of different data types, since boolean stylistic features are computed alongside float-valued features.

3.3 CNN+LSTM Encoder Architecture

This section introduces the architecture of our CNN+LSTM encoder model, shown in Figure 3, which can be divided into two parallel sub-architectures that do not share weights but have the same structure. One of the encoder's sub-architectures processes the tweets at word level, whereas the other works at character level. This model aims at encoding syntactic and semantic properties of the tweets which could be correlated with the usage of fake and unverified content in the messages.

Figure 3: The architecture of the CNN+LSTM encoder.

CNN Layers. Our encoder model receives as input a tweet as a zero-padded sequence, so that all tweets have the same length l_s. This sequence is passed into an embedding layer. Notice that at word level this layer is set up with fixed weights from Google's pre-trained Word2Vec embeddings [16], whereas at character level its weights are initialized randomly and set as trainable. As output of the embedding layer, a matrix of real values Ê of dimensions l_s × d is obtained, where each row is a dense vector that represents an element in the tweet sequence. Later, convolutional operations (conv-op) are applied over the matrix Ê through a CNN layer composed of filters F_k ∈
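To make the encoder description more concrete, here is a minimal sketch of one such branch in Keras. The filter size, number of filters, LSTM units and the exact way the convolutional and recurrent layers are stacked are illustrative assumptions (the excerpt above stops before specifying them), and the random matrix merely stands in for the pre-trained Word2Vec weights.

```python
import numpy as np
from tensorflow.keras import initializers, layers, models

# Illustrative hyper-parameters; the paper's exact values are not given in this excerpt.
SEQ_LEN = 64        # l_s: length of the zero-padded tweet sequence
VOCAB_SIZE = 20000  # vocabulary (or character-set) size
EMBED_DIM = 300     # d: embedding dimension
NUM_FILTERS = 64
KERNEL_SIZE = 3
LSTM_UNITS = 64

def build_encoder(trainable_embeddings: bool, embedding_matrix=None) -> models.Model:
    """One branch of the CNN+LSTM encoder: embedding -> 1D convolution -> LSTM.

    At word level the embedding layer is frozen and initialized from pre-trained
    Word2Vec vectors; at character level it is randomly initialized and trainable.
    """
    inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
    embedded = layers.Embedding(
        input_dim=VOCAB_SIZE,
        output_dim=EMBED_DIM,
        embeddings_initializer=(initializers.Constant(embedding_matrix)
                                if embedding_matrix is not None else "uniform"),
        trainable=trainable_embeddings,
    )(inputs)                                            # matrix Ê of shape (l_s, d)
    conv = layers.Conv1D(NUM_FILTERS, KERNEL_SIZE,
                         padding="same", activation="relu")(embedded)
    encoded = layers.LSTM(LSTM_UNITS)(conv)              # dense vector encoding the tweet
    return models.Model(inputs, encoded)

# Word-level branch (frozen embeddings; the random matrix stands in for Word2Vec here)
word_encoder = build_encoder(False, np.random.rand(VOCAB_SIZE, EMBED_DIM))
# Character-level branch (trainable, randomly initialized embeddings); no shared weights.
char_encoder = build_encoder(True)
```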
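Returning to the stylistic representation of Section 3.2, the sketch below illustrates, in plain Python, how a handful of the listed features could be computed for a tweet treated as a one-sentence, one-paragraph document. The actual 177 features rely on FreeLing's morphological analysis (e.g. for prepositions, nouns and adjectives) and may be defined differently; these are only toy approximations.

```python
import re

def stylistic_features(tweet: str) -> dict:
    """Toy examples of features from several of the layers described in Section 3.2."""
    sentences = [s for s in re.split(r"[.!?]+", tweet) if s.strip()]
    words = re.findall(r"\w+", tweet)
    return {
        # Boolean layer: last word of one sentence equals first word of the next.
        "repeats_word_across_sentences": any(
            a.strip().split()[-1].lower() == b.strip().split()[0].lower()
            for a, b in zip(sentences, sentences[1:])
        ),
        # Character layer: average word length.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Sentence layer: average number of words per sentence.
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # Text/document layer: average sentence length in characters.
        "avg_sentence_length": (sum(len(s.strip()) for s in sentences) / len(sentences)
                                if sentences else 0.0),
    }

print(stylistic_features("Fake news spreads fast. Fast reactions are needed!"))
```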