=Paper=
{{Paper
|id=Vol-2765/129
|storemode=property
|title=TextWiller @ SardiStance, HaSpeede2: Text or Con-text? A Smart Use of Social Network Data in Predicting Polarization (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2765/paper129.pdf
|volume=Vol-2765
|authors=Federico Ferraccioli,Andrea Sciandra,Mattia Da Pont,Paolo Girardi,Dario Solari,Livio Finos
|dblpUrl=https://dblp.org/rec/conf/evalita/FerraccioliSPGS20
}}
==TextWiller @ SardiStance, HaSpeede2: Text or Con-text? A Smart Use of Social Network Data in Predicting Polarization (short paper)==
TextWiller @ SardiStance, HaSpeede2: Text or Con-text? A Smart Use of Social Network Data in Predicting Polarization

Federico Ferraccioli (a), Andrea Sciandra (b), Mattia Da Pont (c), Paolo Girardi (a), Dario Solari (d), Domenico Madonna (a), Livio Finos (a)

a. Università degli Studi di Padova; b. Università degli Studi di Modena e Reggio Emilia; c. WMRI; d. BeeViva

ferraccioli@stat.unipd.it, andrea.sciandra@unimore.it, mattia.dapont@wmr.it, paolo.girardi@unipd.it, dario.solari@gmail.com, domenico.madonna@studenti.unipd.it, livio.finos@unipd.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In this contribution we describe the systems (i.e. statistical models) used to participate in the EVALITA 2020 campaign, SardiStance (Tasks A and B) and HaSpeeDe2 (Tasks A and B). We first developed a classifier by extracting features from the texts and from the social network of the users. We then fit the data through extreme gradient boosting, with cross-validation tuning of the hyper-parameters. A key factor for the good performance in SardiStance Task B was the feature extraction via Multidimensional Scaling of the distance matrix (minimum path, undirected graph) applied to each network. The second system exploits the same features, but it trains and predicts in two steps. Its performance proved to be lower than that of the single-step model.

1 Introduction

In this paper we describe and show the results of the approach we developed to participate in the SardiStance task (Cignarella et al., 2020) for polarity detection (Tasks A and B, both with constrained data) within the EVALITA campaign (Basile et al., 2020). The goal of this task was stance detection in Italian tweets about the Sardines movement. Task A is a three-class classification task where the system has to predict whether a tweet is in favour, against or neutral/none towards the given target, exploiting only textual information, i.e. the text of the tweet. Task B is the same as the first one, except that a wider range of contextual information is available, that is: the number of retweets, the number of favours, the type of posting source (e.g. iOS or Android), and the date of posting. Furthermore, the networks of the users based on Friends, Quote, Reply and Retweet were provided. We developed two systems (i.e. models) extracting features from the text (for both Task A and B) and from the social network of the users (only for Task B), and then exploited extreme gradient boosting (Chen et al., 2020) to train the model on the data. Cross-validation hyper-parameter tuning was used to define the optimal set of parameters.

We used a very similar strategy for HaSpeeDe2 (Sanguinetti et al., 2020), where the goal is the prediction of Hate Speech (Task A) and Stereotype (Task B). In this case, however, the sample contains documents from three different topics. We believe that these may be characterized by different vocabularies and kinds of speech. We take this into account in the prediction model, as explained in Section 3.3.

2 Features extraction and E.D.A.

2.1 Text-based Features extraction

The text preprocessing was done in R (R Core Team, 2019) with the package TextWiller (Solari et al., 2019) (function normalizzaTesti with default parameters). We describe the process used to define the features for both SardiStance and HaSpeeDe2.
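A minimal sketch of this normalization step is given below, assuming the TextWiller package is installed and that raw_tweets is a character vector holding the raw tweet texts of the task data (an assumed input, not part of the paper's released code).

<pre>
## Sketch of the preprocessing step: normalizzaTesti() with default
## parameters, as described in the paper. `raw_tweets` is an assumed input.
library(TextWiller)

norm_texts <- normalizzaTesti(raw_tweets)
</pre>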
The first set of features is defined by the columns of the document-term matrix, i.e. a matrix with documents on the rows and a column for each term, whose cells contain the number of occurrences of the given term in the document. We defined the matrix on the basis of the normalized texts, removing terms (i.e. columns) with a sparsity larger than .9. This procedure generated a vocabulary of 317 terms for SardiStance and 170 terms for HaSpeeDe2.

In Figure 1 we plot the term frequencies of the "In favour" and "Against" stances. The terms close to the bisector are the ones with a similar frequency in the two classes (such as "caro", "alto", "acqua"), so these terms probably do not carry much useful information for our purpose. The interesting terms are more often found far from the bisector, like "bolognanonsilega", "antifascismo", "abuso" or "branco", and we expected these terms to carry more weight in the classification model.

Figure 1: Scatterplot of "Favour" and "Against" term frequencies.

Further text features considered were: the number of characters, the number of words, and the counts of "?" and "!" in each document. Moreover, a sentiment value was computed for each document by the sentiment function of the R package TextWiller (Solari et al., 2019). Figure 2 shows the association between the true stances and the sentiment; this variable was used as a feature in both the Task A and Task B models.

Figure 2: Mosaic plot of true stances and sentiment, showing a clear association between the two variables.

The previous analyses, such as the sentiment attribution through a lexicon, refer to a bag-of-words (BoW) approach. One of the most notable disadvantages of BoW is that it generally fails to capture word semantics, since it ignores word order. A common solution to this problem involves the use of Word Embeddings (WE). WE techniques are based on neural networks and generate dense vectors for word representation by defining a context window, i.e. a string of words before and after a focal word, that is used to train a word embedding model. In WE, words are represented as coordinates in a latent multidimensional space derived from an underlying deep learning model that considers the contiguous words. So, for both tasks we also used a WE technique to produce context-based features. In particular, we used the word2vec model (Mikolov et al., 2013), a widely used natural language processing technique to extract word associations from a large corpus of text. word2vec is a neural network prediction model comprising the continuous bag-of-words (CBoW) model and the Skip-gram (SG) model. The CBoW model predicts a target word from its context words, while the SG model predicts the context words given a target word. Since WE needs a huge corpus of textual data for training and given the limited amount of tweets, we augmented the data with the PAISÀ corpus (Lyding et al., 2013), a large collection of Italian web texts. We trained the model with the embedding dimension set to 50 and a context window of 5 words. The resulting word vectors are then combined via averaging to obtain the final document features.
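The sketch below illustrates how these text-based features can be built, reusing norm_texts from the sketch above. The package choices (tm for the document-term matrix, the word2vec package for the embeddings) and the paisa_sentences object holding the external corpus are assumptions for illustration, not the authors' exact pipeline.

<pre>
## Sketch of the text-based features: sparse document-term matrix, lexicon
## sentiment, and averaged word2vec embeddings (CBoW, dim 50, window 5).
library(tm)
library(word2vec)
library(TextWiller)

## 1. Document-term matrix on the normalized texts, dropping very sparse terms
corpus <- VCorpus(VectorSource(norm_texts))
dtm    <- removeSparseTerms(DocumentTermMatrix(corpus), sparse = 0.9)
dtm_features <- as.matrix(dtm)

## 2. Lexicon-based sentiment score (one value per document)
sent <- sentiment(norm_texts)

## 3. word2vec trained on the tweets augmented with an external corpus
##    such as PAISA' (`paisa_sentences` is an assumed input)
train_texts <- c(norm_texts, paisa_sentences)
w2v <- word2vec(x = train_texts, type = "cbow", dim = 50, window = 5)
emb <- as.matrix(w2v)                       # rows = words, cols = 50 dims

## 4. Document-level features: average the vectors of the words in each tweet
doc_vector <- function(txt) {
  w <- intersect(strsplit(txt, "\\s+")[[1]], rownames(emb))
  if (length(w) == 0) return(rep(0, ncol(emb)))
  colMeans(emb[w, , drop = FALSE])
}
we_features <- t(vapply(norm_texts, doc_vector, numeric(ncol(emb))))
</pre>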
2.2 Network-based Features extraction

A key point explaining the good performance in SardiStance Task B (i.e. the second best score, F-avg = 0.7309) is the efficient extraction of features from the four available networks, that is: Friends, Retweet, Reply, and Quote. For each network, a distance matrix among subjects was computed. The distance used is the shortest path, forcing the graph to be undirected. The distance matrix was then projected into a Euclidean space through Multidimensional Scaling (MDS). Since we expected the users to be strongly polarized in clusters within the network, we also expected the largest dimensions to discriminate among the stances. Therefore, we retained the first and second dimension for each of the four networks. This expectation was confirmed by exploratory data analysis. As an example, in Figure 3 we show the scatter plot of the first two dimensions for the Friend network. The first dimension clearly discriminates the three stances (in particular Favour vs Against).

Figure 3: Scatter plot of the first and second dimension extracted by MDS from the distance matrix of the Friend network (minimum path distance). There is a clear separation between the stances Favour and Against along the first axis.
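A small sketch of this network feature extraction is given below, assuming friend_edges is a two-column edge list (pairs of user ids) for the Friend network; the treatment of disconnected user pairs is an illustrative choice, not specified in the paper.

<pre>
## Sketch: shortest-path ("minimum path") distances on the undirected graph,
## projected to two dimensions with classical MDS.
library(igraph)

g <- graph_from_data_frame(friend_edges, directed = FALSE)

# Shortest-path distance matrix among users
D <- distances(g)

# Disconnected pairs yield Inf; replace with a large finite value (assumption)
D[is.infinite(D)] <- max(D[is.finite(D)]) + 1

# Classical multidimensional scaling: retain the first two coordinates
mds <- cmdscale(as.dist(D), k = 2)
net_features <- data.frame(user    = rownames(D),
                           friend1 = mds[, 1],
                           friend2 = mds[, 2])
</pre>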
3 Developed Systems

Due to the relatively small sample size of the training set (2,132 tweets, in Italian, following the Bender Rule), we decided not to use any neural network. Instead, we preferred a gradient boosting approach (Friedman, 1999). Since this method has been developed within the statistical learning community, we use the word "model" as a synonym of "system". We adopted the R implementation of XGBoost (eXtreme Gradient Boosting) (Chen et al., 2020). Cross-validation parameter tuning was used to define the optimal set of parameters.

3.1 System One

As features for Task A, we used information taken from the text, that is: words/emoticons, special characters, word embedding scores (50 dimensions), sentiment, length of the message, and number of words.

For Task B we used the same features as for Task A, together with the first and second dimension extracted from the MDS computed for each network (as explained in Section 2.2).

3.2 System Two

Since System Two uses the same features as System One for Tasks A and B, the focus here is on the evaluation metric: the average between F1_Against and F1_Favour. With the aim of casting the model into the metric, we fitted two separate models (i.e. one for Favour and one for Against) in a first step, and then combined the two predictions in a second step. To be more precise, the two models used in the first step predict whether a document is in Favour or not (first model) and whether it is Against or not (second model). The two predictions are combined into a final score by a simple subtraction: (Predicted1 == Favour) - (Predicted2 == Against), which yields a final score in {-1, 0, 1}.
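A minimal sketch of how this two-step scheme could be implemented is shown below. The feature matrices X and X_test, the label vector stance, the xgboost parameter values, the 0.5 threshold and the mapping of the {-1, 0, 1} score back to the three stances are all illustrative assumptions.

<pre>
## Sketch of System Two: two binary xgboost models tuned by cross-validation,
## combined into a {-1, 0, 1} score in a second step.
library(xgboost)

fit_binary <- function(X, y01) {
  dtrain <- xgb.DMatrix(data = X, label = y01)
  params <- list(objective = "binary:logistic", eta = 0.1, max_depth = 4)
  # cross-validation to pick the number of boosting rounds
  cv <- xgb.cv(params = params, data = dtrain, nrounds = 300, nfold = 5,
               early_stopping_rounds = 20, verbose = 0)
  xgb.train(params = params, data = dtrain, nrounds = cv$best_iteration)
}

m_favour  <- fit_binary(X, as.integer(stance == "FAVOUR"))
m_against <- fit_binary(X, as.integer(stance == "AGAINST"))

# Second step: combine the two binary predictions into a single score
p_favour  <- as.integer(predict(m_favour,  xgb.DMatrix(X_test)) > 0.5)
p_against <- as.integer(predict(m_against, xgb.DMatrix(X_test)) > 0.5)
score <- p_favour - p_against                 # -1, 0 or 1
pred  <- factor(score, levels = c(-1, 0, 1),
                labels = c("AGAINST", "NONE", "FAVOUR"))
</pre>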
3.3 System for HaSpeeDe2

The corpus of documents for HaSpeeDe2 is a sample of tweets from three different topics, namely Immigrants, Muslims and Roma communities. Since the vocabulary may change across topics, we want our models to account for this specificity. We do so with models that use the estimated topic. The topic is estimated by an xgboost model (trained by cross-validation). Table 1 and Table 2 report the confusion matrix and the performance indices of the trained model (cross-validated).

Table 1: Confusion matrix for the xgboost topic model (rows: prediction; columns: reference).

              Immigrants   Rom   Terrorism
  Immigrants         408    24          55
  Rom                 24   780          16
  Terrorism           41     8         192

Table 2: Sensitivity, Specificity and F1 for each topic for the xgboost topic model.

              Immigrants   Rom   Terrorism
  Sensitivity       0.86  0.96        0.73
  Specificity       0.93  0.95        0.96
  F1                0.85  0.96        0.76

System One is based on an xgboost with binomial response (for both tasks). The fitting is done separately, after splitting the sample according to the topic classification provided by the model described above. The model is trained with the same cross-validated strategy used to train System One for the SardiStance task.

System Two is also based on an xgboost with binomial response (for both tasks). The estimate is computed on the whole sample (i.e. without the splitting of System One), but the topic classification is used as a feature.

For both systems the basic set of features is the same used in SardiStance Task A.

4 Results and discussion

4.1 Results for HaSpeeDe2

The results of the two systems are disappointing. The final ranks are always at the very bottom of the rankings. This may be partially due to a sub-optimal parameter optimization (we discovered a mistake in the parameter setting), but this is certainly not the only reason. We will take this result as an opportunity to revise the approach.

4.2 Results for SardiStance

System Two performed poorly in the final score for both tasks. Our intuition is that the benefit of a separate optimization of F1_Against and F1_Favour was outweighed by the gain of a joint training (i.e. System One). We will devote further efforts to better understanding this result.

The results for System One are given in Table 3 for Task A and Table 4 for Task B, respectively. The rank of System One in Task A is 13, which is just below the benchmark. The system estimated the Against stance fairly well (F1_Against = 0.776), while it was weak in the correct estimation of the Favour stance (F1_Favour = 0.3791). The best performance of System One is on Task B (F1_Against = 0.8505, F1_Favour = 0.6114), where it scored 2nd position.

Table 3: Confusion matrix for Task A (System One; rows: prediction, columns: reference). F1_Against = 0.776, F1_Favour = 0.3791, Final: (F1_Against + F1_Favour)/2 = 0.5773.

            AGAINST   NONE   FAVOUR
  AGAINST       613    118      108
  NONE           32     22       12
  FAVOUR         97     32       76

Table 4: Confusion matrix for Task B (System One; rows: prediction, columns: reference). F1_Against = 0.8505, F1_Favour = 0.6114, Final: (F1_Against + F1_Favour)/2 = 0.7309.

            AGAINST   NONE   FAVOUR
  AGAINST       623     71       29
  NONE           54     44       27
  FAVOUR         65     57      140

To support the intuition that the network-based features play a crucial role in this model, we explored the feature importance. Results are given in Table 5 (top 10).

Table 5: Top 10 features by importance. Legend: NW = MDS dimension of the network; WE = word-embedding dimension.

  Rank  Feature          Importance
     1  NW Retweet1            0.13
     2  NW Friend1             0.12
     3  NW Quote2              0.04
     4  Created at             0.02
     5  WE24                   0.02
     6  Statuses count         0.02
     7  NW Retweet2            0.02
     8  WE14                   0.02
     9  WE10                   0.01
    10  WE25                   0.01

The top three, far more important, features are dimensions extracted by the MDS approach explained in Section 2.2.
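A feature-importance ranking like the one in Table 5 can be extracted from a fitted xgboost model as sketched below; bst and feature_names are assumed to come from a training step such as the one sketched in Section 3.2.

<pre>
## Sketch: feature-importance ranking from a fitted xgboost model.
library(xgboost)

imp <- xgb.importance(feature_names = feature_names, model = bst)
head(imp, 10)   # the Gain column plays the role of the reported importance
</pre>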
5 Conclusion

For SardiStance, the System One proposed here performed well in Task B, while it obtained a much poorer result in Task A. It exploits a simple method to handle the network-based information, while further refinement should be devoted to the exploitation of the text-based information. In this way we want to stress the importance of data mashup, as the system we deployed showed better results for Task B, which contains, in addition to the texts, information of a different nature derived from the network structures.

It is to be expected that the different networks carry similar information. A future direction of research should be the joint analysis of the networks. There is a lively community working on multilayer networks (De Domenico et al., 2013; Durante et al., 2017) that may inspire a more effective use of this joint information.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). CEUR-WS.org.

Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, Mu Li, Junyuan Xie, Min Lin, Yifeng Geng, and Yutian Li. 2020. xgboost: Extreme Gradient Boosting. R package version 1.0.0.2.

Alessandra Teresa Cignarella, Mirko Lai, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2020. SardiStance@EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Manlio De Domenico, Albert Solé-Ribalta, Emanuele Cozzo, Mikko Kivelä, Yamir Moreno, Mason A. Porter, Sergio Gómez, and Alex Arenas. 2013. Mathematical Formulation of Multilayer Networks. Physical Review X, 3(4):041022, October.

Daniele Durante, David B. Dunson, and Joshua T. Vogelstein. 2017. Nonparametric Bayes modeling of populations of networks. Journal of the American Statistical Association, 112(520):1516–1530.

Jerome H. Friedman. 1999. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367–378.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2013. PAISÀ corpus of Italian web text. Eurac Research CLARIN Centre.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. Overview of the EVALITA 2020 Second Hate Speech Detection Task (HaSpeeDe 2). In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Dario Solari, Andrea Sciandra, and Livio Finos. 2019. TextWiller: Collection of functions for text mining, specially devoted to the Italian language. Journal of Open Source Software, 4(41):1256.