Profiling Hate Speech Spreaders on Twitter using stylistic features and word embeddings
Notebook for PAN at CLEF 2021

Lucía Gómez-Zaragozá 1 and Sara Hinojosa Pinto 2
1 Instituto de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, Valencia, Spain
2 Multiscan Technologies S.L., Universitat Politècnica de València, Valencia, Spain

Abstract
This paper presents the solutions proposed for the Profiling Hate Speech Spreaders on Twitter task at PAN 2021, which consists of classifying each author as hater or non-hater from a set of their tweets, in Spanish and English. A different approach is proposed for each language. For Spanish, an ensemble of an LSTM and a Logistic Regression model trained with stylistic features is used. For English, an ensemble of an SVC and a Random Forest model, also trained with stylistic features, is proposed. Our solutions achieved an accuracy of 83% in Spanish and 58% in English, resulting in an overall accuracy of 70.5% in the task ranking.

Keywords
Hate speech, author profiling, natural language processing, NLP, embeddings, LSTM, Twitter, machine learning

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
EMAIL: lugoza@i3b.upv.es (A. 1); shinojosa@multiscan.eu (A. 2)
ORCID: 0000-0001-9885-2559 (A. 1); 0000-0002-8166-8138 (A. 2)
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Automatic hate speech detection on social media has become a topic of growing interest in the artificial intelligence community and, particularly, in the area of Natural Language Processing [1]. Although different definitions can be found in the literature, hate speech is commonly described as language that attacks or disparages a person or a group based on specific characteristics that include, among others, physical appearance, nationality, religion or sexual orientation [2]. Given the huge amount of user-generated content and the rapid dissemination of information these days, being able to identify not just isolated hate speech comments but hate speech spreaders is a key first step in trying to prevent hate speech from spreading in online communications.

This paper describes the models proposed for the PAN 2021 Profiling Hate Speech Spreaders on Twitter task [3], one of the three tasks of PAN at CLEF 2021 [4], deployed on the TIRA platform [5]. The dataset provided in the shared task consisted of a balanced set of users: those who have shared some hate speech tweets are labeled as haters, and the rest as non-haters. It was provided in two languages, namely Spanish and English. For each language, the dataset included 200 different users and 200 tweets per user. As recommended by the shared task, we present a different solution for each language. For the Spanish dataset, an ensemble of an LSTM and a logistic regression model trained with stylistic features is proposed, which achieved 83% accuracy on the provided test set. For the English dataset, an ensemble of a Support Vector Classification model and a Random Forest, both based on stylistic features, is presented, which achieved 58% accuracy on the provided test set.

In Section 2 we present some related work on profiling hate speech spreaders. In Section 3 we describe the two proposed approaches, including the description of the features used and the implemented machine learning models. In Section 4 we present the experimental results achieved for both languages independently. Finally, in Section 5, we present the conclusions and future work.
2. Related work

Generic text mining features are commonly used for hate speech detection [2]. These include several types of characteristics, such as those obtained from dictionaries, bag-of-words (BOW), N-grams, TF-IDF, Part-of-Speech (POS) or word embeddings. There are also features specific to hate speech detection, but in some cases they require additional user information (such as gender, age or geographic location), or they focus on specific stereotypes. Regarding the algorithms used for hate speech detection, which is typically treated as a binary classification problem (hate vs. not-hate), the most common are Support Vector Machines, followed by Random Forest, Decision Trees and Logistic Regression [2]. More recent approaches use deep learning techniques, such as attention-based neural networks [6] or ensembles of neural networks [7], obtaining good performance results.

In addition, the aim of this shared task is not only to detect hateful content, but also to profile hate speech spreaders. In this sense, common features used in the field of author profiling are stylistic features (such as frequency of punctuation marks, capital letters or word frequency), content features (such as BOW, TF-IDF or N-grams), POS tags, readability features or emotional features (emotion words and emoticons) [8, 9]. Other message features such as retweets, hashtags, URLs and mentions are also considered in this area and, recently, word and character embeddings have also been applied. Regarding the algorithms used for author profiling, traditional machine learning models are widely used, but in the last few years deep learning approaches such as Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) have gained attention [10].

3. Methodology

The proposed models aim to discriminate hate speech spreaders from those who have never shared hate speech content on Twitter. They were built as an ensemble of classifiers, using two different approaches. On the one hand, stylistic features were extracted for each tweet and per-author statistics were computed from them in order to apply classic machine learning algorithms. On the other hand, a neural network with word embeddings was trained with the grouped tweets of each author.

Since no development set was provided in the shared task, it was decided to randomly split the training set into two partitions: 90% of the users for the development set and 10% for the test set, containing 180 and 20 users, respectively. These data partitions were used to evaluate the models using the official metric for this task, accuracy, and to compare their performance on the same unseen data. The best models were then applied to the test data provided in the shared task, whose results were used to rank the performance of our system. The following sub-sections describe the two approaches.

3.1. Word embeddings and LSTM

In this approach, each author's set of tweets was first aggregated in order to obtain one text per subject. A preprocessing step was also applied in order to remove accents, capital letters, double spaces and stop-words. Then, the development set was divided into two partitions: 60% of the users for the training set and 40% of the users for the validation set, with 108 and 72 users, respectively.

First, a tokenizer with a selected maximum number of words was fitted to the training set, so that only the top words remained in the vocabulary and the least used words were eliminated. The next step was to convert texts into sequences, meaning that each word of the text was translated into the index of that word in the vocabulary. The last step was to pad the sequences, so that they all had the same length regardless of the number of words they originally had. Specifically, a maximum sequence length of 1000 words was set, with each sequence being the collection of tweets from one subject, so that none of them would be trimmed. The word embeddings, with the configured dimension, are trained simultaneously with the rest of the neural network parameters. Once the above steps have been completed, the neural network can be trained, setting the maximum number of words to be considered in the tokenizer and the word embedding dimensions. The neural network architecture is based on an LSTM, as shown in Figure 1, and it was trained using the categorical cross-entropy as the loss function.

Figure 1: Neural network architecture based on word embeddings and LSTM.
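As an illustration of this pipeline, the following is a minimal sketch written with TensorFlow/Keras. The library itself is not named in the paper, so it is an assumption, as are the LSTM size, optimizer, training schedule and stop-word source; the vocabulary size, embedding dimension and 1000-word sequence length correspond to the values discussed here and tuned in Section 4.

```python
# Minimal sketch of the word-embedding + LSTM approach (Section 3.1).
# LSTM size, optimizer, epochs and stop-word list are assumptions.
import unicodedata
import tensorflow as tf
from tensorflow.keras import layers

MAX_WORDS = 3000      # tokenizer vocabulary size (tuned in Section 4)
EMBED_DIM = 10        # word-embedding dimension (tuned in Section 4)
MAX_LEN = 1000        # maximum sequence length per author (Section 3.1)

def preprocess(tweets, stopwords):
    """Aggregate one author's tweets into a single text: remove accents,
    capital letters, double spaces and stop-words."""
    text = " ".join(tweets).lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return " ".join(t for t in text.split() if t not in stopwords)

def build_and_train(texts_train, y_train, texts_val, y_val):
    # texts_*: one aggregated string per author; y_*: one-hot labels (n, 2)
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_WORDS)
    tokenizer.fit_on_texts(texts_train)              # keep only the top words
    x_train = tf.keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(texts_train), maxlen=MAX_LEN)
    x_val = tf.keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(texts_val), maxlen=MAX_LEN)

    model = tf.keras.Sequential([
        layers.Embedding(input_dim=MAX_WORDS, output_dim=EMBED_DIM,
                         input_length=MAX_LEN),      # embeddings trained jointly
        layers.LSTM(32),                             # assumed number of units
        layers.Dense(2, activation="softmax"),       # hater vs. non-hater
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=20, batch_size=16)              # assumed training schedule
    return tokenizer, model
```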
3.2. Stylistic features and classical machine learning algorithms

To obtain the model that discriminates between a hater and a non-hater, a set of stylistic features was calculated for each tweet independently. These characteristics were divided into three groups: pattern-related, word-related and emoji-related features. The first group includes, among others, the number of occurrences of certain patterns in the texts (such as hashtags, URLs or retweets) and the number of certain characters (such as symbols or letters). Word-related features include counts of particular word categories, such as nouns, verbs or adjectives. Both feature sets were calculated using regular expressions with the RegEx Python module [11], together with the "en_core_web_sm" model for the English dataset and the "es_core_news_sm" model for the Spanish dataset from the spaCy Python library [12], used for lemmatization and identification of word categories. Regarding the emojis, they were analyzed and grouped into different categories following the advertools Python library [13]. The ratio between the unique emojis and the total emojis in the tweet was also included. The total set of 37 characteristics, referred to here as handcrafted features, is shown in Table 1.

Table 1
Stylistic features extracted per tweet, divided into three groups: pattern, word and emoji related features.

Pattern-related features
• Retweets
• Mentioned users
• URLs
• Hashtags
• Laugh expressions
• Symbols
• Arousal symbols [¿?¡!]
• Capital letters
• Total letters

Word-related features
• Stopwords
• Adjectives
• Nouns
• Proper nouns
• Verbs
• Repeated words
• Total words
• Letters/words

Emoji-related features
• Ratio unique / total emojis
• Face-affection
• Face-concerned
• Face-costume
• Face-glasses
• Face-hand
• Face-negative
• Face-neutral-skeptical
• Face-sleepy
• Face-smiling
• Face-tongue
• Face-unwell
• Body-parts
• Emotion
• Gender
• Hand-fingers-closed
• Hand-fingers-partial
• Hand-single-finger
• Hands
• Person-gesture

Once the stylistic features were calculated for each tweet, four statistics (mean, standard deviation, minimum and maximum) were computed over all the tweets of the same user. As a result, a vector of 148 stylistic features was obtained for each author.
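As an illustration, the sketch below computes a small subset of the pattern- and word-related features per tweet with regular expressions and spaCy, and aggregates them into the four per-author statistics. The regular expressions, the feature subset and the aggregation code are illustrative assumptions rather than the exact implementation, and the emoji-related features obtained with advertools are omitted.

```python
# Illustrative extraction of a few per-tweet stylistic features and their
# aggregation into per-author statistics (Section 3.2). Regular expressions
# and the feature subset are assumptions; emoji features are omitted.
import re
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")   # "es_core_news_sm" for Spanish

def tweet_features(tweet):
    doc = nlp(tweet)
    return {
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "mentioned_users": len(re.findall(r"@\w+", tweet)),
        "urls": len(re.findall(r"https?://\S+", tweet)),
        "capital_letters": sum(c.isupper() for c in tweet),
        "arousal_symbols": len(re.findall(r"[¿?¡!]", tweet)),
        "nouns": sum(t.pos_ == "NOUN" for t in doc),
        "verbs": sum(t.pos_ == "VERB" for t in doc),
        "adjectives": sum(t.pos_ == "ADJ" for t in doc),
        "total_words": sum(t.is_alpha for t in doc),
    }

def author_features(tweets_by_author):
    """tweets_by_author: dict mapping author id -> list of tweets.
    Returns one row per author with mean/std/min/max of every feature."""
    rows = []
    for author, tweets in tweets_by_author.items():
        per_tweet = pd.DataFrame([tweet_features(t) for t in tweets])
        stats = per_tweet.agg(["mean", "std", "min", "max"])
        row = {f"{stat}_{feat}": stats.loc[stat, feat]
               for feat in per_tweet.columns for stat in stats.index}
        row["author"] = author
        rows.append(row)
    return pd.DataFrame(rows).set_index("author")
```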
Features were then standardized by subtracting the mean and dividing by the standard deviation computed on the development set, and these statistics were then applied to the test set.

As there were only 200 different users in the dataset, a feature reduction method was applied to reduce the number of characteristics. First, Pearson's correlation matrix was calculated, and highly correlated features (r > 0.95) were eliminated. Then, a filter method was implemented to avoid overfitting. It consisted of calculating the area under the ROC curve for each characteristic and removing those with values close to 0.5, which means that they were not relevant for the classification task. With this method, 50% of the features were eliminated, keeping the most informative ones. Finally, sequential backward selection was applied to determine the optimal combination of N features for classification, with N ranging from 10 up to a threshold (T) on the maximum number of allowed features, set to 15, 20 or 30. This selection method iteratively computes a criterion function for a given machine learning classification algorithm using a cross-validation strategy. In each iteration, each feature is removed in turn, creating as many candidate subsets as there are features, each with one feature fewer. For each of them, a machine learning model is trained, and the criterion function is recalculated with cross-validation. Based on these results, the feature associated with the best performing model is removed, since removing it yielded the best result and, therefore, it is the one that helps the least in the classification. This process, called feature ablation, is repeated until 10 features are left. In this work, we used accuracy as the criterion function and stratified K-fold cross-validation with five folds as the cross-validation strategy.

Regarding the machine learning classification algorithms, the following were chosen: Support Vector Classification (SVC), K-Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest (RF) and Decision Tree (DT). Each of these algorithms was used both in the sequential backward selection, with default hyperparameters, and in a subsequent hyperparameter tuning step. In the latter, the same cross-validation strategy as in the feature selection method was used for the different hyperparameter combinations shown in Table 2. Finally, the test set was transformed by keeping only the selected features and applying the standardization with the development set statistics. The machine learning model was then applied with the chosen hyperparameters and the predictions were obtained.

Table 2
Hyperparameter sets for the implemented machine learning models, with the default values used in cross-validation indicated with "(default)".

SVC
• Kernel: Radial Basis Function (default), Sigmoid
• Gamma: 0.001, 0.01, 0.1, 1, 'auto', 'scale' (default)
• C: 1 (default), 10, 100, 1000

KNN
• Number of neighbours: 1, 3, 5 (default), 7
• Weights: Uniform (default), Distance
• Metric: Euclidean, Manhattan, Minkowski (default)

LR
• Penalty: l2
• C: 1 (default), 20 logarithmically scaled values between -4 and 4
• Solver: liblinear, lbfgs (default)

RF
• Number of trees: 100 (default), 200, 300, 400, 500
• Maximum depth of the tree: 2, 4, 6, 8, 10, unlimited (default)
• Number of features considered for the best split: sqrt(N) (default), log2(N), where N is the total number of features

DT
• Maximum depth of the tree: 2, 4, 6, 8, 10, unlimited (default)
• Function to measure the quality of a split: entropy, Gini impurity (default)
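A rough scikit-learn sketch of this reduction and tuning pipeline is given below for the logistic regression case. The correlation and AUC filters follow the description above, but the AUC margin, the tie-breaking choice of which correlated feature to drop, the use of scikit-learn's SequentialFeatureSelector with a single fixed N (the paper instead explores N from 10 up to the threshold T), and the grid taken from Table 2 are all assumptions rather than the exact implementation.

```python
# Sketch of the feature reduction and model selection pipeline (Section 3.2),
# shown for the logistic regression case. Thresholding details are assumptions.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler

def reduce_and_tune(X_dev, y_dev, n_features=18, auc_margin=0.05):
    # X_dev: DataFrame of 148 per-author features; y_dev: binary labels
    # 1) standardize with development-set statistics
    scaler = StandardScaler().fit(X_dev)
    X = pd.DataFrame(scaler.transform(X_dev), columns=X_dev.columns)

    # 2) drop one feature of every highly correlated pair (r > 0.95)
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

    # 3) drop features whose single-feature ROC AUC is close to 0.5
    keep = [c for c in X.columns
            if abs(roc_auc_score(y_dev, X[c]) - 0.5) > auc_margin]
    X = X[keep]

    # 4) sequential backward selection with 5-fold stratified CV accuracy
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    sfs = SequentialFeatureSelector(LogisticRegression(), direction="backward",
                                    n_features_to_select=n_features,
                                    scoring="accuracy", cv=cv)
    sfs.fit(X, y_dev)
    selected = X.columns[sfs.get_support()]

    # 5) hyperparameter tuning over the LR grid of Table 2
    grid = GridSearchCV(LogisticRegression(solver="liblinear"),
                        {"C": np.logspace(-4, 4, 20)},
                        scoring="accuracy", cv=cv)
    grid.fit(X[selected], y_dev)
    return scaler, selected, grid.best_estimator_
```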
4. Experimental results

The following sections summarize the results obtained with the Spanish and English datasets and detail the final models chosen for each of them.

4.1. Spanish dataset

As mentioned above, two approaches were evaluated for this dataset. First, the word embedding model described in Section 3.1 was trained using different combinations of parameters to obtain the best configuration. Table 3 shows the accuracy results obtained on the test set when modifying the maximum number of dictionary words to be tokenized between 1000, 2000, 3000 and 4000, keeping the embedding dimension constant.

Table 3
Neural network results varying the maximum number of vocabulary words for the Spanish dataset.

Max number of words    Embedding dimension    Test accuracy
1000                   10                     0.65
2000                   10                     0.70
3000                   10                     0.80
4000                   10                     0.70

Based on the results in Table 3, the maximum number of words was set at 3000. Then, the embedding dimension was varied between 5, 10 and 15. The results are shown in Table 4.

Table 4
Neural network results varying the word embedding dimensions for the Spanish dataset.

Max number of words    Embedding dimension    Test accuracy
3000                   5                      0.55
3000                   10                     0.80
3000                   15                     0.70

The experimentation conducted showed that the best performing network configuration consisted of a maximum of 3000 words considered in the tokenizer and a 10-dimensional embedding, achieving 80% accuracy.

Despite these good results, the methodology described in Section 3.2 was also used to obtain a hater versus non-hater classifier based on stylistic features. The results are shown in Table 5, which indicates the machine learning model and the number of features used by each model (N-features). It also includes the following evaluation metrics: the cross-validation accuracy (CV-acc), the test accuracy (Test-acc), and the true positive and true negative rates on the test set (Test-TPR and Test-TNR, respectively). Only the models with the feature selection and hyperparameters that provided the best results have been included, rather than all the combinations tested.

Table 5
Results of the classical machine learning models for the Spanish dataset.

Model    N-features    CV-acc         Test-acc    Test-TPR    Test-TNR
SVC      15            0.80 ± 0.09    0.70        0.90        0.50
KNN      12            0.80 ± 0.05    0.70        0.80        0.60
LR       18            0.80 ± 0.07    0.80        0.90        0.70
RF       14            0.79 ± 0.06    0.75        0.90        0.60
DT       15            0.71 ± 0.05    0.70        0.70        0.70

The highest accuracy was 80%, as with the word embedding model. This score was achieved with the logistic regression, both in cross-validation on the development set and on the test set, using the features listed in Table 6.

Table 6
Selected features in the LR model for the Spanish dataset.

Pattern-related features
• Mean mentioned users
• Std mentioned users
• Mean URLs
• Mean hashtags
• Std hashtags
• Mean arousal symbols
• Mean symbols
• Mean capital letters

Word-related features
• Std nouns
• Max verbs
• Std total letters
• Mean letters/words
• Std letters/words

Emoji-related features
• Mean emoji face-affection
• Std emoji face-affection
• Mean emoji face-concerned
• Std emoji face-concerned
• Mean emoji face-smiling

As a last step, since both approaches achieved high accuracies, an ensemble of the two best models was built. The logistic regression and word embedding scores were combined using the sum rule with a weight alpha associated with the score of each approach, as shown in equation (1), where sc_c is the combined score, sc_lr is the score from the logistic regression, sc_we is the score from the word embedding, and alpha is a weight in the range [0, 1].

sc_c = α · sc_we + (1 − α) · sc_lr    (1)

To find the best alpha estimate, values between 0 and 1 were tested in increments of 0.05 on the development set. The alpha value that achieved the highest accuracy was 0.85, which reached 86% accuracy on the development set. The ensemble of the logistic regression model and the word embedding model was finally applied to the test set provided in the task, achieving 83% accuracy.
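The search for alpha amounts to a simple grid over the development-set scores, as sketched below with illustrative variable names; the 0.5 decision threshold on the combined score is an assumption.

```python
# Grid search over the ensemble weight alpha of equation (1), in steps of 0.05.
# score_we / score_lr are the per-author hater scores of the two models on the
# development set; y_dev are the true labels (illustrative names).
import numpy as np
from sklearn.metrics import accuracy_score

def best_alpha(score_we, score_lr, y_dev):
    best = (0.0, 0.0)
    for alpha in np.arange(0.0, 1.0001, 0.05):
        combined = alpha * score_we + (1 - alpha) * score_lr
        acc = accuracy_score(y_dev, (combined >= 0.5).astype(int))  # assumed threshold
        if acc > best[1]:
            best = (alpha, acc)
    return best   # e.g. (0.85, 0.86) for the Spanish development set
```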
4.2. English dataset

As with the Spanish tweets, word embeddings were first tested to solve the classification task for the English dataset. The neural network was adapted to the dataset by modifying the maximum number of words to be considered in the tokenizer between 1000, 2000, 3000 and 4000, keeping the embedding dimension constant. The results are shown in Table 7.

Table 7
Neural network results varying the maximum number of vocabulary words for the English dataset.

Max number of words    Embedding dimension    Test accuracy
1000                   10                     0.45
2000                   10                     0.50
3000                   10                     0.40
4000                   10                     0.55

Although the results in Table 7 were not as good as expected, additional experiments were carried out by varying the embedding dimension for the best configuration found. The results are shown in Table 8.

Table 8
Neural network results varying the word embedding dimensions for the English dataset.

Max number of words    Embedding dimension    Test accuracy
4000                   5                      0.45
4000                   10                     0.55
4000                   15                     0.50

Varying the embedding dimension did not provide better results either. Therefore, it was decided not to continue in this direction and to focus on the second approach, based on classical machine learning classifiers. Following the pipeline described in Section 3.2, the results shown in Table 9 were achieved. The table indicates the machine learning model and the number of features used by each model (N-features). It also includes the following evaluation metrics: the cross-validation accuracy (CV-acc), the test accuracy (Test-acc), and the true positive and true negative rates on the test set (Test-TPR and Test-TNR, respectively).

Table 9
Results of the classical machine learning models for the English dataset.

Model    N-features    CV-acc         Test-acc    Test-TPR    Test-TNR
SVC      13            0.72 ± 0.06    0.70        0.50        0.90
KNN      19            0.69 ± 0.08    0.55        0.80        0.30
LR       26            0.71 ± 0.05    0.55        0.30        0.80
RF       12            0.68 ± 0.08    0.65        0.60        0.70
DT       29            0.67 ± 0.04    0.65        0.60        0.70

According to these results, the best models were the SVC and the RF, which obtained 70% and 65% test accuracy, respectively. The features used by each model are listed in Table 10 for the SVC and in Table 11 for the RF.

Table 10
Selected features in the SVC model for the English dataset.

Pattern-related features
• Mean retweets

Word-related features
• Std repeated words
• Max total words
• Std total words
• Std letters/words ratio

Emoji-related features
• Std unique emojis
• Mean emoji face-affection
• Max emoji face-affection
• Mean emoji face-hand
• Max emoji face-hand
• Mean emoji face-sleepy
• Std emoji face-unwell
• Std emoji hands

Table 11
Selected features in the RF model for the English dataset.

Pattern-related features
• Mean retweets
• Mean URLs
• Std mentioned users

Word-related features
• Std stopwords
• Max proper nouns
• Max nouns
• Max verbs

Emoji-related features
• Mean emoji face-costume
• Std emoji face-unwell
• Mean emoji face-neutral-skeptical
• Std emoji hands
• Std emoji face-sleepy

Since the classical machine learning models based on stylistic features obtained better results, it was decided to create an ensemble of the two best models.
The final prediction was obtained by combining the SVC and RF scores using the sum rule with a weight alpha associated with the score of each model, as previously done for the Spanish ensemble. The combination is shown in equation (2), where sc_c is the combined score, sc_svc is the score from the SVC, sc_rf is the score from the RF, and alpha is a weight in the range [0, 1].

sc_c = α · sc_rf + (1 − α) · sc_svc    (2)

To find the best alpha estimate, values between 0 and 1 were tested in increments of 0.05 on the development set. The alpha value that achieved the highest accuracy was 0.65, which reached 56% accuracy on the development set. The ensemble of the SVC and the RF was finally applied to the test set provided in the task, achieving 58% accuracy.

5. Conclusions and future work

This paper presented the ensemble models proposed for the PAN 2021 Profiling Hate Speech Spreaders on Twitter shared task at CLEF 2021. The problem was addressed in two languages, namely Spanish and English, and two approaches were explored for each of them; the evaluations of the final ensembles in the task ranking are summarized in Table 12. For the Spanish dataset, an ensemble was created from a neural network with word embeddings and a logistic regression. The first was trained on all the tweets grouped by subject, whereas the second was based on statistics obtained from stylistic features computed for each of the user's tweets. This approach achieved 83% accuracy on the provided test set. Regarding the English dataset, an ensemble of a support vector classifier and a random forest, both based on statistics of stylistic features, achieved 58% accuracy on the provided test set.

Table 12
Accuracy on the test data provided in the shared task for the English and Spanish models, and the mean of both, used for the task ranking.

Approach            Accuracy (%)
English ensemble    58.0
Spanish ensemble    83.0
Average             70.5

Overall, the results showed that stylistic characteristics are important features to consider when identifying hate speech spreaders, as they helped to improve the results of the word embeddings in Spanish, and they obtained better results than the word embeddings for the English dataset. However, the task of detecting hate speech spreaders turned out to be very difficult for the English dataset: the best accuracy was only 70% on our test partition, which dropped to 58% on the test set provided in the shared task. Word embeddings were also investigated for this language, but they were not included in the final model because, contrary to Spanish, their results were not accurate. The difference in accuracy between English and Spanish may indicate that users have different hate-spreading behaviors in different cultures. Future work will include adding more features, such as TF-IDF based n-grams for both words and characters.

6. References

[1] F. Poletto, V. Basile, M. Sanguinetti, C. Bosco, V. Patti, Resources and benchmark corpora for hate speech detection: a systematic review, Language Resources and Evaluation (2020) 1–47.
[2] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM Computing Surveys (CSUR) 51 (2018) 1–30.
[3] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter Task at PAN 2021, in: CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[4] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E.
Zangerle, Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021.
[5] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[6] H. J. Jarquín-Vásquez, M. Montes-y-Gómez, L. Villaseñor-Pineda, Not all swear words are used equal: Attention over word n-grams for abusive language identification, in: Mexican Conference on Pattern Recognition, Springer, 2020, pp. 282–292.
[7] S. Zimmerman, U. Kruschwitz, C. Fox, Improving hate speech detection with deep learning ensembles, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[8] F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, G. Inches, Overview of the author profiling task at PAN 2013, in: CLEF Conference on Multilingual and Multimodal Information Access Evaluation, CELCT, 2013, pp. 352–365.
[9] F. Rangel, P. Rosso, M. Potthast, M. Trenkmann, B. Stein, B. Verhoeven, W. Daelemans, et al., Overview of the 2nd author profiling task at PAN 2014, in: CEUR Workshop Proceedings, volume 1180, CEUR-WS.org, 2014, pp. 898–927.
[10] F. Rangel, P. Rosso, M. Potthast, B. Stein, Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter, Working Notes Papers of the CLEF (2017) 1613–0073.
[11] G. Van Rossum, The Python Library Reference, release 3.8.2, Python Software Foundation (2020).
[12] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, To appear 7 (2017) 411–420.
[13] E. Dabbas, advertools: productivity and analysis tools to scale your online marketing, 2021. URL: https://advertools.readthedocs.io/en/master/readme.html.