=Paper=
{{Paper
|id=Vol-3180/paper-198
|storemode=property
|title=BERT Sentence Embeddings in different Machine Learning and Deep Learning Models for Author Profiling applied to Irony and Stereotype Spreaders on Twitter
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-198.pdf
|volume=Vol-3180
|authors=Claudia Gómez,Daniel Parres
|dblpUrl=https://dblp.org/rec/conf/clef/GomezP22
}}
==BERT Sentence Embeddings in different Machine Learning and Deep Learning Models for Author Profiling applied to Irony and Stereotype Spreaders on Twitter==
Daniel Parres†, Claudia Gomez†
Universitat Politècnica de València, Camí de Vera, s/n, 46022 València, Spain
† These authors contributed equally.
dparres@prhlt.upv.es (D. Parres); cgomros@posgrado.upv.es (C. Gomez)

Abstract

Irony detection is an interesting problem with several fields of application, both for managing social media and for studying people's behavior and opinions. This paper focuses on classifying Twitter author profiles as ironic or non-ironic based on their tweets. For this purpose, different types of feature extraction are applied to each author's tweets, from classical techniques to recent ones that are part of the state of the art. Once the feature extraction has been performed, classical Machine Learning techniques such as SVM, Random Forest, and Logistic Regression are applied, as well as more recent methods such as artificial neural networks with Self-Attention mechanisms. Finally, a discussion is opened on the methods used and on what each technique can contribute to solving this task. As in most natural language tasks where Embedding techniques are applied, new frontiers of study, analysis, and application open up. This study therefore provides several starting points for the development of robust systems for the detection of ironic and non-ironic authors.

Keywords: Author profiling, Sentence Embeddings, Machine Learning, Deep Learning, BERT, Irony

1. Introduction

This work focuses on profiling authors as ironic or non-ironic based on their tweets, emphasizing authors who employ irony to spread stereotypes. For this purpose we use the data provided by PAN'22 [1], composed of 420 Twitter profiles with 200 tweets each, where every user is labeled as ironic or non-ironic. The task of classifying authors as ironic or non-ironic based on their tweets is particularly relevant in the current context, where anyone has access to social media and the freedom to share and spread their ideas. Because of this, identifying which authors are spreading comments that can be considered harmful is important both for managing and moderating social media and for sociological or psychological studies.

Thanks to the emergence of different Artificial Intelligence techniques and algorithms, this task can be addressed with Machine Learning. This work first performs a study of the data, analyzing the tweets and preprocessing them. Once the data is studied, different classical techniques and state-of-the-art algorithms such as artificial neural networks are applied. Finally, the results obtained are evaluated and discussed.

2. Related Work

Since 2013 there have been works focused on irony detection; an example is [2], which describes a set of textual features for recognizing irony at the linguistic level, focusing on short texts such as tweets. The proposed model is evaluated along two dimensions: representativeness and relevance.
In [3], encouraging results are reported for deriving pragmatic contextual models for irony detection, which opens an approach that goes beyond the use of hand-crafted features. On the other hand, in [4], a shared task on detecting ironic tweets in Arabic, classical feature-based models are shown to be superior to neural ones.

The task of author profiling is applied to different areas, such as detecting hate speech [5] or the gender [6] and age [7] of authors by analyzing their tweets, but this work focuses on detecting whether an author is ironic or not. In author profiling, different features have been used for years, such as text length, cosine similarity ranking, word retrieval, Okapi BM25 ranking, and NRC emotions, as proposed in [8]. Following this feature-based approach, [9] uses classical models and a Multilayer Perceptron for irony detection, together with statistical techniques such as counts, POS tags, textual markers, and lexicon-based features such as WordNet similarity. The best-performing models are SVM, MLP, and Random Forest, while the best-performing features are textual markers, sentiment score, and polarity value. In [10] and [11], the techniques and preprocessing that obtained the best results in the PAN'19 and PAN'20 shared tasks on bot, gender, and fake news spreader profiling are presented, highlighting the extensive use of BERT word embeddings [12] in neural models. From the literature to date it can be seen that in author profiling problems classical models tend to perform better than neural ones, although usually with the help of BERT word embeddings and without preprocessing, given that preprocessing techniques usually do not improve accuracy rates, as discussed through multiple experiments in [13].

3. Methodology

This section is divided into two subsections. The first focuses on the analysis of the data: the balance of the two classes, which determines the most representative metric for training the algorithms, whether there are repeated tweets, and the most used words in each class. The second subsection presents the treatment of the data and the different models used. To compare the models, 10-Fold Cross-Validation is used, with 90% for training and 10% for test in the classical models, and 80% for training, 10% for validation, and 10% for test in the neural models.

3.1. Data Analysis

The data provided by PAN'22 is composed of 420 users with 200 tweets each, where each profile is labeled as ironic or non-ironic. The distribution of the two classes is 210 ironic and 210 non-ironic authors, so the problem is balanced and accuracy is a good metric for the analysis and comparison of models. Another feature of the dataset is that USER, HASHTAG, and URL tags are used to refer to users, hashtags, and URLs within the tweets themselves.

With 200 tweets for each of the 420 users, there is a total of 84,000 tweets, of which 749 are repeated. Since the repeated tweets are so few in proportion to the dataset, they do not negatively affect the performance of the models. On the other hand, the most used words in each class have been analyzed, to observe whether any pattern repeats significantly in ironic or non-ironic authors. Eliminating USER, HASHTAG, URL, and stopwords, the texts have been used to construct the word clouds for ironic and non-ironic authors (Figure 1).
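To make this analysis concrete, a minimal sketch of the duplicate count and the word-cloud construction, assuming the dataset has already been loaded into a per-author structure (the loader, variable names, and label strings are hypothetical, not part of the original pipeline):

```python
from collections import Counter
from wordcloud import WordCloud, STOPWORDS

# Hypothetical structure: {author_id: (label, [200 tweet strings])},
# assumed to be loaded from the PAN'22 files beforehand.
authors = load_pan22_dataset()  # placeholder loader, not shown here

all_tweets = [t for _, tweets in authors.values() for t in tweets]
counts = Counter(all_tweets)
repeated = sum(c - 1 for c in counts.values() if c > 1)
print(f"{len(all_tweets)} tweets in total, {repeated} repeated")

# Word clouds per class, dropping the dataset's placeholder tags and stopwords.
stop = STOPWORDS | {"USER", "HASHTAG", "URL"}
for target in ("I", "NI"):  # ironic / non-ironic (label names assumed)
    text = " ".join(t for label, tweets in authors.values()
                    if label == target for t in tweets)
    WordCloud(stopwords=stop).generate(text).to_file(f"wordcloud_{target}.png")
```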
There is hardly any difference in the most frequently used words for each type of author, although in non-ironic authors the word "women" is quite frequent, while in ironic authors the word "Trump" is one of the most repeated.

Figure 1: Word clouds for ironic (left) and non-ironic (right) user tweets.

3.2. Models and Evaluation

Before presenting the different Machine Learning models for the author profiling task, it is necessary to study which forms of tweet representation best fit the task. Three different approaches have been compared: the classical TF-IDF vectorizer, the BERT tokenizer, and the BERT Sentence Embedding. The experiments performed are presented in Table 1, where each representation has been tested with Support Vector Machines (SVM), a Multilayer Perceptron (MLP), and a Recurrent Neural Network (RNN). As mentioned in the Methodology section, the experiments were performed using 10-Fold Cross-Validation.

The results in Table 1 show that the best results are obtained with the BERT Sentence Embedding. Since the model is trained to capture the relationships between words taking context into account, its performance is much higher than that of the TF-IDF Vectorizer and the BERT Tokenizer. Therefore, the following experiments use the BERT Sentence Embedding for feature extraction.

Table 1: Accuracy of the different feature extraction methods for tweets.

Model | TF-IDF Vectorizer | BERT Tokenizer | BERT Sentence Embedding
SVM | 0.67 | 0.72 | 0.93
MLP (5 layers) | 0.76 | 0.79 | 0.92
RNN (Bidirectional LSTM) | 0.82 | 0.83 | 0.93
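As an illustration, a minimal sketch of this feature extraction step using the sentence-transformers library. The paper does not name the exact checkpoint, so `bert-base-nli-mean-tokens` is an assumption (note that bert-base checkpoints produce 768-dimensional vectors); the mean pooling over an author's 200 tweets corresponds to the averaging described later for the classical models and the MLP:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Checkpoint is an assumption; the paper only states that BERT
# Sentence Embeddings are used, not which pretrained model.
encoder = SentenceTransformer("bert-base-nli-mean-tokens")

def author_matrix(tweets: list[str]) -> np.ndarray:
    """One embedding per tweet: shape (n_tweets, dim)."""
    return encoder.encode(tweets)

def author_vector(tweets: list[str]) -> np.ndarray:
    """Mean-pool the tweet embeddings into a single author vector."""
    return author_matrix(tweets).mean(axis=0)

# Hypothetical usage: the averaged vectors feed the classical models and
# the MLP, while the recurrent models consume the full (200, dim) matrix.
# X = np.stack([author_vector(tweets) for _, tweets in authors.values()])
```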
Having established that the BERT Sentence Embedding is the best text representation, it is interesting to analyze whether preprocessing the USER, HASHTAG, and URL tags provides any improvement compared to no preprocessing. The removal of these terms causes a worsening of 1% in the accuracy of the models in Table 1, which may be because ironic users make extensive use of hashtags, and this helps the classifiers find patterns and improve irony detection.

Regarding the Machine Learning models used, two different scenarios are presented: on one hand, classical techniques are applied, and on the other, more advanced techniques such as artificial neural networks. For the classical models, about 10,000 experiments have been carried out comparing techniques such as Decision Trees, Logistic Regression, Gaussian Naive Bayes, Multinomial Naive Bayes, Support Vector Machines, Bernoulli Naive Bayes, and K-Nearest Neighbors, with different preprocessing techniques on the BERT Sentence Embedding vector such as Binarizer, Feature Agglomeration, MaxAbsScaler, MinMaxScaler, Normalizer, Principal Component Analysis, RBFSampler, RobustScaler, StandardScaler, and ZeroCount. The most significant results of the different combinations are presented in Table 2, where the best accuracy obtained is 0.94, using Logistic Regression with an L2 Normalizer, a RobustScaler, and a Variance Threshold of 0.005.

Table 2: Most relevant experiments with classical algorithms.

Model | Accuracy
Decision Tree (gini, max_depth=3) | 0.89
K-Nearest Neighbors (k=70, weights=distance, power=1) | 0.75
Gaussian Naive Bayes | 0.88
Multinomial Naive Bayes (alpha=1, fit_prior=False) | 0.88
Logistic Regression (penalty=l2, C=5, dual=False) | 0.94
Bernoulli Naive Bayes (alpha=100, fit_prior=True) | 0.90
Support Vector Machine (C=10, kernel=rbf) | 0.93
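As a sketch, the best classical configuration from Table 2 could be reproduced with a scikit-learn pipeline along the following lines; this is a reconstruction from the description above, not the authors' code, and the ordering of the preprocessing steps is assumed:

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, RobustScaler

# L2 normalization, robust scaling, a 0.005 variance threshold, and
# Logistic Regression with the parameters reported in Table 2.
clf = make_pipeline(
    Normalizer(norm="l2"),
    RobustScaler(),
    VarianceThreshold(threshold=0.005),
    LogisticRegression(penalty="l2", C=5, dual=False),
)

# 10-Fold Cross-Validation, as in the rest of the experiments.
# X, y: mean-pooled author vectors and labels, assumed prepared as above.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.2f}")
```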
Currently, the best-performing models are artificial neural networks, due to their great generalization capacity and high performance, so a wide family of neural models is applied to this task. Five different architectures have been developed: a 5-layer Multilayer Perceptron, a bilinear convolutional neural network, and three variants of bidirectional LSTM RNN: a plain one, one with 1D convolutions, and one with 1D convolutions and Self-Attention. Table 3 presents the different architectures with their corresponding accuracies. The neural methods obtain roughly the same results among themselves, surpassing on average the classical methods in Table 2. Despite this, the highest accuracy reached overall is still the 0.94 of the classical Logistic Regression algorithm.

Table 3: Most relevant experiments with neural algorithms.

Model / Architecture | Accuracy
Multilayer Perceptron (5 layers) | 0.92
Bilinear CNN (BERT Sentence Embedding to image) | 0.93
Bidirectional LSTM | 0.93
Conv. 1D + Bidirectional LSTM | 0.93
Bidirectional LSTM + Self-Attention + Conv. 1D | 0.92

The main idea when training the Multilayer Perceptron or the classical methods is to average the vectors calculated by BERT for each author. That is, as each author has 200 tweets, BERT returns 200 vectors of dimension 784, and these 200 vectors are averaged to obtain a single one. In the rest of the neural models, the 200 BERT vectors are used directly. An architecture that performs well for this task is the 1D convolutional network with Bidirectional LSTM layers, presented in Figure 2 and called Iro-Net for simplicity. Using all the BERT vectors is enriching: state-of-the-art neural models are able to achieve better results thanks to the larger amount of information. Because of this, it was decided to develop a neural network using convolution mechanisms together with a bidirectional LSTM layer, named Iro-Net and inspired by the model proposed in [14]. Moreover, in the field of Natural Language Processing, models that implement attention mechanisms, such as Transformers, are the standard; therefore, Iro-Net incorporates a Self-Attention layer after the recurrent LSTM layer. The Iro-Net architecture is shown in Figure 2 and has been designed specifically for this work.

3.3. Iro-Net Architecture

This section presents the Iro-Net architecture and its corresponding hyperparameters. The parametrization described corresponds to the best model submitted to TIRA [15].

Figure 2: Iro-Net Architecture.

This artificial neural network is composed of different layers that perform different functions. The first layer corresponds to the BERT Sentence Embedding and is in charge of handling the 200 tweets of each user and obtaining a useful representation of them. A Bidirectional LSTM layer is applied to the tweet Embeddings and builds a context-aware representation of the tweets. At the output of the Bidirectional LSTM there is a Self-Attention layer, in order to discriminate which parts of the input are the most important when detecting irony. The next layer is a 1D Convolution with a corresponding Max-Pooling layer, which performs well for pattern recognition. A residual connection from the output of the Bidirectional LSTM is added to the Max-Pooling output. Residual connections are widely used in Deep Learning; in this design, specifically, the residual achieves a more discriminative representation, which is why it was considered interesting to apply it. Finally, a Global Max-Pooling layer, a dense layer of 256 neurons, and a Softmax output in charge of calculating the probability of the author being ironic or not are added.

Regarding the hyperparameters, it is worth mentioning the following. All layers use Glorot weight initialization: the biases are initialized to 0 and the weights are drawn from a uniform distribution whose range takes into account the sizes of the adjacent layers. The Bidirectional LSTM, the Max-Pooling, and the penultimate 256-neuron layer use Dropout with rates between 0.2 and 0.3. In addition, all layers use L2 regularization with a value of 0.00005. As for the training hyperparameters, the Adam optimizer with a learning rate of 0.001 has been used. During training it is important to use a learning rate annealing technique, since lowering the learning rate can improve the performance of the model and avoid stagnating in local minima; the best-performing technique is Reduce Learning Rate On Plateau with a minimum learning rate of 0.000001. Finally, the training consists of 100 epochs with a batch size of 16.
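A minimal Keras sketch of this architecture and training setup, reconstructed from the description above; the LSTM size, the attention configuration, the convolution width, and the "same"-padding choices that make the residual connection type-check are assumptions, not the authors' exact design:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

N_TWEETS, EMB_DIM = 200, 768   # per-author sequence of tweet embeddings
UNITS = 128                    # LSTM units: an assumption, not stated in the paper
L2 = regularizers.l2(5e-5)     # L2 regularization of 0.00005 on all layers

inputs = layers.Input(shape=(N_TWEETS, EMB_DIM))

# Context-aware representation of the 200 tweet embeddings.
lstm = layers.Bidirectional(
    layers.LSTM(UNITS, return_sequences=True, dropout=0.2,
                kernel_regularizer=L2, kernel_initializer="glorot_uniform")
)(inputs)

# Self-Attention over the recurrent outputs.
att = layers.MultiHeadAttention(num_heads=1, key_dim=UNITS,
                                kernel_regularizer=L2)(lstm, lstm)

# 1D convolution + max-pooling; "same" padding and stride 1 keep the
# sequence length so the residual connection from the LSTM type-checks
# (the exact shapes of the original design are not specified).
conv = layers.Conv1D(2 * UNITS, kernel_size=3, padding="same",
                     activation="relu", kernel_regularizer=L2)(att)
pool = layers.MaxPooling1D(pool_size=2, strides=1, padding="same")(conv)
res = layers.Add()([lstm, pool])            # residual connection

x = layers.GlobalMaxPooling1D()(res)
x = layers.Dense(256, activation="relu", kernel_regularizer=L2)(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(2, activation="softmax")(x)   # ironic vs non-ironic

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Learning rate annealing and training setup described in Section 3.3.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(min_lr=1e-6)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, batch_size=16, callbacks=[reduce_lr])
```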
4. Discussion and Conclusion

As studied in the Related Work section, the use of BERT embeddings, and the contextual relationships they capture, is a key point in any natural language task. In this work, BERT Sentence Embeddings have been compared against the TF-IDF vectorizer and the BERT tokenizer, with the Sentence Embedding being the best representation by a significant margin. Furthermore, it is worth noting that the BERT Sentence Embedding representation is powerful enough that it does not need any kind of preprocessing. This opens new application frontiers, which in the not too distant future will allow Embedding techniques to be used for other types of problems in the Machine Learning field.

According to the reviewed bibliography, classical models mostly perform better than neural ones in the task of author profiling, but we think the cause of this is the small amount of data available. If more Twitter profiles labeled as ironic or non-ironic become available in the future, it would be interesting to replicate the experiments corresponding to Tables 2 and 3; for the moment, the best accuracy obtained is 0.94, using the BERT Sentence Embedding and a Logistic Regression classifier.

As already mentioned, the detection of ironic profiles is an important task with many applications, both for directing and managing social media and for sociological studies; continuous and in-depth improvement in this area is therefore necessary. In addition, another point to highlight is that despite how mainstream artificial neural networks have become, for the moment we should not forget classic algorithms such as SVM, Decision Trees, Logistic Regression, and so on.

Since the PAN contest for irony detection consists of two phases, in the first phase (early bird) it was decided to deliver two models, a classical one and a neural one. The classical algorithm delivered is Logistic Regression, with the parameterization presented in Table 2, and the neural network is Iro-Net, whose architecture is shown in Figure 2. In the results of the first phase, the better of the two models was Iro-Net with 96.11% accuracy, so in the second phase (final submission) Iro-Net will be presented as the definitive model. As future work and possible extensions, using BERT as a pre-trained model and fine-tuning different variations of it, such as DistilBERT, could bring improvements in the performance of ironic profile detection models based on tweets.

References

[1] O.-B. Reynier, C. Berta, R. Francisco, R. Paolo, F. Elisabetta, Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) at PAN 2022, 2022. URL: https://pan.webis.de/clef22/pan22-web/author-profiling.html.
[2] A. Reyes, P. Rosso, T. Veale, A multidimensional approach for detecting irony in Twitter, Language Resources and Evaluation 47 (2013) 239–268.
[3] J. Karoui, F. Benamara, V. Moriceau, N. Aussenac-Gilles, L. H. Belguith, Towards a contextual pragmatic model to detect irony in tweets, in: 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), 2015, pp. 644–650.
[4] B. Ghanem, J. Karoui, F. Benamara, V. Moriceau, P. Rosso, IDAT at FIRE2019: Overview of the track on irony detection in Arabic tweets, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019, pp. 10–13.
[5] F. Rangel, G. L. De la Peña Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling hate speech spreaders on Twitter task at PAN 2021, in: CLEF (Working Notes), 2021, pp. 1772–1789.
[6] F. Rangel, P. Rosso, PAN19 author profiling: Bots and gender profiling, 2019. URL: https://doi.org/10.5281/zenodo.3692340. doi:10.5281/zenodo.3692340.
[7] F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, G. Inches, Overview of the author profiling task at PAN 2013, in: CLEF Conference on Multilingual and Multimodal Information Access Evaluation, CELCT, 2013, pp. 352–365.
[8] E. R. Weren, A. U. Kauer, L. Mizusaki, V. P. Moreira, J. P. M. de Oliveira, L. K. Wives, Examining multiple features for author profiling, Journal of Information and Data Management 5 (2014) 266–266.
[9] E. Gose, Pattern recognition and image analysis (1997).
[10] F. Rangel, P. Rosso, Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling in Twitter, in: Working Notes Papers of the CLEF 2019 Evaluation Labs, volume 2380 of CEUR Workshop Proceedings, 2019.
[11] F. Rangel, A. Giachanou, B. H. H. Ghanem, P. Rosso, Overview of the 8th author profiling task at PAN 2020: Profiling fake news spreaders on Twitter, in: CEUR Workshop Proceedings, volume 2696, Sun SITE Central Europe, 2020, pp. 1–18.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[13] E. Alzahrani, L. Jololian, How different text-preprocessing techniques using the BERT model affect the gender profiling of authors, arXiv preprint arXiv:2109.13890 (2021).
[14] M. Polignano, M. de Gemmis, G. Semeraro, Contextualized BERT sentence embeddings for author profiling: The cost of performances, in: International Conference on Computational Science and Its Applications, Springer, 2020, pp. 135–149.
[15] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.