An Ensemble Machine Learning Classifier for Profiling Irony and Stereotype Spreaders on Twitter

Notebook for PAN at CLEF 2022

Zengyao Li, Zhongyuan Han*, Mingjie Huang, Leilei Kong

Foshan University, Foshan, China

Abstract

The Profiling Irony and Stereotype Spreaders on Twitter (profiling IROSTEREO) task is to decide, from an author's tweets, whether that author can be considered ironic. We treat this task as a binary text classification task. This paper proposes a feature extraction method based on a pre-trained language model and a classifier based on an ensemble of machine learning models. Our method achieves an accuracy of 0.9222 on the test set for this task.

Keywords

Pre-trained model, Classification, Irony and Stereotype Spreaders

1. Introduction

With irony, language is employed in a figurative and subtle way to mean the opposite of what is literally stated. In the case of sarcasm, a more aggressive type of irony, the intent is to mock or scorn a victim without excluding the possibility of being hurt [1]. Irony as a literary technique is widely used in online texts such as tweets. The Profiling Irony and Stereotype Spreaders on Twitter task (profiling IROSTEREO [1, 2]) is presented against this background.

Typically, in author analysis tasks such as Profiling Hate Speech Spreaders (HSSs) on Twitter [3], the context representing the positive or negative intent of the text is consistent. However, in the profiling IROSTEREO task, a text’s ironic intent is defined by its context incongruity. For example, in the phrase “I love being ignored”, irony is defined by the incongruity between the positive word “love” and the negative context of “being ignored” [4]. This task aims to identify, from their tweets, the Twitter authors who can be considered ironic. We use the pre-trained language model BERT [5] to extract features for this task and propose a classifier based on an ensemble of machine learning models.
The rest of this paper is organized as follows. Related work and methodology are discussed in Sections 2 and 3. The experimental setup and results are reported in Section 4, and Section 5 concludes the paper.

CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
EMAIL: lzy1512192979@gmail.com (Z. Li); hanzhongyuan@gmail.com (Z. Han) (*corresponding author); mingjiehuang007@163.com (M. Huang); kongleilei@fosu.edu.cn (L. Kong)
ORCID: 0000-0001-8472-4150 (Z. Li); 0000-0001-8960-9872 (Z. Han); 0000-0002-0889-5027 (M. Huang); 0000-0002-4636-3507 (L. Kong)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Work

The Profiling Hate Speech Spreaders on Twitter task [3], proposed by PAN last year, is similar to this task. We explore it as a text classification task. A previous method for profiling HSSs, that is, classifying an author as a Hate Speech Spreader (HSS) or not, uses a CNN based on a single convolutional layer [6]. In addition, hate speech spreader detection using n-grams and a voting classifier also achieved good results [7]. Deep modeling of latent representations based on the Transformer [8] model also works well for the profiling HSSs task [9].

Furthermore, we introduce some state-of-the-art methods related to the profiling IROSTEREO task. Zhang et al. [4] formulate irony detection as a transfer learning task in which supervised learning on irony-labeled text is enriched with knowledge transferred from external sentiment analysis resources. Importantly, they focus on identifying hidden, implicit incongruity without relying on explicit incongruity expressions.
In [10], the authors use two different interpretable methods to identify stereotypes about immigration: Transformer-based deep learning models and text masking techniques.

Presently, pre-trained language models are the mainstream models, and they have been shown to outperform other models on the evaluation metrics of most tasks. At the same time, traditional machine learning classifiers are simple and effective for binary classification tasks. Last year, participants in the profiling HSSs task chose only one of these two kinds of model to complete the task. So, is it feasible to combine pre-trained language models and traditional machine learning models? To answer this question, we also looked at related research on text classification. Hybrid methods exist in the literature, such as CNNs for extracting text features combined with SVMs for classification and prediction [6, 11]. Similar results can also be obtained with a CNN [6]. So, combining pre-trained language models and machine learning classifiers for training and prediction is worth trying.

3. Method

This section gives a brief overview of our model, training process, and prediction process. Our proposed model mainly consists of the following two parts:

1. Fine-tuning BERT for feature extraction
2. Ensemble Machine Learning Classifier

We describe these two parts in Sections 3.1 and 3.2. The training process is explained at the end of Section 3.2 and the prediction process in Section 4.4.

3.1. Fine-tuning BERT

Figure 1: The structure of the fine-tuned BERT (inside the red frame). We use the hidden-layer vector output by the last layer of BERT as the feature.

The raw text is preprocessed before training and prediction, as detailed in Section 4.2. Each preprocessed text is given the same label as its author and is then fed into the BERT model for fine-tuning.
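The labeling step described above can be illustrated with a minimal sketch. It is not the authors' released code: the function names `merge_tweets` and `build_labeled_texts`, the author identifiers, and the toy tweets are hypothetical. The grouping of eight tweets per text follows Section 4.2; the removal of unusual symbols is not reproduced here because the paper does not specify which symbols are removed, so only lowercasing and whitespace normalization are applied.

```python
from typing import Dict, List, Tuple


def merge_tweets(tweets: List[str], group_size: int = 8) -> List[str]:
    """Lowercase tweets and join every `group_size` consecutive tweets into one text."""
    # Assumption: whitespace normalization stands in for the unspecified symbol cleanup.
    lowered = [" ".join(t.lower().split()) for t in tweets]
    return [
        " ".join(lowered[i:i + group_size])
        for i in range(0, len(lowered), group_size)
    ]


def build_labeled_texts(
    tweets_by_author: Dict[str, List[str]],
    label_by_author: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Give every merged text the label of the author who wrote it."""
    examples = []
    for author, tweets in tweets_by_author.items():
        for text in merge_tweets(tweets):
            examples.append((text, label_by_author[author]))
    return examples


# Toy data: two hypothetical authors with 200 short tweets each.
tweets_by_author = {
    "author_a": [f"Tweet {i} from A" for i in range(200)],
    "author_b": [f"Tweet {i} from B" for i in range(200)],
}
label_by_author = {"author_a": 1, "author_b": 0}

examples = build_labeled_texts(tweets_by_author, label_by_author)
print(len(examples))  # 2 authors x 25 merged texts = 50
```

Each `(text, label)` pair produced this way corresponds to one training example for fine-tuning, so an author with 200 tweets contributes 25 examples that all share the author's label.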
Training was stopped when the accuracy on the validation set had not increased for two epochs, and the model with the highest accuracy was saved. In the feature extraction stage, we extract the 768-dimensional [CLS] token representation from the trained model. The structure of the fine-tuned BERT model is shown in Figure 1.

3.2. Ensemble Machine Learning Classifier (EMLC)

We build an Ensemble Machine Learning Classifier (EMLC) that integrates logistic regression (LR), random forest (RF), and support vector machine (SVM) models; its structure is shown in the red box in Figure 2. In Figure 2, x_i represents the input text. The fine-tuned BERT extracts its feature representation, which serves as the input to the EMLC. The probabilities output by the LR, RF, and SVM models are then averaged, and the class with the highest probability is taken as the final prediction y_i.

Figure 2: The structure of the EMLC.

The training process of our model is divided into two parts: training the BERT models and training the EMLCs. In the first part, we feed the five data splits (see Section 4.2 for details) into five initial BERT models and obtain five fine-tuned BERT models. In the second part, we pass the total training set through these five fine-tuned BERT models for feature extraction, resulting in five feature representations. These feature representations are then fed into five EMLCs for training, which completes the training of the model. The training process of the EMLC is shown in Figure 3. The x_i on the right side of the figure represents a text in the total training set.

Figure 3: The left side of the figure shows BERT k, fine-tuned on Data k. The right side shows the training process of the EMLC. The features of x_i are extracted by the fine-tuned BERT k model and then used to train the corresponding EMLC k (k = 1, 2, 3, 4, 5).

4. Experiment and Results

4.1. Dataset

The English dataset provided by the task organizers consists of two parts: a training set and a test set. The training set consists of 84,000 tweets from 420 authors; each author has one label and 200 tweets. The test set consists of 36,000 tweets from 180 authors, again with 200 tweets per author. Table 1 summarizes the dataset.

Table 1
Details of the profiling IROSTEREO datasets

Dataset         Authors   Tweets   Labels
Train dataset   420       84000    420
Test dataset    180       36000    None

4.2. Text Preprocessing

For each author's 200 tweets, we first remove some unusual symbols and strings, then convert all text to lowercase, and merge every eight consecutive tweets into one new text, so that each author gets 25 new texts. In addition, we divide the training set into five parts following the idea of 5-fold cross-validation, named Data1~5 (as shown in Figure 4). Each Data k is used to train an independent BERT model.

Figure 4: Construction of Data1~5. T stands for the training set, and V stands for the validation set.

4.3. Experimental Setting

In this work, the pre-trained language model chosen for the first part of our model is BERT-base (https://github.com/google-research/bert; L=12, H=768, A=12, Total Parameters=110M). Specifically, we use the HuggingFace (https://huggingface.co/) implementation called BertForSequenceClassification. During fine-tuning of the pre-trained model, we set batch_size=25 and use cross-entropy as the loss function. We choose AdamW as the optimizer, with the learning rate set to 1e-5. For the second part of the model, we employ an ensemble machine learning classifier of LR, RF, and SVM. For these three models, we use the default settings and set the “voting” parameter of the VotingClassifier to “soft”. We use the PyTorch framework for our experiments. Our source code is publicly available at https://github.com/Zero-Lzy/Pan_2022_Twitter.

4.4. Model Prediction Process

When making predictions, we first preprocess the text of the dataset, use the five fine-tuned BERT models to extract features from the text, and then input the extracted features into the corresponding EMLCs. The five EMLCs produce five labels for the text, and a hard vote (majority rule) over these five labels gives the final label of the text. The process of obtaining text labels is consistent with the training process, as shown on the right side of Figure 3, except that x_i now represents a preprocessed text in the test set.

What the task requires us to predict, however, is the author label. After preprocessing, each author has 25 texts, so 25 text labels are predicted. Based on these 25 text labels, we choose the class with the most votes as the final author label. The prediction process is shown in Figure 5.

Figure 5: Predicting the author label from the author’s preprocessed tweets.

4.5. Result

We conducted experiments with 5-fold cross-validation on the training set. The dataset was split into five folds as described in Section 4.2. Table 2 reports the accuracy obtained on the validation set of each fold, along with the arithmetic mean. As can be seen from Table 2, our cross-validation experiments achieved an average accuracy of 0.9976. We believe that our model is relatively reliable.

Table 2
Accuracy of the model under 5-fold cross-validation on the complete training set

Fold       1       2       3       4       5       Avg.
Accuracy   0.988   1.000   1.000   1.000   1.000   0.9976

Table 3
Results achieved by our model on the test set

Rank   Accuracy
40     0.9222

Finally, we compress the predictions for the test set and upload them to TIRA [12]. As reported on the PAN website, our accuracy on the test set provided by the organizers is 0.9222, as shown in Table 3. This is 0.0754 lower than the average accuracy on the validation sets.
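The prediction path of Sections 3.2 and 4.4 can be illustrated with a minimal sketch under stated assumptions. It uses only the Python standard library, whereas the paper performs the soft vote with the scikit-learn VotingClassifier; the function names and all probability values below are hypothetical toy values, not outputs of the authors' models.

```python
from collections import Counter
from typing import List, Sequence


def soft_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Average the class probabilities of several models and return the argmax class."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])


def hard_vote(labels: Sequence[int]) -> int:
    """Return the most frequent label (majority rule).

    With two classes and an odd number of votes (5 EMLCs, 25 texts) no tie can occur.
    """
    return Counter(labels).most_common(1)[0][0]


def predict_text(per_emlc_probs: Sequence[Sequence[Sequence[float]]]) -> int:
    """Soft vote inside each EMLC, then hard vote across the EMLCs."""
    return hard_vote([soft_vote(model_probs) for model_probs in per_emlc_probs])


def predict_author(text_labels: Sequence[int]) -> int:
    """Majority vote over the labels of one author's texts."""
    return hard_vote(text_labels)


# Toy values: for one text, each of the 5 EMLCs holds [P(class 0), P(class 1)]
# from its three models (LR, RF, SVM).
per_emlc_probs = [
    [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]],    # soft vote -> 0
    [[0.55, 0.45], [0.3, 0.7], [0.9, 0.1]],  # soft vote -> 0
    [[0.6, 0.4], [0.6, 0.4], [0.2, 0.8]],    # soft vote -> 1
    [[0.1, 0.9], [0.4, 0.6], [0.3, 0.7]],    # soft vote -> 1
    [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]],    # soft vote -> 0
]
text_label = predict_text(per_emlc_probs)
print(text_label)  # 0: three EMLCs vote 0, two vote 1

# Toy values: 25 text labels for one author, 14 of them equal to 1.
text_labels: List[int] = [1] * 14 + [0] * 11
author_label = predict_author(text_labels)
print(author_label)  # 1
```

The third EMLC in the toy data shows how the two voting rules differ: two of its three models favor class 0, yet the averaged probabilities favor class 1, so the soft vote returns 1.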
The gap between validation and test accuracy may arise because the validation sets are too small or because the data splits are highly correlated, which makes the validation accuracy too high. At the same time, the model may still overfit in ways that we have not resolved.

(Footnote to the hard vote in Section 4.4: the minority obeys the majority. For example, if three of the five labels are 0 and two are 1, the final label is 0.)

5. Conclusion

To address the profiling IROSTEREO task proposed at PAN 2022, this paper offers a feature extraction method based on pre-trained language models and a classifier based on an ensemble of machine learning models. To mitigate underfitting, we constructed multiple data splits for repeated training and voting. To mitigate overfitting, we used early stopping. In the end, we reached an accuracy of 0.9222 on the test set. Therefore, our proposed method is effective for this task.

6. Acknowledgments

This work is supported by the Natural Science Foundation of Guangdong Province, China (No. 2022A1515011544).

7. References

[1] Ortega-Bueno R., Chulvi B., Rangel F., Rosso P. and Fersini E. Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) at PAN 2022. In: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2022.
[2] J. Bevendorff, B. Chulvi, E. Fersini, et al. Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection. In: Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science. Springer, 2022.
[3] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter Task at PAN 2021, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[4] Zhang S., Zhang X., Chan J., Rosso P. Irony Detection via Sentiment-based Transfer Learning. In: Information Processing & Management, vol.
56, issue 5, pp. 1633-1644, 2019.
[5] Devlin J., Chang M.W., Lee K., et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171-4186, 2019.
[6] M. Siino, E. D. Nuovo, I. Tinnirello, M. L. Cascia, Detection of Hate Speech Spreaders using Convolutional Neural Networks, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[7] F. Balouchzahi, S. H. L., G. Sidorov, HSSD: Hate Speech Spreader Detection using N-grams and Voting Classifier, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[8] Vaswani A., Shazeer N., Parmar N., et al. Attention Is All You Need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 6000–6010, 2017.
[9] R. L. Tamayo, D. Castro-Castro, R. O. Bueno, Deep Modeling of Latent Representations for Twitter Profiles on Hate Speech Spreaders Identification. Notebook for PAN at CLEF 2021, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[10] Sánchez-Junquera J., Rosso P., Montes-y-Gómez M., Chulvi B. Masking and BERT-based Models for Stereotype Identification. In: Procesamiento del Lenguaje Natural (SEPLN), num. 67, pp. 83-94, 2021.
[11] Z. Wang, Z. Qu, Research on Web Text Classification Algorithm Based on Improved CNN and SVM, in: 2017 IEEE 17th International Conference on Communication Technology (ICCT), IEEE, pp. 1958–1961, 2017.
[12] M. Potthast, T. Gollub, M. Wiegmann, B. Stein. TIRA Integrated Research Architecture. In: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series. Springer, Berlin Heidelberg New York, September 2019.