An Ensemble Machine Learning Classifier for Profiling Irony and Stereotype Spreaders on Twitter Notebook for PAN at CLEF 2022

Zengyao Li, Zhongyuan Han (hanzhongyuan@gmail.com), Mingjie Huang (mingjiehuang007@163.com), Leilei Kong (kongleilei@fosu.edu.cn)

Foshan University, Foshan, China

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5-8, 2022, Bologna, Italy

Keywords: Pre-trained model, Classification, Irony and Stereotype Spreaders

ORCID: 0000-0001-8472-4150 (Z. Li), 0000-0001-8960-9872 (Z. Han), 0000-0002-0889-5027 (M. Huang), 0000-0002-4636-3507 (L. Kong)

The Profiling Irony and Stereotype Spreaders on Twitter (profiling IROSTEREO) task is to determine which authors can be considered ironic based on their tweets. We treat this task as a binary text classification task. This paper proposes a feature extraction method based on a pre-trained language model and a classifier based on an ensemble machine learning model. Our proposed method achieves an accuracy of 0.9222 on the test set for this task.

Introduction

With irony, language is employed in a figurative and subtle way to mean the opposite of what is literally stated. In the case of sarcasm, a more aggressive type of irony, the intent is to mock or scorn a victim without excluding the possibility of hurting them [1]. As a literary technique, irony is widely used in online texts such as tweets.

The task of Profiling Irony and Stereotype Spreaders on Twitter (profiling IROSTEREO [1,2]) is presented against this background. Typically, in author analysis tasks such as the profiling Hate Speech Spreaders (HSSs) on Twitter task [3], the context representing the positive or negative intent of the text is consistent. However, in the profiling IROSTEREO task, a text's ironic intent is defined by its context incongruity. For example, in the phrase "I love being ignored", irony is defined by the incongruity between the positive word "love" and the negative context of "being ignored" [4]. This task aims to find the Twitter authors that can be considered ironic on the basis of their tweets. We use the pre-trained language model BERT [5] to extract features for this task and propose a classifier based on an ensemble machine learning model.

The rest of this paper is organized as follows. Related work and methodology are discussed in Sections 2 and 3. The experimental setup and results are reported in Section 4, and Section 5 concludes the paper with a summary.

Related Work

The profiling Hate Speech Spreaders on Twitter task [3] proposed by PAN last year is similar to this task. We explore it as a text classification task. One previous method for profiling HSSs, that is, classifying an author as an HSS (Hate Speech Spreader) or not, uses a CNN based on a single convolutional layer [6]. In addition, hate speech spreader detection using n-grams and a voting classifier also achieved good results [7]. Deep modeling of latent representations based on the Transformer [8] model also works well for the profiling HSSs task [9].

We next introduce some state-of-the-art methods relevant to the profiling IROSTEREO task. Zhang et al. [4] formulate irony detection as a transfer learning task in which supervised learning on irony-labeled text is enriched with knowledge transferred from external sentiment analysis resources. Importantly, they focus on identifying the hidden, implicit incongruity without relying on explicit incongruity expressions [4]. In [10], the authors use two different interpretable methods to identify stereotypes about immigration: Transformer-based deep learning models and text masking techniques.

Pre-trained language models are currently the mainstream approach, and they have been shown to outperform other models on the evaluation metrics of most tasks. At the same time, traditional machine learning classifiers are simple and effective for binary classification tasks. Last year, participants in the profiling HSSs task used only one of these two kinds of model to complete the task. So, is it feasible to combine pre-trained language models with traditional machine learning models? To answer this question, we also reviewed related research on text classification. Hybrid methods exist in the literature, such as CNNs for extracting text features combined with SVMs for classification and prediction [6,11]. Similar results can also be obtained with a CNN [6]. Combining pre-trained language models and machine learning classifiers to train on and predict the data is therefore worth trying.

Method

This section gives a brief overview of our model, training process, and prediction process. Our proposed model mainly consists of the following two parts:

1. Fine-tuning Bert for feature extraction
2. Ensemble Machine Learning Classifier

We describe these two parts in Sections 3.1 and 3.2. Sections 3.3 and 3.4 explain our training and prediction processes in detail. The raw text is preprocessed before training and prediction, as detailed in Section 4.2. Each preprocessed text is given the same label as its author and is then fed into the Bert model for fine-tuning. When the accuracy on the validation set has not increased for two epochs, training is stopped and the model with the highest accuracy is saved. In the feature extraction stage, we extract the 768-dimensional representation of the [CLS] token from the trained model. The structure of the fine-tuned Bert model is shown in Figure 1.
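The stopping rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `train_with_early_stopping`, the `patience` parameter, and the accuracy values are hypothetical. The per-epoch validation accuracies are passed in as a plain list instead of being produced by an actual Bert training loop.

```python
def train_with_early_stopping(val_accuracies, patience=2):
    """Simulate early stopping over a list of per-epoch validation accuracies.

    Returns (best_epoch, best_accuracy, last_epoch_run). Training stops once
    the accuracy has not improved for `patience` consecutive epochs; the
    checkpoint from `best_epoch` is the one that would be kept.
    """
    best_acc = float("-inf")
    best_epoch = -1
    epochs_without_gain = 0
    last_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        last_epoch = epoch
        if acc > best_acc:
            # New best model: remember it and reset the counter.
            best_acc, best_epoch = acc, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # no improvement within `patience` epochs
    return best_epoch, best_acc, last_epoch


# Hypothetical accuracies: the last value is never reached.
result = train_with_early_stopping([0.90, 0.95, 0.94, 0.95, 0.99])
```

In this example the best checkpoint comes from epoch 1 and training halts after epoch 3, because epochs 2 and 3 do not exceed 0.95.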

Fine-tuning Bert

Ensemble Machine Learning Classifier (EMLC)

We build an Ensemble Machine Learning Classifier (EMLC) that integrates LR, RF, and SVM models; its structure is shown in the red box in Figure 2. In Figure 2, x_i represents the input text. The fine-tuned Bert extracts its feature representation, which serves as the input for training the EMLC. The probabilities output by the LR, RF, and SVM models are then averaged, and the class with the highest average probability is taken as the final prediction y_i.
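The averaging step can be illustrated with a minimal, dependency-free sketch. The function `soft_vote` and the probability values are hypothetical. In the actual system this step is handled by the VotingClassifier with soft voting mentioned in the experimental settings, not by hand-written code.

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several classifiers.

    prob_lists: one probability vector per classifier, all the same length.
    Returns (average_probabilities, index_of_most_probable_class).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return avg, label


# Hypothetical outputs [P(class 0), P(class 1)] of the three base models.
lr_probs = [0.45, 0.55]
rf_probs = [0.45, 0.55]
svm_probs = [0.90, 0.10]

avg_probs, y_pred = soft_vote([lr_probs, rf_probs, svm_probs])
```

Here two of the three models lean slightly towards class 1, but the averaged probabilities favour class 0 because the third model is much more confident. This is how soft voting can differ from a simple majority of predicted labels. Note that in scikit-learn, SVC only provides class probabilities when constructed with probability=True.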

Experiment and Results

Dataset

The English dataset provided by the task organizer consists of two parts: a training set and a test set. The training set consists of 84,000 tweets: it has 420 authors, each author is assigned a label, and each author has 200 tweets. The test set consists of 36,000 tweets: it contains 180 authors with 200 tweets per author. Table 1 summarizes the dataset.

Text Preprocessing

For each author's 200 tweets, we first remove some unusual symbols and strings, then convert all text to lowercase, and finally merge every eight consecutive tweets into one new text, so that each author ends up with 25 new texts. In addition, we divide the training set into five parts following the idea of 5-fold cross-validation, named Data1~5 (as shown in Figure 4). Each of these five datasets is used to train an independent Bert model.
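The merging and splitting steps can be sketched as follows. This is an illustration under assumptions, not the released implementation. The helper names are hypothetical, and whitespace normalisation stands in for the unspecified removal of unusual symbols and strings. How items are assigned to the five folds is also an assumption, here a simple round-robin split.

```python
import re


def merge_tweets(tweets, group_size=8):
    """Lowercase tweets and merge every `group_size` consecutive ones."""
    # Placeholder cleaning: lowercase and collapse whitespace only.
    cleaned = [re.sub(r"\s+", " ", t.lower()).strip() for t in tweets]
    return [
        " ".join(cleaned[i:i + group_size])
        for i in range(0, len(cleaned), group_size)
    ]


def five_fold_splits(items, k=5):
    """Return k (train, validation) pairs; fold v is the validation part."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    splits = []
    for v in range(k):
        train = [x for i, fold in enumerate(folds) if i != v for x in fold]
        splits.append((train, folds[v]))
    return splits


# One hypothetical author with 200 tweets.
tweets = [f"Tweet  {i}" for i in range(200)]
merged = merge_tweets(tweets)      # 200 / 8 = 25 merged texts
splits = five_fold_splits(merged)  # five (train, validation) pairs
```

With 200 tweets and groups of eight, each author yields exactly 25 merged texts, which matches the count given above.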

Experimental setting

In this work, the pre-trained language model chosen for the first part of our model is Bert-base (L=12, H=768, A=12, Total Parameters=110M; see footnote 1). Specifically, we use the HuggingFace implementation (see footnote 2) called BertForSequenceClassification. During the fine-tuning stage of the pre-trained model, we set batch_size=25 and use cross-entropy as Bert's loss function. As the optimizer we choose AdamW, with the learning rate set to 1e-5. For the second part of the model, we employ an ensemble machine learning classifier of LR, RF, and SVM. For these three machine learning models we use the default settings, and we set the "voting" parameter of the VotingClassifier to "soft". We use the PyTorch framework to conduct our experiments. Our source code is publicly available at https://github.com/Zero-Lzy/Pan_2022_Twitter.

Model Prediction Process

When making predictions, we first preprocess the text of the dataset, use the five fine-tuned Bert models to extract features from the text, and then input the extracted features into the corresponding EMLCs. The five EMLCs produce five labels for the text, and a hard vote (see footnote 3) over these five labels gives the final label of the text. The process of obtaining text labels is consistent with the training process, as shown on the right side of Figure 3, except that x_i now represents a preprocessed text in the test set. For this task, what we need to predict is the author label. After the dataset is preprocessed, each author has 25 texts, so 25 text labels are predicted per author. Based on these 25 text labels, we choose the class with the most votes as the final author label. The prediction process is shown in Figure 5.
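The two voting levels can be summarised in a short sketch. The function names and the example votes are hypothetical; the labels that the five EMLCs would produce are written out by hand here.

```python
from collections import Counter


def majority(labels):
    """Return the most frequent label.

    With binary labels and an odd number of votes (5 EMLCs, 25 texts),
    a tie cannot occur.
    """
    return Counter(labels).most_common(1)[0][0]


def predict_author(votes_per_text):
    """votes_per_text: for each of the author's texts, the 5 EMLC labels."""
    # Level 1: hard vote over the five EMLC labels of each text.
    text_labels = [majority(votes) for votes in votes_per_text]
    # Level 2: majority over the author's text labels.
    return majority(text_labels), text_labels


# Hypothetical author: 14 texts voted ironic (1), 11 voted not ironic (0).
votes_per_text = [[1, 1, 1, 0, 0]] * 14 + [[0, 0, 0, 1, 1]] * 11
author_label, text_labels = predict_author(votes_per_text)
```

In this example the author is labelled 1, because 14 of the 25 text-level labels are 1.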

Result

We conducted experiments with 5-fold cross-validation on the training set. The dataset was split into five folds as described in subsection 4.2. Table 2 reports the accuracy obtained on the validation set used at each fold, along with the arithmetic mean. As can be seen from Table 2, our cross-validation experiments achieved an average accuracy of 0.9976, which suggests that our model is relatively reliable. Finally, we compressed the results predicted on the test set and uploaded them to TIRA [12]. As reported on the PAN website, our accuracy on the test set given by the organizer is 0.9222, as shown in Table 3. This is 0.0754 lower than the average accuracy on the validation sets. The gap may arise because the validation set is too small or because the split data are highly correlated, which would inflate the validation accuracy. The model may also still suffer from some unresolved overfitting.
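The summary figures above can be re-derived from the per-fold accuracies in Table 2. The snippet below is a small check, not part of the original experiments. The standard deviation is not listed in Table 2, so its computation here assumes the population form (`pstdev`).

```python
from statistics import mean, pstdev

# Per-fold validation accuracies from Table 2.
fold_accuracies = [0.988, 1.000, 1.000, 1.000, 1.000]
test_accuracy = 0.9222  # Table 3

avg_accuracy = mean(fold_accuracies)    # about 0.9976
std_accuracy = pstdev(fold_accuracies)  # population standard deviation (assumption)
gap = avg_accuracy - test_accuracy      # about 0.0754

print(f"mean={avg_accuracy:.4f} std={std_accuracy:.4f} gap={gap:.4f}")
```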

Conclusion

To address the profiling IROSTEREO task proposed by PAN 2022, this paper offers a feature extraction method based on pre-trained language models and a classifier based on ensemble machine learning models. To mitigate underfitting, we constructed multiple datasets for repeated training and voting. To mitigate overfitting, we used early stopping. In the end, we reached an accuracy of 0.9222 on the test set, which shows that our proposed method is effective for this task.

Figure 1: In the structure of the fine-tuned Bert (inside the red frame), we use the hidden layer vector output from the last layer of Bert as a feature.

Figure 2: The structure of the EMLC.

The training process of our model is divided into two parts: training of the Bert models and training of the EMLCs. In the first part, we feed the five data splits (see Section 4 for details of the data) into five initial Bert models and end up with five fine-tuned Bert models. In the second part, we pass the whole training set through these five fine-tuned Bert models for feature extraction, resulting in five feature representations. These feature representations are then fed into five EMLCs for training, which completes the training of the model. The training process of the EMLC is shown in Figure 3. The x_i on the right side of the figure represents a text in the whole training set.

Figure 3: The left side of the figure is Bert_k, fine-tuned on Data_k. The right side of the figure shows the training process of the EMLC. The features of x_i are extracted by the fine-tuned Bert_k model and then used to train the corresponding EMLC_k (k = 1, 2, 3, 4, 5).

Figure 4: Construction diagram of Data1~5. T stands for the training set, and V stands for the validation set.

Figure 5: Predicting the author label based on the author's preprocessed tweets.

Table 1: Details of the profiling IROSTEREO datasets.

Dataset        Authors  Tweets  Labels
Train dataset  420      84000   420
Test dataset   180      36000   None

Table 2: Results achieved by the model with 5-fold cross-validation on the complete training set.

Fold  Accuracy
1     0.988
2     1.000
3     1.000
4     1.000
5     1.000
Avg.  0.9976

Table 3: Results achieved by our model on the test set.

Rank  Accuracy
40    0.9222

1 https://github.com/google-research/bert

2 https://huggingface.co/

3 The minority obeys the majority. For example, if three of the 5 labels are 0 and two are 1, the final label is 0.

Acknowledgments

This work is supported by the Natural Science Foundation of Guangdong Province, China (No. 2022A1515011544).

References

[1] R. Ortega-Bueno, B. Chulvi, F. Rangel, P. Rosso, E. Fersini, Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR-WS, 2022.

[2] J. Bevendorff, B. Chulvi, E. Fersini, Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection, in: Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, vol. 13390, Springer, 2022.

[3] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter Task at PAN 2021, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS, 2021.

[4] S. Zhang, X. Zhang, J. Chan, P. Rosso, Irony Detection via Sentiment-based Transfer Learning, Information Processing & Management 56 (5) (2019).

[5] J. Devlin, M. W. Chang, K. Lee, Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.

[6] M. Siino, E. D. Nuovo, I. Tinnirello, M. L. Cascia, Detection of Hate Speech Spreaders using Convolutional Neural Networks, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS, 2021.

[7] F. Balouchzahi, S. H. L., G. Sidorov, HSSD: Hate Speech Spreader Detection using N-grams and Voting Classifier, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS, 2021.

[8] A. Vaswani, N. Shazeer, N. Parmar, Attention is all you need, in: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 2017.

[9] R. L. Tamayo, D. Castro-Castro, R. O. Bueno, Deep Modeling of Latent Representations for Twitter Profiles on Hate Speech Spreaders Identification. Notebook for PAN at CLEF 2021, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS, 2021.

[10] J. Sánchez-Junquera, P. Rosso, M. Montes-y-Gómez, B. Chulvi, Masking and BERT-based Models for Stereotype Identification, Procesamiento del Lenguaje Natural (SEPLN) 67 (2021).

[11] Z. Wang, Z. Qu, Research on web text classification algorithm based on improved cnn and svm, in: 2017 IEEE 17th International Conference on Communication Technology (ICCT), IEEE, 2017.

[12] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019.