<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">UMUTeam at IROSTEREO: Profiling Irony and Stereotype spreaders on Twitter combining Linguistic Features with Transformers</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">José</forename><forename type="middle">Antonio</forename><surname>García-Díaz</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad de Murcia</orgName>
								<address>
									<addrLine>Campus de Espinardo</addrLine>
									<postCode>30100</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Miguel</forename><forename type="middle">Ángel</forename><surname>Rodríguez-García</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Departamento de Ciencias de la Computación</orgName>
								<orgName type="institution">Universidad Rey Juan Carlos</orgName>
								<address>
									<postCode>28933</postCode>
									<settlement>Madrid</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Francisco</forename><surname>García-Sánchez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad de Murcia</orgName>
								<address>
									<addrLine>Campus de Espinardo</addrLine>
									<postCode>30100</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rafael</forename><surname>Valencia-García</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad de Murcia</orgName>
								<address>
									<addrLine>Campus de Espinardo</addrLine>
									<postCode>30100</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">UMUTeam at IROSTEREO: Profiling Irony and Stereotype spreaders on Twitter combining Linguistic Features with Transformers</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3542F98315EF33321543370C5F2F17CB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Author Profiling</term>
					<term>Irony and Stereotypes</term>
					<term>Stance detection</term>
					<term>Feature Engineering</term>
					<term>Deep Learning</term>
					<term>Transformers</term>
					<term>Natural Language Processing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Irony is a curious mode of communication in which speakers say something that they want the audience to interpret in the opposite sense. Its automatic detection is a very challenging task due to its complex interpretation, and it has significant potential for various applications in text mining. Social media platforms like Twitter offer a vital opportunity to analyze this literary technique, since users frequently employ it to give their opinions. In this working note, we describe the contribution designed for PAN's shared author profiling task and its subtask concerning Stereotype Stance Detection. The former consists of determining whether authors spread irony and stereotypes, and the latter is focused on identifying stereotypes that can hurt vulnerable groups. The organizers provide a dataset compiled from Twitter to carry out the task. In particular, we propose a supervised learning pipeline consisting of a combination of Deep Learning techniques that uses contextual and non-contextual embeddings to address the binary classification. The resulting system reaches promising results, achieving the fifth-best score in the main task with an accuracy of 96.67%.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>With the proliferation of social media, irony has become one of the literary devices most frequently employed in this communication medium <ref type="bibr" target="#b0">[1]</ref>. Several definitions of irony have been provided in the literature, but they concur on the same binary classification: verbal and situational irony. The former has been conceived as the act of using words that mean the opposite of what you think, particularly for humorous effect <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>. The latter has been defined as a strange or funny situation in which things happen in a way that seems to be the opposite of what you expected <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. Both definitions highlight one of the primary features of this rhetorical device: making something understandable by expressing the opposite <ref type="bibr" target="#b2">[3]</ref>. Such rhetorical complexity produces dialogue that is occasionally arduous for humans to comprehend <ref type="bibr" target="#b5">[6]</ref>. This challenge has attracted the research community's attention, and in recent years several approaches have been published addressing the detection of irony in natural language text obtained from different social media sources. In particular, we focus on identifying whether an author spreads Irony and Stereotypes for the PAN shared challenge <ref type="bibr" target="#b6">[7]</ref>.</p><p>Irony and Stereotype identification is an essential task in social media applications since it enables the identification of online abuse and harassment <ref type="bibr" target="#b7">[8]</ref>. Automatic detection in written discourse is a complex task in which traditional text mining methods cannot be applied successfully <ref type="bibr" target="#b8">[9]</ref>. 
The drawback of these conventional methods is that identification requires semantic information that cannot be inferred from the word counts computed in document analysis <ref type="bibr" target="#b9">[10]</ref>. To overcome this deficiency, more sophisticated Machine Learning methods have been applied to the problem; although the results obtained are quite competitive, there is still scope for improvement <ref type="bibr" target="#b10">[11]</ref>.</p><p>In this working note, we face the proposed author profiling challenge by constructing a supervised classification pipeline. The method comprises four stages: a pre-processing stage to clean the provided dataset; a feature-collection stage, in which contextual and non-contextual embeddings are compiled; a training stage, in which several machine learning models are built; and, finally, an evaluation stage, in which the designed models are assessed.</p><p>The remainder of this working note is organized as follows: Section 2 provides a brief review of the related work, examining distinct approaches proposed in the literature that address this challenge. Section 3 specifies the methods developed to address the challenge. In Section 4 the results achieved in the challenge are presented. In addition, we report separately our participation in a subtask concerning Stereotype Stance Detection in Section 5. Finally, Section 6 summarizes the findings of this work and outlines some future lines to explore.</p></div>
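The four stages above can be sketched as a minimal Python pipeline. All names (`preprocess`, `extract_features`, the cleaning rules) are illustrative stand-ins, not the authors' actual implementation:

```python
import re

def preprocess(tweet: str) -> str:
    """Stage 1 (sketch): lowercase and strip mentions, hyperlinks, digits."""
    tweet = tweet.lower()
    tweet = re.sub(r"@\w+|https?://\S+|\d+", " ", tweet)
    return re.sub(r"\s+", " ", tweet).strip()

def run_pipeline(tweets, labels, extract_features, model):
    """Stages 2-4 (sketch): collect features, train, evaluate on a
    held-out split (here simply the last 20% of the data)."""
    texts = [preprocess(t) for t in tweets]
    X = [extract_features(t) for t in texts]
    split = int(0.8 * len(X))
    model.fit(X[:split], labels[:split])
    return model.score(X[split:], labels[split:])
```

Any estimator exposing `fit`/`score` can be plugged in as `model`, so each stage can be swapped out independently.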
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Due to the complexity of recognizing verbal irony in natural language text, we can find different approaches, ranging from simple strategies to more complex ones. Barbieri and Saggion in <ref type="bibr" target="#b11">[12]</ref> proposed two tree-based classifiers, Random Forest and Decision Tree. They represent each tweet using the following seven groups of features: i) frequency, to analyze the gap between the rare and common words used; ii) written-spoken, to capture the users' style; iii) intensity, to measure the strength of adverbs and adjectives; iv) structure, which analyzes length, punctuation, and emoticons; v) sentiments, which uses SentiWordNet to measure the gap between positive and negative terms; vi) synonyms, comparing common vs. rare synonyms; and, finally, vii) ambiguity, to analyze possible ambiguities. Furthermore, this approach explores the use of a bag-of-words representation based on frequency analysis. Anchiêta et al. in <ref type="bibr" target="#b12">[13]</ref> proposed two more complex, differentiated strategies. First, they combined Term Frequency-Inverse Document Frequency (TF-IDF) and a linear Support Vector Machine (SVM): the former was used to extract the features from the datasets, and the latter was the classifier for the identification task, trained using the Stochastic Gradient Descent (SGD) technique. Second, they combined embeddings created with the Distributed Bag of Words Paragraph Vector model and a Multi-Layer Perceptron (MLP) to tackle the classification task. At a different level of complexity, Wu et al. in <ref type="bibr" target="#b13">[14]</ref> proposed the Dense-LSTM model, based on a densely connected LSTM network with a multi-task learning strategy. 
It comprises an embedding layer to convert the input tweets into a sequence of dense vectors and four concatenated Bi-LSTM layers with 200-dimensional hidden states to learn different levels of information simultaneously. Furthermore, they combine two different pre-trained word embeddings, which are concatenated and used as input.</p></div>
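The TF-IDF plus linear SVM strategy described above can be reproduced in a few lines with scikit-learn, where `SGDClassifier` with hinge loss corresponds to a linear SVM trained with Stochastic Gradient Descent. The toy tweets and labels below are invented for illustration and are not from the paper's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 1 = ironic, 0 = non-ironic.
texts = ["oh great, another monday", "what a lovely surprise",
         "sure, because that always works", "thanks for the help"]
labels = [1, 0, 1, 0]

# TF-IDF feature extraction followed by a linear SVM trained with SGD.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    SGDClassifier(loss="hinge", random_state=0))
clf.fit(texts, labels)
pred = clf.predict(["sure, what a lovely monday"])
```

With real data, the same pipeline object can be cross-validated or grid-searched as a single estimator.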
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>The IROSTEREO challenge consists of a binary classification task from an author profiling perspective. The dataset proposed for this task is compiled from Twitter. The training dataset comprises a total of 420 different users, grouped into those who are irony and stereotype spreaders (I) and those who are not (NI). For each user, there are 200 of their tweets written in English <ref type="bibr" target="#b14">[15]</ref>. We separate a small subset from the training dataset to perform a custom validation. The statistics of the dataset are depicted in Table <ref type="table" target="#tab_0">1</ref>. We followed a typical supervised classification pipeline to solve the proposed task: we start by applying a pre-processing stage to the dataset; then, we compile the feature sets, train several machine learning models, and evaluate them using a custom validation split.</p><p>The pre-processing stage consists of creating an alternative version of the documents by encoding them in lowercase and removing mentions, hyperlinks, digits, punctuation, and expressive lengthening. Besides, we expand texting language and fix misspellings. The alternative version is used to extract the majority of the feature sets based on sentence embeddings and linguistic features.</p><p>The feature sets involved in our experimentation consist of linguistic features from UMUTextStats (LF) <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref> and three sentence embeddings: non-contextual sentence embeddings from FastText (SE) <ref type="bibr" target="#b17">[18]</ref>, and two contextual embeddings from BERT (BF) <ref type="bibr" target="#b18">[19]</ref> and RoBERTa (RF) <ref type="bibr" target="#b19">[20]</ref>. These feature sets were used separately and combined using two approaches: one based on knowledge integration and another based on ensemble learning. 
For the ensemble learning, we evaluate four strategies: i) soft voting, ii) hard voting, iii) average probabilities, and iv) highest probability. The hard voting strategy is a weighted mode, with the weights based on the F1-score results on the custom validation split.</p><p>As we deal with author analysis, the results are reported at the author level. Nevertheless, some of the described stages of our pipeline are performed at the document level. For example, the features are compiled at the document level and then combined for each user to produce a unique vector per user.</p><p>To extract the contextual sentence embeddings from BERT and RoBERTa, we fine-tune the models on the IROSTEREO dataset and then obtain the value of the [CLS] token <ref type="bibr" target="#b20">[21]</ref>. In order to find the best hyperparameters, we trained ten models for BERT and ten for RoBERTa. The hyperparameters are i) the weight decay, ii) the batch size, iii) the warm-up speed, iv) the number of epochs, and v) the learning rate. This step is performed using the Tree of Parzen Estimators (TPE) <ref type="bibr" target="#b21">[22]</ref>, a method for choosing hyperparameters based on Bayesian reasoning and expected improvement.</p><p>Next, we train several neural networks for each feature set and for the combination of all feature sets using a knowledge integration strategy. The tuned hyperparameters include the shape of the network, the dropout mechanism, the learning rate, and the activation function. Table <ref type="table" target="#tab_1">2</ref> depicts the best hyperparameters for this task. It can be observed that the majority of the best results are obtained with shallow neural networks with two hidden layers but a large number of neurons. The only exception is SE, which achieved its best result with 7 hidden layers and 27 neurons in a long funnel shape. 
Besides, all experiments achieved better results with high dropout rates and a learning rate of 0.010 with no activation function (linear). The exception, again, is SE, which uses a smaller learning rate, a smaller dropout ratio, and elu as the activation function. </p></div>
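The four ensemble strategies described above can be sketched as follows for a single instance's class-probability matrix. The paper only specifies that hard voting is weighted by validation F1-scores; the exact definitions of the other strategies below are our assumptions:

```python
import numpy as np

def ensemble(probs, strategy, weights=None):
    """Combine per-model class probabilities for one instance.

    probs: array of shape (n_models, n_classes); weights: one per model
    (used for soft and hard voting; defaults to uniform weights).
    Returns the index of the winning class.
    """
    probs = np.asarray(probs, dtype=float)
    w = np.ones(len(probs)) if weights is None else np.asarray(weights, float)
    if strategy == "soft":       # weighted sum of probability vectors
        return int(np.argmax(w @ probs))
    if strategy == "hard":       # weighted majority vote over predictions
        votes = np.argmax(probs, axis=1)
        counts = np.bincount(votes, weights=w, minlength=probs.shape[1])
        return int(np.argmax(counts))
    if strategy == "average":    # unweighted mean probability
        return int(np.argmax(probs.mean(axis=0)))
    if strategy == "highest":    # label of the single most confident model
        return int(np.unravel_index(np.argmax(probs), probs.shape)[1])
    raise ValueError(strategy)
```

Note that the strategies can disagree: one very confident model can dominate "highest" while being outvoted under "hard".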
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and analysis</head><p>First, we report the results achieved on our custom validation split. These results include each label's precision, recall, and F1-score, and the macro and weighted averages for the whole task. We report the results of each feature set trained separately in Tables <ref type="table" target="#tab_2">3, 4</ref>, 5, and 6 for LF, SE, BF, and RF, respectively; the results for the KI strategy in Table <ref type="table">7</ref>; and the results for the four ensemble learning strategies in Tables <ref type="table" target="#tab_3">8, 9</ref>, 10, and 11 for hard voting, soft voting, averaging probabilities, and highest probability, respectively. From the results achieved on the custom validation split, which are reported at the user level, we can assume that determining whether a user is an irony and stereotype spreader is a somewhat trivial task. It is worth mentioning that results at the document level would be more limited. Among the feature sets trained separately, the best results are achieved with BERT. However, the limited results achieved with RoBERTa draw our attention (see Table <ref type="table">6</ref>). We observed that all the incorrect predictions are instances of the I label that the model predicts as NI. We also compared the predictions of BF and RF and observed that the BF model outputs probabilities near 100%, whereas RF is less accurate.</p><p>We can observe that the features based on pure linguistics also achieve results similar to those obtained with state-of-the-art embeddings. The LF features include features related to stylometry, lexis, social media jargon, and Part-of-Speech features. 
In order to gain insight into the interpretability of the features, we calculate the Information Gain of the linguistic features and normalize the ten that achieved the highest coefficients for the I and NI labels (see Figure <ref type="figure" target="#fig_0">1</ref>). It can be observed that the majority of the most discriminating features are related to stylometry, including the number of words, the number of words per sentence, the usage of full stops, and some readability formulas. There are two linguistic features concerning morphology: the usage of interjections and the usage of singular words. In view of these results, we submitted one run for this shared task based on the Knowledge Integration strategy, achieving the fifth-best result, with an accuracy of 96.67%, out of a total of 65 participants. We selected the Knowledge Integration strategy over the two ensemble learning strategies that achieved the same results (hard and soft voting) because Knowledge Integration has reported better results in other shared tasks in the past. Table <ref type="table" target="#tab_4">12</ref> contains the best results along with the baselines proposed by the organizers. It is worth mentioning that these results were yielded by TIRA <ref type="bibr" target="#b22">[23]</ref>, an Integrated Research Architecture used by the IROSTEREO organizers to manage the execution of the participants' algorithms.</p></div>
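The information-gain ranking described above can be approximated with scikit-learn's `mutual_info_classif` (mutual information is the information-gain criterion for feature scoring). The feature matrix below is random stand-in data, not the actual UMUTextStats features:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((80, 20))              # 80 authors x 20 linguistic features
y = rng.integers(0, 2, size=80)       # I (1) vs NI (0) labels (toy)

# Score each feature by mutual information with the label, then keep
# the indices of the ten highest-scoring features.
scores = mutual_info_classif(X, y, random_state=0)
top_ten = np.argsort(scores)[::-1][:10]
```

In practice the resulting indices would be mapped back to feature names (word counts, readability formulas, etc.) before plotting.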
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Stereotype Stance Detection subtask</head><p>The organizers of the IROSTEREO shared task proposed a minor challenge consisting of determining whether stereotypes are used in favor of or against the target. For this, they released a training dataset in which 94 authors were tagged as against and 46 authors were tagged as in favor.</p><p>To solve this challenge, we used the same pipeline described for the main challenge. Our results on our custom validation split are promising. We report the Knowledge Integration strategy and the four ensemble learning strategies in Table <ref type="table" target="#tab_5">13</ref>. We achieved a macro F1-score of 82.8753% with the Knowledge Integration strategy and a macro F1-score of 78.5714% with the ensemble learning based on soft voting.</p><p>However, our results on the official leaderboard were limited. We achieved a macro F1-score of 53.12% (an F1 of 25% for the In Favor label and an F1 of 81.25% for the Against label).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and future work</head><p>This working note describes the participation of the UMUTeam in the IROSTEREO shared task concerning author profiling. This is a binary classification task in which the participants are challenged to identify which Twitter profiles are spreaders of Irony and Stereotypes. Our proposal is grounded on the combination of several feature sets based on linguistic features and sentence embeddings. We achieved promising results on our custom validation split and a final accuracy of 96.67% on the official leaderboard. One of the limitations of our work is the results achieved with RoBERTa (RF). Although we searched for common errors in our pipeline, we could not identify the reason for the limited results. To address this issue, we suggest combining document-level analysis with tools such as SHAP <ref type="bibr" target="#b23">[24]</ref> in order to find the reason for the wrong predictions. Besides, we obtained wrong predictions with the highest probability strategy (see Table <ref type="table">11</ref>), as this ensemble always outputs the I label (100% of the predictions). We suspect this issue is related to an error in the code that generates the final report.</p><p>As future work, we will incorporate cross-validation techniques into our pipeline and data-augmentation techniques to increase our models' generalization.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Information gain of the ten features with the highest information gain</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 IROSTEREO</head><label>1</label><figDesc></figDesc><table><row><cell>dataset's users</cell><cell></cell></row><row><cell></cell><cell>train val total</cell></row><row><cell>I</cell><cell>166 44 210</cell></row><row><cell>NI</cell><cell>176 34 210</cell></row><row><cell cols="2">TOTAL 342 78 420</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Best hyper-parameters for each feature set trained separately and combined using knowledge integration.</figDesc><table><row><cell cols="2">Feature set shape</cell><cell cols="3">hidden layers neurons dropout</cell><cell>lr activation</cell></row><row><cell>LF</cell><cell>brick</cell><cell>2</cell><cell>128</cell><cell cols="2">.3 0.010 linear</cell></row><row><cell>SE</cell><cell>long funnel</cell><cell>7</cell><cell>27</cell><cell cols="2">.1 0.001 elu</cell></row><row><cell>BF</cell><cell>brick</cell><cell>2</cell><cell>512</cell><cell cols="2">.3 0.010 linear</cell></row><row><cell>RF</cell><cell>brick</cell><cell>2</cell><cell>512</cell><cell cols="2">.3 0.010 linear</cell></row><row><cell>KI</cell><cell>brick</cell><cell>2</cell><cell>512</cell><cell cols="2">.3 0.010 linear</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Classification report for LF.</figDesc><table><row><cell></cell><cell></cell><cell>Table 4</cell><cell></cell></row><row><cell></cell><cell></cell><cell cols="2">Classification report for SE.</cell></row><row><cell></cell><cell>precision recall f1-score</cell><cell></cell><cell>precision</cell><cell>recall f1-score</cell></row><row><cell>I</cell><cell>94.737 97.297 96.000</cell><cell>I</cell><cell cols="2">100.000 94.595 97.222</cell></row><row><cell>NI</cell><cell>97.826 95.745 96.774</cell><cell>NI</cell><cell cols="2">95.918 100.000 97.917</cell></row><row><cell>macro avg</cell><cell>96.281 96.521 96.387</cell><cell>macro avg</cell><cell cols="2">97.959 97.297 97.569</cell></row><row><cell>weighted avg</cell><cell>96.465 96.429 96.433</cell><cell>weighted avg</cell><cell cols="2">97.716 97.619 97.611</cell></row><row><cell>Table 5</cell><cell></cell><cell>Table 6</cell><cell></cell></row><row><cell cols="2">Classification report for BF.</cell><cell cols="2">Classification report for RF.</cell></row><row><cell></cell><cell>precision recall f1-score</cell><cell></cell><cell cols="2">precision recall f1-score</cell></row><row><cell>I</cell><cell>97.297 97.297 97.297</cell><cell>I</cell><cell cols="2">64.286 48.649 55.385</cell></row><row><cell>NI</cell><cell>97.872 97.872 97.872</cell><cell>NI</cell><cell cols="2">66.071 78.723 71.845</cell></row><row><cell>macro avg</cell><cell>97.585 97.585 97.585</cell><cell>macro avg</cell><cell cols="2">65.179 63.686 63.615</cell></row><row><cell>weighted avg</cell><cell>97.619 97.619 97.619</cell><cell>weighted avg</cell><cell cols="2">65.285 65.476 64.594</cell></row><row><cell>Table 7</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">Classification report for KI.</cell><cell></cell><cell></cell></row><row><cell></cell><cell>precision recall f1-score</cell><cell></cell><cell></cell></row><row><cell>I</cell><cell>97.297 97.297 97.297</cell><cell></cell><cell></cell></row><row><cell>NI</cell><cell>97.872 97.872 97.872</cell><cell></cell><cell></cell></row><row><cell>macro avg</cell><cell>97.585 97.585 97.585</cell><cell></cell><cell></cell></row><row><cell>weighted avg</cell><cell>97.619 97.619 97.619</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 8</head><label>8</label><figDesc>Classification report for EL (hard-voting).</figDesc><table><row><cell></cell><cell>precision recall f1-score</cell></row><row><cell>I</cell><cell>97.297 97.297 97.297</cell></row><row><cell>NI</cell><cell>97.872 97.872 97.872</cell></row><row><cell>macro avg</cell><cell>97.585 97.585 97.585</cell></row><row><cell>weighted avg</cell><cell>97.619 97.619 97.619</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 12</head><label>12</label><figDesc>Top results and baselines from the official leader board for the IROSTEREO 2022 shared task, ranked by accuracy</figDesc><table><row><cell>POS Team</cell><cell>Accuracy</cell></row><row><cell>1 wentaoyu</cell><cell>0.9944</cell></row><row><cell>2 harshv</cell><cell>0.9778</cell></row><row><cell>3 edapal</cell><cell>0.9722</cell></row><row><cell>3 ikae</cell><cell>0.9722</cell></row><row><cell>5 UMUTEAM</cell><cell>0.9667</cell></row><row><cell>5 Enrub</cell><cell>0.9667</cell></row><row><cell>LDSE</cell><cell>0.9389</cell></row><row><cell>RF + char 2-ngrams</cell><cell>0.8610</cell></row><row><cell>LR + word 1-ngrams</cell><cell>0.8490</cell></row><row><cell>LSTM+Bert-encoding</cell><cell>0.6940</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 13</head><label>13</label><figDesc>Macro precision, recall and F1-score for the Stance detection subtask using the custom validation split. KI stands for Knowledge Integration and EL for Ensemble Learning</figDesc><table><row><cell></cell><cell>precision recall f1-score</cell></row><row><cell>KI</cell><cell>93.478 78.571 82.875</cell></row><row><cell>EL -soft-voting</cell><cell>83.182 76.071 78.571</cell></row><row><cell>EL -hard-voting</cell><cell>91.667 71.429 75.455</cell></row><row><cell>EL -average probabilities</cell><cell>78.804 68.929 71.459</cell></row><row><cell>EL -highest probability</cell><cell>37.037 50.000 42.553</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work is part of the research project LaTe4PSP (PID2019-107652RB-I00) funded by MCIN/ AEI/10.13039/501100011033. This work is also part of the research project PDC2021-121112-I00 funded by MCIN/AEI/10.13039/501100011033, by the European Union NextGenerationEU/PRTR, and by "Programa para la Recualificación del Sistema Universitario Español 2021-2023". In addition, José Antonio García-Díaz is supported by Banco Santander and the University of Murcia through the Doctorado Industrial programme.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">An impact analysis of features in a classification approach to irony detection in product reviews</title>
		<author>
			<persName><forename type="first">K</forename><surname>Buschmeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Klinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th workshop on computational approaches to subjectivity, sentiment and social media analysis</title>
				<meeting>the 5th workshop on computational approaches to subjectivity, sentiment and social media analysis</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="42" to="49" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Compilation and evaluation of the spanish saticorpus 2021 for satire identification using linguistic features and transformers</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Complex &amp; Intelligent Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="14" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Irony</title>
		<author>
			<persName><forename type="first">J</forename><surname>Garmendia</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Hunter</surname></persName>
		</author>
		<ptr target="https://books.google.es/books?id=w7sYBAAAQBAJ" />
		<title level="m">Evaluating the Circumstances, John P. Hunter III</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Preparation for Critical Instruction: How to Explain Subject Matter While Teaching All Learners to Think, Read, and Write Critically</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Maiorana</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>Rowman &amp; Littlefield</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A Deep Learning Model for Detecting Sarcasm in Written Product Reviews</title>
		<author>
			<persName><forename type="first">N</forename><surname>Schwarz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Interactive Media; FH Oberösterreich -Fakultät für informatik</title>
				<imprint>
			<publisher>Kommunikation und Medien</publisher>
			<biblScope unit="page">4232</biblScope>
		</imprint>
	</monogr>
	<note>Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Overview of pan 2022: Authorship verification, profiling irony and stereotype spreaders, style change detection, and trigger detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bevendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fersini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Heini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kredens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayerl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ortega-Bueno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pęzik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Information Retrieval</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="331" to="338" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">What a sunny day: toward emoji-sensitive irony detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hayati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Otani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Black</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page">212</biblScope>
			<pubPlace>W-NUT</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Irony detection on microposts with limited set of features</title>
		<author>
			<persName><forename type="first">H</forename><surname>Taslioglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Karagoz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Symposium on Applied Computing</title>
				<meeting>the Symposium on Applied Computing</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1076" to="1081" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Computational irony: A survey and new perspectives</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Wallace</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence Review</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="page" from="467" to="483" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Masking and BERT-based models for stereotype identification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sánchez-Junquera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="page" from="83" to="94" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Modelling irony in Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Barbieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Saggion</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics</title>
				<meeting>the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="56" to="64" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">PiLN IDPT 2021: Irony detection in Portuguese texts with superficial features and embeddings</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Anchiêta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A R</forename><surname>Neto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Marinho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">V</forename><surname>Do Nascimento</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Moura</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference of the Spanish Society for Natural Language Processing</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference of the Spanish Society for Natural Language Processing<address><addrLine>Málaga, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021-09">September 2021</date>
			<biblScope unit="volume">2943</biblScope>
			<biblScope unit="page" from="917" to="924" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">THU_NGN at SemEval-2018 task 3: Tweet irony detection with densely connected LSTM and multi-task learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 12th International Workshop on Semantic Evaluation</title>
				<meeting>The 12th International Workshop on Semantic Evaluation</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="51" to="56" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) at PAN 2022</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ortega-Bueno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fersini</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF 2022 Labs and Workshops</title>
		<title level="s">Notebook Papers</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Psychographic traits identification based on political ideology: An author analysis study on spanish politicians&apos; tweets posted in 2020</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Colomo-Palacios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">130</biblScope>
			<biblScope unit="page" from="59" to="74" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Evaluating feature combination strategies for hate-speech detection in spanish using linguistic features and transformers</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Jiménez-Zafra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>García-Cumbreras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Complex &amp; Intelligent Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="22" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Learning word vectors for 157 languages</title>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<idno>CoRR abs/1802.06893</idno>
		<ptr target="http://arxiv.org/abs/1802.06893" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NAACL-HLT</title>
				<meeting>NAACL-HLT</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>CoRR abs/1907.11692</idno>
		<ptr target="http://arxiv.org/abs/1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1908.10084" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bergstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yamins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cox</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="115" to="123" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">TIRA Integrated Research Architecture</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-22948-1_5</idno>
	</analytic>
	<monogr>
		<title level="m">Information Retrieval Evaluation in a Changing World, The Information Retrieval Series</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Peters</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A unified approach to interpreting model predictions</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Lundberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-I</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
