=Paper=
{{Paper
|id=Vol-3180/paper-196
|storemode=property
|title=UMUTeam at IROSTEREO: Profiling Irony and Stereotype spreaders on Twitter combining
Linguistic Features with Transformers
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-196.pdf
|volume=Vol-3180
|authors=José Antonio García Díaz,Miguel Ángel Rodríguez-García,Francisco García-Sánchez,Rafael Valencia-Garcia
|dblpUrl=https://dblp.org/rec/conf/clef/Garcia-DiazR0V22
}}
==UMUTeam at IROSTEREO: Profiling Irony and Stereotype spreaders on Twitter combining
Linguistic Features with Transformers==
José Antonio García-Díaz (1), Miguel Ángel Rodríguez-García (2),
Francisco García-Sánchez (1) and Rafael Valencia-García (1)
(1) Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100, Spain
(2) Departamento de Ciencias de la Computación, Universidad Rey Juan Carlos, 28933 Madrid, Spain
Abstract
Irony is a curious mode of communication in which the speaker says something that the audience
is meant to interpret in the opposite sense. Its automatic detection is a very challenging task due
to its complex interpretation, and it has significant potential for various applications in text mining.
Social media platforms like Twitter offer a valuable opportunity to analyze this literary technique,
since users frequently employ it to give their opinions. In this working note, we describe our
contribution to PAN’s shared author profiling task and its subtask concerning Stereotype Stance
Detection. The former consists in determining whether authors spread irony and stereotypes, and
the latter focuses on identifying stereotypes that can hurt vulnerable groups. The organizers provide
a dataset compiled from Twitter to carry out the task. In particular, we propose a supervised learning
pipeline consisting of a combination of Deep Learning techniques that uses contextual and
non-contextual embeddings to address the binary classification. The resulting system reaches
promising results, achieving the fifth-best score in the main task with an accuracy of 96.67%.
Keywords
Author Profiling, Irony and Stereotypes, Stance detection, Feature Engineering, Deep Learning,
Transformers, Natural Language Processing
1. Introduction
With the proliferation of social media, irony has become one of the most widely used literary
devices in this form of communication [1]. Several definitions of irony have been proposed in the
literature, but they agree on the same two-way distinction between verbal and situational irony.
The former is understood as the use of words that mean the opposite of what one thinks,
particularly to be funny [2, 3]. The latter is defined as a situation that is strange or funny because
things happen in a way that seems to be the opposite of what was expected [4, 5]. Both definitions
highlight one of the primary features of this rhetorical device: making something understood by
expressing its opposite [3]. Such rhetorical complexity makes dialogue occasionally arduous for
humans to comprehend [6]. This challenge has attracted the research community’s attention. In
recent years, several approaches have been published addressing the detection of irony in natural
language text obtained from different social media sources. In particular, we have focused on
identifying whether an author spreads Irony and Stereotypes for the PAN shared challenge [7].
CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
Email: joseantonio.garcia8@um.es (J. A. García-Díaz); miguel.rodriguez@urjc.es (M. Á. Rodríguez-García);
frgarcia@um.es (F. García-Sánchez); valencia@um.es (R. Valencia-García)
ORCID: 0000-0002-3651-2660 (J. A. García-Díaz); 0000-0001-6244-653 (M. Á. Rodríguez-García); 0000-0003-2667-5359
(F. García-Sánchez); 0000-0003-2457-1791 (R. Valencia-García)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Irony and Stereotype identification is an essential task in social media applications since it
enables the identification of online abuse and harassment [8]. Automatic detection in written
discourse is a complex task to which traditional text mining methods cannot be applied successfully
[9]. The drawback of these conventional methods is that the identification requires semantics that
cannot be inferred from word counts computed from document analysis [10]. To overcome this
deficiency, more sophisticated Machine Learning methods have been applied to the problem.
Although the results obtained are quite competitive, there is still scope for improvement [11].
In this working note, we address the proposed author profiling challenge by constructing
a supervised classification pipeline. The method comprises four stages: a pre-processing stage to
clean the provided dataset; a feature-collection stage, where contextual and non-contextual
embeddings are used; a training stage, where several machine learning models are trained; and,
finally, an evaluation stage, where we evaluate the designed models.
The remainder of this working note is organized as follows: Section 2 provides a brief review
of the related work and examines distinct approaches proposed in the literature to address this
challenge. Section 3 specifies the methods developed for addressing the challenge. Section 4
presents the results achieved in the challenge. In addition, we report separately our
participation in a subtask concerning Stereotype Stance Detection in Section 5. Finally, Section
6 summarizes the findings obtained in this work and outlines some future lines of research.
2. Related work
Due to the complexity of recognizing verbal irony in natural language text, we can find
different approaches, ranging from simple strategies to more complex ones. Barbieri
and Saggion [12] proposed two tree-based classifiers, Random Forest and Decision Tree. They
represent each tweet using the following seven groups of features: i) frequency, to analyze
the gap between rare and common words used by users; ii) written-spoken, to capture the
users’ style; iii) intensity, to measure the strength of adverbs and adjectives; iv) structure, which
analyzes length, punctuation and emoticons; v) sentiments, which uses SentiWordNet to measure
the gap between positive and negative terms; vi) synonyms, for comparing common vs rare
synonyms; and, finally, vii) ambiguity, to analyze possible ambiguities. Furthermore,
this approach explores the use of a bag-of-words representation based on frequency analysis.
Anchiêta et al. [13] proposed two more complex strategies. Firstly, they
combined Term Frequency–Inverse Document Frequency (TF–IDF) and a Linear Support Vector
Machine (SVM). The former was used to extract the features from the datasets, and the latter was the
classifier used for the identification task. The classifier was trained using the Stochastic
Gradient Descent (SGD) technique. Secondly, they combined embeddings created using the
Distributed Bag of Words Paragraph Vector model with a Multi-Layer Perceptron (MLP) for
tackling the classification task. With a different level of complexity, Wu et al. [14] proposed
the Dense-LSTM model, based on a densely connected LSTM network with a multi-task learning
strategy. It comprises an embedding layer that converts the input tweets into a sequence of dense
vectors and four concatenated Bi-LSTM layers with 200-dimensional hidden states to learn different
levels of information simultaneously. Furthermore, they concatenate two different pre-trained
word embeddings and use them as input.
3. Methodology
The IROSTEREO challenge consists of a binary classification from an author profiling perspective.
The dataset proposed in this task is compiled from Twitter. The training dataset has a total of
420 different users. The users are grouped into those who are irony and stereotype spreaders
(I) and those who are not (NI). For each user, the dataset contains 200 tweets written in English
[15]. We separate a small subset from the training dataset to perform a custom validation. The
statistics of the dataset are shown in Table 1.
Table 1
Number of users per label in the IROSTEREO dataset.
Label   train   val   total
I       166     44    210
NI      176     34    210
TOTAL   342     78    420
We follow a typical supervised classification pipeline for solving the proposed task. We
start by applying a pre-processing stage to the dataset. Then, we compile the feature sets, train
several machine learning models, and evaluate them using a custom validation split.
The pre-processing stage consists in the creation of an alternative version of the documents by
converting them to lowercase and removing mentions, hyperlinks, digits, punctuation, and expressive
lengthening. In addition, we expand texting language and fix misspellings. The alternative version
is used to extract the majority of the feature sets based on sentence embeddings and linguistic
features.
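The cleaning steps listed above can be sketched with regular expressions. This is only an illustrative approximation, not the pipeline used in this work: the function name `clean_tweet`, the patterns, and the choice to collapse a character repeated three or more times into a single character are all assumptions, and the expansion of texting language and the spelling correction are not covered.

```python
import re
import string

# All patterns below are illustrative assumptions, not the authors' exact rules.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
DIGITS = re.compile(r"\d+")
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")
LENGTHENING = re.compile(r"(\w)\1{2,}")  # same character repeated 3+ times
SPACES = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Return a lowercased copy without URLs, mentions, digits, punctuation
    or expressive lengthening (hypothetical helper)."""
    text = text.lower()
    text = URL.sub(" ", text)          # hyperlinks first, they may contain '@'
    text = MENTION.sub(" ", text)
    text = DIGITS.sub(" ", text)
    text = PUNCT.sub(" ", text)        # ASCII punctuation only
    text = LENGTHENING.sub(r"\1", text)  # heuristic: "sooooo" -> "so"
    return SPACES.sub(" ", text).strip()


print(clean_tweet("@user Sooooo funny!!! 100% http://t.co/abc"))
```

Collapsing repetitions to a single character is a crude heuristic (it cannot tell an elongated word from a legitimately doubled letter followed by lengthening), which is one reason a spelling-correction step afterwards is useful.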
The feature sets involved in our experimentation consist of linguistic features from
UMUTextStats (LF) [16, 17] and three sentence embeddings: non-contextual sentence embeddings
from FastText (SE) [18], and two contextual embeddings from BERT (BF) [19] and RoBERTa
(RF) [20]. These feature sets were used separately and combined using two approaches: one
based on knowledge integration and another based on ensemble learning. For ensemble
learning, we evaluate four strategies: i) soft voting, ii) hard voting, iii) average probabilities,
and iv) highest probability. The hard voting strategy is a weighted mode whose weights are
based on the F1-scores obtained on the custom validation split.
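To make the four ensemble strategies concrete, the following sketch combines the per-model probabilities of a single user. It is an assumption-laden illustration rather than the implementation used here: the text does not specify how soft voting differs from averaging probabilities, so soft voting is assumed to be a weighted average and "average probabilities" an unweighted one. All function names, probabilities and weights are hypothetical.

```python
from collections import defaultdict

LABELS = ("I", "NI")  # ties are resolved in favour of the first label


def hard_voting(probs, weights):
    """Weighted mode: each model votes for its top label with its weight."""
    score = defaultdict(float)
    for name, p in probs.items():
        top = max(LABELS, key=lambda label: p[label])
        score[top] += weights[name]
    return max(LABELS, key=lambda label: score[label])


def soft_voting(probs, weights):
    """Assumed definition: argmax of the weighted mean of the probabilities."""
    total = sum(weights[name] for name in probs)
    mean = {
        label: sum(weights[name] * p[label] for name, p in probs.items()) / total
        for label in LABELS
    }
    return max(LABELS, key=lambda label: mean[label])


def average_probabilities(probs):
    """Argmax of the unweighted mean of the probabilities."""
    mean = {
        label: sum(p[label] for p in probs.values()) / len(probs)
        for label in LABELS
    }
    return max(LABELS, key=lambda label: mean[label])


def highest_probability(probs):
    """Label of the single most confident prediction among all models."""
    best_label, best_p = LABELS[0], -1.0
    for p in probs.values():
        for label in LABELS:
            if p[label] > best_p:
                best_label, best_p = label, p[label]
    return best_label


# Hypothetical outputs of three models for one user, and made-up F1 weights.
probs = {
    "LF": {"I": 0.55, "NI": 0.45},
    "SE": {"I": 0.60, "NI": 0.40},
    "BF": {"I": 0.05, "NI": 0.95},
}
weights = {"LF": 0.96, "SE": 0.97, "BF": 0.97}

print(hard_voting(probs, weights))      # I  (two weak votes beat one strong vote)
print(soft_voting(probs, weights))      # NI
print(average_probabilities(probs))     # NI
print(highest_probability(probs))       # NI
```

The example is chosen so that the strategies disagree: hard voting ignores how confident each model is, while the probability-based strategies let one very confident model outweigh two uncertain ones.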
As we deal with author analysis, the results are reported at the author level. Nevertheless, some of
the described stages of our pipeline are performed at the document level. For example, the features
are compiled at the document level and then combined per user to produce a single vector
per user.
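As an illustration of moving from document-level to author-level features, the sketch below averages the feature vectors of all documents of each user. The element-wise mean is an assumption made for this example, since the text only states that document features are combined per user; `user_vectors` and the sample data are hypothetical.

```python
from collections import defaultdict


def user_vectors(doc_features):
    """Average document-level vectors per user.

    doc_features: iterable of (user_id, vector) pairs, where every vector
    is a list of floats of the same length.
    """
    sums = {}
    counts = defaultdict(int)
    for user, vec in doc_features:
        if user not in sums:
            sums[user] = [0.0] * len(vec)
        if len(vec) != len(sums[user]):
            raise ValueError("all vectors must have the same length")
        for i, value in enumerate(vec):
            sums[user][i] += value
        counts[user] += 1
    # Element-wise mean: one vector per user (assumed aggregation).
    return {u: [s / counts[u] for s in sums[u]] for u in sums}


docs = [
    ("u1", [1.0, 2.0]),
    ("u1", [3.0, 4.0]),
    ("u2", [0.5, 0.5]),
]
print(user_vectors(docs))  # {'u1': [2.0, 3.0], 'u2': [0.5, 0.5]}
```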
To extract the contextual sentence embeddings from BERT and RoBERTa, we fine-tune
the models with the IROSTEREO dataset and then take the representation of the [CLS] token
[21]. In order to find the best hyperparameters, we trained ten models for BERT and ten models
for RoBERTa. The hyperparameters are i) the weight decay, ii) the batch size, iii) the warm-up
speed, iv) the number of epochs, and v) the learning rate. This step is performed using the
Tree-structured Parzen Estimator (TPE) [22], a method for choosing hyperparameters based on
Bayesian reasoning and expected improvement.
Next, we train several neural networks for each feature set and for the combination of all
feature sets using a knowledge integration strategy. The hyperparameters evaluated include the shape
of the network, the dropout rate, the learning rate and the activation function. Table
2 shows the best hyperparameters for this task. It can be observed that the majority of the best
results are obtained with shallow neural networks, with two hidden layers but a large number
of neurons. The only exception is SE, which achieved its best result with 7 hidden layers and
27 neurons in a long funnel shape. In addition, all experiments achieved their best results with a
dropout of 0.3 and a learning rate of 0.010, using no activation function (linear). The
exception again is SE, which uses a smaller learning rate, a smaller dropout ratio and elu
as the activation function.
Table 2
Best hyper-parameters for each feature set trained separately and combined using knowledge integration (KI).
Feature set   shape         hidden layers   neurons   dropout   lr      activation
LF            brick         2               128       .3        0.010   linear
SE            long funnel   7               27        .1        0.001   elu
BF            brick         2               512       .3        0.010   linear
RF            brick         2               512       .3        0.010   linear
KI            brick         2               512       .3        0.010   linear
4. Results and analysis
First, we report the results achieved with our custom validation split. These results include the
precision, recall, and F1-score of each label, and the macro and weighted averages for the whole task.
We report the results of each feature set trained separately in Tables 3, 4, 5, and 6 for LF,
SE, BF and RF, respectively. The results for the KI strategy are reported in Table 7, and the results
for the four ensemble learning strategies in Tables 8, 9, 10, and 11 for hard voting, soft voting,
averaging probabilities and highest probability, respectively.
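For readers unfamiliar with these metrics, the sketch below computes per-label precision, recall and F1-score together with their macro and weighted averages. It is a generic illustration on toy labels, not the evaluation code used in this work; the function name and the example data are hypothetical.

```python
def classification_report(y_true, y_pred, labels):
    """Per-label precision/recall/F1 plus macro and weighted averages."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    per_label = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_label[label] = {"precision": precision, "recall": recall,
                            "f1": f1, "support": tp + fn}
    metrics = ("precision", "recall", "f1")
    # Macro average: every label counts the same.
    macro = {m: sum(per_label[l][m] for l in labels) / len(labels)
             for m in metrics}
    # Weighted average: every label counts in proportion to its support.
    weighted = {m: sum(per_label[l][m] * per_label[l]["support"]
                       for l in labels) / n
                for m in metrics}
    return per_label, macro, weighted


# Toy example: a degenerate classifier that always answers "I".
y_true = ["I"] * 3 + ["NI"] * 5
y_pred = ["I"] * 8
per_label, macro, weighted = classification_report(y_true, y_pred, ("I", "NI"))
print(per_label["I"])   # precision 0.375, recall 1.0
print(macro)            # recall 0.5
print(weighted)         # recall 0.375
```

With single-label predictions, the weighted-average recall equals the accuracy, so a constant classifier can reach 100% recall on one label while its overall accuracy is only the share of that label in the data.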
From the results achieved with the custom validation split, which are reported at the user level,
we can infer that determining whether a user is an irony and stereotype spreader is a relatively
easy task. It is worth mentioning that results at the document level are expected to be more limited.
Among the feature sets evaluated separately, the best results are achieved with BERT. However,
the limited results achieved with RoBERTa draw our attention (see Table 6). We observed that all the
incorrect predictions are from the I label, but the model reports the NI label. We also compared
the predictions of BF and RF and observed that the BF model outputs probabilities near
100%, whereas RF is less accurate.
Table 3
Classification report for LF.
               precision   recall    f1-score
I              94.737      97.297    96.000
NI             97.826      95.745    96.774
macro avg      96.281      96.521    96.387
weighted avg   96.465      96.429    96.433
Table 4
Classification report for SE.
               precision   recall    f1-score
I              100.000     94.595    97.222
NI             95.918      100.000   97.917
macro avg      97.959      97.297    97.569
weighted avg   97.716      97.619    97.611
Table 5
Classification report for BF.
               precision   recall    f1-score
I              97.297      97.297    97.297
NI             97.872      97.872    97.872
macro avg      97.585      97.585    97.585
weighted avg   97.619      97.619    97.619
Table 6
Classification report for RF.
               precision   recall    f1-score
I              64.286      48.649    55.385
NI             66.071      78.723    71.845
macro avg      65.179      63.686    63.615
weighted avg   65.285      65.476    64.594
Table 7
Classification report for KI.
               precision   recall    f1-score
I              97.297      97.297    97.297
NI             97.872      97.872    97.872
macro avg      97.585      97.585    97.585
weighted avg   97.619      97.619    97.619
Table 8
Classification report for EL (hard-voting).
               precision   recall    f1-score
I              97.297      97.297    97.297
NI             97.872      97.872    97.872
macro avg      97.585      97.585    97.585
weighted avg   97.619      97.619    97.619
Table 9
Classification report for EL (soft-voting).
               precision   recall    f1-score
I              97.297      97.297    97.297
NI             97.872      97.872    97.872
macro avg      97.585      97.585    97.585
weighted avg   97.619      97.619    97.619
Table 10
Classification report for EL (avg. probabilities).
               precision   recall    f1-score
I              97.222      94.595    95.890
NI             95.833      97.872    96.842
macro avg      96.528      96.233    96.366
weighted avg   96.445      96.429    96.423
Table 11
Classification report for EL (highest probability).
               precision   recall    f1-score
I              44.048      100.000   61.157
NI             0.000       0.000     0.000
macro avg      22.024      50.000    30.579
weighted avg   19.402      44.048    26.938
We can observe that the features based on pure linguistics also achieve results similar to the
ones obtained with state-of-the-art embeddings. The LF features include features related to
stylometry, lexis, social media jargon, and Part-of-Speech features. In order to gain insights
concerning the interpretability of the features, we calculate the Information Gain of the linguistic
features and normalize the ten features with the highest coefficient for the I and NI labels
(see Figure 1). It can be observed that the majority of the most discriminative features are related to
stylometry, including the number of words, the number of words per sentence, the usage of full
stops, and some readability formulas. There are two linguistic features concerning morphology:
the usage of interjections and the usage of words in singular.
[Figure 1: horizontal bar chart showing, for the I and NI labels on a 0% to 100% scale, the features
(STY) word-count, (MOR) interjections, (STY) word-four-letter, (STY) fulltops, (STY) words per sentence,
(STY) words in uppercase, (STY) inflesz, (MOR) number in singular, (STY) word-six-letter, and (LEX) others.]
Figure 1: Information gain of the ten linguistic features with the highest information gain
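As a reminder of what Information Gain measures, the following sketch computes it for one numeric feature split at a threshold: the entropy of the labels minus the weighted entropy of the two groups produced by the split. The thresholding step is an assumption made for this example (the text does not describe how continuous features are handled), and the feature values and labels are invented.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric feature at a threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (left, right) if part)
    return entropy(labels) - conditional


# Invented word counts for four users and their labels.
labels = ["NI", "NI", "I", "I"]
print(information_gain([10, 12, 30, 35], labels, threshold=20))  # 1.0
print(information_gain([10, 30, 12, 35], labels, threshold=20))  # 0.0
```

In the first call the threshold separates the two labels perfectly, so the gain equals the full label entropy of one bit; in the second call each side of the split is still evenly mixed, so nothing is gained.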
Because of these results, we submitted one run to this shared task based
on the Knowledge Integration strategy, achieving the fifth-best result, with an accuracy of 96.67%,
out of a total of 65 participants. We selected the Knowledge Integration strategy over the two
ensemble learning strategies that achieved the same results (hard and soft voting) because
Knowledge Integration has reported better results in other shared tasks in the past. Table
12 contains the best results along with the baselines proposed by the organizers. It is worth
mentioning that these results were obtained from TIRA [23], an Integrated Research Architecture
used by the IROSTEREO organizers for managing the execution of the participants’ algorithms.
5. Stereotype Stance Detection subtask
The organizers of the IROSTEREO shared task proposed a minor challenge that consisted in
determining whether the stereotypes are used in favor of the target or against them. For this,
they released a training dataset in which 94 authors were tagged against and 46 authors were
tagged in favor.
To solve this challenge, we utilized the same pipeline described for the main challenge. Our
results with our custom validation split are promising. We report the Knowledge Integration
strategy and the four ensemble learning strategies in Table 13. We achieved a macro F1-score of
82.8753% with the Knowledge Integration strategy and a macro F1-score of 78.5714% with the
ensemble learning based on soft-voting.
However, our results with the official leader board were limited. We achieved a macro F1-score
of 53.12% (F1 with the In Favor label of 25% and F1 of the Against label of 81.25%).
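The official macro F1-score follows directly from the two per-label scores, since the macro average is the unweighted mean over labels. The short check below uses only the figures reported above; the helper name is hypothetical.

```python
def macro_f1(per_label_f1):
    """Unweighted mean of the per-label F1-scores."""
    return sum(per_label_f1.values()) / len(per_label_f1)


official = {"In Favor": 25.0, "Against": 81.25}
print(macro_f1(official))  # 53.125, reported as 53.12%
```

Because both labels count equally, the low score on the minority In Favor label pulls the macro average down even though the Against label is predicted well.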
Table 12
Top results and baselines from the official leader board for the IROSTEREO 2022 shared task, ranked by
accuracy.
POS   Team                  Accuracy
1     wentaoyu              0.9944
2     harshv                0.9778
3     edapal                0.9722
3     ikae                  0.9722
5     UMUTEAM               0.9667
5     Enrub                 0.9667
      LDSE                  0.9389
      RF + char 2-ngrams    0.8610
      LR + word 1-ngrams    0.8490
      LSTM+Bert-encoding    0.6940
Table 13
Macro precision, recall and F1-score for the Stance detection subtask using the custom validation split.
KI stands for Knowledge Integration and EL for Ensemble Learning.
                             precision   recall   f1-score
KI                           93.478      78.571   82.875
EL - soft-voting             83.182      76.071   78.571
EL - hard-voting             91.667      71.429   75.455
EL - average probabilities   78.804      68.929   71.459
EL - highest probability     37.037      50.000   42.553
6. Conclusions and future work
This working note describes the participation of the UMUTeam in the IROSTEREO shared task
concerning author profiling. This is a binary classification task in which the participants are
challenged to identify which Twitter profiles are spreaders of Irony and Stereotypes. Our
proposal is grounded on the combination of several feature sets based on linguistic features
and sentence embeddings. We achieved promising results with our custom validation split and
a final accuracy of 96.67% on the official leader board.
One of the limitations of our work is the results achieved with RoBERTa (RF). Although
we searched for common errors in our pipeline, we could not identify the reason for the limited
results. To address this issue, we suggest combining document-level analysis with tools such
as SHAP [24] in order to find the reason for the wrong predictions. In addition, we obtained
wrong predictions with the highest probability strategy (see Table 11), as this ensemble always
outputs the I label (100% recall for I but 0% for NI). We suspect this issue is related to an error in
the code that generates the final report.
As future work, we will incorporate cross-validation and data augmentation techniques into our
pipeline to improve the generalization of our models.
Acknowledgments
This work is part of the research project LaTe4PSP (PID2019-107652RB-I00) funded by MCIN/
AEI/10.13039/501100011033. This work is also part of the research project PDC2021-121112-I00
funded by MCIN/AEI/10.13039/501100011033, by the European Union NextGenerationEU/PRTR,
and by “Programa para la Recualificación del Sistema Universitario Español 2021-2023”. In
addition, José Antonio García-Díaz is supported by Banco Santander and the University of
Murcia through the Doctorado Industrial programme.
References
[1] K. Buschmeier, P. Cimiano, R. Klinger, An impact analysis of features in a classification
approach to irony detection in product reviews, in: Proceedings of the 5th workshop on
computational approaches to subjectivity, sentiment and social media analysis, 2014, pp.
42–49.
[2] J. A. García-Díaz, R. Valencia-García, Compilation and evaluation of the spanish saticorpus
2021 for satire identification using linguistic features and transformers, Complex &
Intelligent Systems (2022) 1–14.
[3] J. Garmendia, Irony, Cambridge University Press, 2018.
[4] J. Hunter, Evaluating the Circumstances, John P. Hunter III, 2014. URL:
https://books.google.es/books?id=w7sYBAAAQBAJ.
[5] V. P. Maiorana, Preparation for Critical Instruction: How to Explain Subject Matter While
Teaching All Learners to Think, Read, and Write Critically, Rowman & Littlefield, 2016.
[6] N. Schwarz, A Deep Learning Model for Detecting Sarcasm in Written Product Reviews,
Master’s thesis, Interactive Media; FH Oberösterreich – Fakultät für informatik,
Kommunikation und Medien, 4232 Hagenberg, Austria, 2019.
[7] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl,
R. Ortega-Bueno, P. Pęzik, M. Potthast, et al., Overview of PAN 2022: Authorship
verification, profiling irony and stereotype spreaders, style change detection, and trigger
detection, in: European Conference on Information Retrieval, Springer, 2022, pp. 331–338.
[8] A. Chaudhary, S. A. Hayati, N. Otani, A. W. Black, What a sunny day: toward emoji
sensitive irony detection, W-NUT 2019 (2019) 212.
[9] H. Taslioglu, P. Karagoz, Irony detection on microposts with limited set of features, in:
Proceedings of the Symposium on Applied Computing, 2017, pp. 1076–1081.
[10] B. C. Wallace, Computational irony: A survey and new perspectives, Artificial intelligence
review 43 (2015) 467–483.
[11] J. Sánchez-Junquera, P. Rosso, M. Montes, B. Chulvi, et al., Masking and BERT-based models
for stereotype identification, Procesamiento del Lenguaje Natural 67 (2021) 83–94.
[12] F. Barbieri, H. Saggion, Modelling irony in twitter, in: Proceedings of the Student
Research Workshop at the 14th Conference of the European Chapter of the Association
for Computational Linguistics, 2014, pp. 56–64.
[13] R. T. Anchiêta, F. A. R. Neto, J. C. Marinho, K. V. do Nascimento, R. S. Moura, Piln IDPT
2021: Irony detection in portuguese texts with superficial features and embeddings, in:
Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the
Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII
International Conference of the Spanish Society for Natural Language Processing., Málaga,
Spain, September, 2021, volume 2943 of CEUR Workshop Proceedings, CEUR-WS.org, 2021,
pp. 917–924.
[14] C. Wu, F. Wu, S. Wu, J. Liu, Z. Yuan, Y. Huang, Thu_ngn at semeval-2018 task 3: Tweet
irony detection with densely connected lstm and multi-task learning, in: Proceedings of
The 12th International Workshop on Semantic Evaluation, 2018, pp. 51–56.
[15] R. Ortega-Bueno, B. Chulvi, F. Rangel, P. Rosso, E. Fersini, Profiling Irony and Stereotype
Spreaders on Twitter (IROSTEREO) at PAN 2022, in: CLEF 2022 Labs and Workshops,
Notebook Papers, CEUR-WS.org, 2022.
[16] J. A. García-Díaz, R. Colomo-Palacios, R. Valencia-García, Psychographic traits
identification based on political ideology: An author analysis study on Spanish politicians’ tweets
posted in 2020, Future Generation Computer Systems 130 (2022) 59–74.
[17] J. A. García-Díaz, S. M. Jiménez-Zafra, M. A. García-Cumbreras, R. Valencia-García,
Evaluating feature combination strategies for hate-speech detection in Spanish using linguistic
features and transformers, Complex & Intelligent Systems (2022) 1–22.
[18] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors
for 157 languages, CoRR abs/1802.06893 (2018). URL: http://arxiv.org/abs/1802.06893.
arXiv:1802.06893.
[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
(2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[21] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks,
in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[22] J. Bergstra, D. Yamins, D. Cox, Making a science of model search: Hyperparameter
optimization in hundreds of dimensions for vision architectures, in: International conference
on machine learning, PMLR, 2013, pp. 115–123.
[23] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture,
in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The
Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019.
doi:10.1007/978-3-030-22948-1_5.
[24] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances
in neural information processing systems 30 (2017).