UMUTeam at IROSTEREO: Profiling Irony and
Stereotype spreaders on Twitter combining Linguistic
Features with Transformers
José Antonio García-Díaz1 , Miguel Ángel Rodríguez-García2 ,
Francisco García-Sánchez1 and Rafael Valencia-García1
1 Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100, Spain
2 Departamento de Ciencias de la Computación, Universidad Rey Juan Carlos, 28933 Madrid, Spain


                                         Abstract
Irony is a curious mode of communication in which the speaker says something that is meant to be interpreted in the opposite sense. Its automatic detection is a very challenging task due to its complex interpretation, and it has significant potential for various text mining applications. Social media platforms like Twitter offer a valuable opportunity to analyze this literary technique, since users frequently employ it when expressing their opinions. In this working note, we describe the contribution designed for PAN's shared author profiling task and its subtask concerning Stereotype Stance Detection. The former consists in determining whether authors spread irony and stereotypes, while the latter is focused on stereotypes that can hurt vulnerable groups. The organizers provide a dataset compiled from Twitter to carry out the task. In particular, we propose a supervised learning pipeline consisting of a combination of deep learning techniques that exploits contextual and non-contextual embeddings to address the binary classification. The resulting system reaches promising results, achieving the fifth-best score in the main task with an accuracy of 96.67%.

                                         Keywords
                                         Author Profiling, Irony and Stereotypes, Stance detection, Feature Engineering, Deep Learning, Trans-
                                         formers, Natural Language Processing




1. Introduction
With the proliferation of social media, irony has become one of the most frequently used literary devices in this communication medium [1]. Several definitions of irony have been provided in the literature, but they concur on the same binary classification: verbal and situational irony. The former is conceived as the act of using words that mean the opposite of what one thinks, particularly to be funny [2, 3]. The latter is defined as a strange or funny situation in which things happen in a way that seems to be the opposite of what was expected [4, 5]. Both definitions highlight one of the primary features of this rhetorical device: making something understandable by expressing the opposite [3]. Such rhetorical complexity

CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
Email: joseantonio.garcia8@um.es (J. A. García-Díaz); miguel.rodriguez@urjc.es (M. Á. Rodríguez-García);
frgarcia@um.es (F. García-Sánchez); valencia@um.es (R. Valencia-García)
ORCID: 0000-0002-3651-2660 (J. A. García-Díaz); 0000-0001-6244-653 (M. Á. Rodríguez-García); 0000-0003-2667-5359
(F. García-Sánchez); 0000-0003-2457-1791 (R. Valencia-García)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
makes dialogue occasionally arduous for humans to comprehend [6]. This challenge has
attracted the research community's attention: in recent years, several approaches have been
published that address the detection of irony in natural language text obtained from different
social media sources. In particular, we focus on identifying whether an author spreads
irony and stereotypes for the PAN shared challenge [7].
   Irony and stereotype identification is an essential task in social media applications, since it
enables the detection of online abuse and harassment [8]. Their automatic detection in written
discourse is a complex task to which traditional text mining methods cannot be applied successfully
[9]. The drawback of these conventional methods is that identification requires semantics that
cannot be inferred from word counts computed from document analysis [10]. To overcome this
deficiency, more sophisticated machine learning methods have been applied to the problem;
although the results obtained are quite competitive, there is still scope for improvement [11].
   In this working note, we face the proposed author profiling challenge by constructing
a supervised classification pipeline. The method comprises four stages: a pre-processing stage to
clean the provided dataset; a feature extraction stage, in which contextual and non-contextual
embeddings are compiled; a training stage, in which several machine learning models are trained;
and, finally, an evaluation stage, in which the designed models are evaluated.
   The remainder of this working note is organized as follows: Section 2 provides a brief review
of related work, examining the distinct approaches proposed in the literature to address this
challenge. Section 3 specifies the methods developed for addressing the challenge. In
Section 4 the results achieved in the challenge are presented. In addition, we report separately on
our participation in the Stereotype Stance Detection subtask in Section 5. Finally, Section
6 summarizes the findings of this work and outlines some future lines of research.


2. Related work
Due to the complexity of recognizing verbal irony in natural language text, we can find
different approaches in the literature, ranging from simple strategies to more complex ones. Barbieri
and Saggion [12] proposed two tree-based classifiers, Random Forest and Decision Tree. They
represent each tweet using the following seven groups of features: i) frequency, to analyze
the gap between rare and common words used; ii) written-spoken, to capture the
users' style; iii) intensity, to measure the strength of adverbs and adjectives; iv) structure, which
analyzes length, punctuation and emoticons; v) sentiments, which uses SentiWordNet to measure
the gap between positive and negative terms; vi) synonyms, comparing common vs. rare
synonyms used; and, finally, vii) ambiguity, to analyze possible ambiguities. Furthermore,
this approach explores the usage of a bag-of-words representation based on frequency analysis.
Anchiêta et al. [13] proposed two differentiated strategies. First, they
combined Term Frequency–Inverse Document Frequency (TF–IDF) and a linear Support Vector Machine
(SVM): the former was used to extract features from the datasets, and the latter was the
classifier used for the identification task, trained with the Stochastic
Gradient Descent (SGD) technique. Second, they combined embeddings created with the
Distributed Bag of Words Paragraph Vector model and a Multi-Layer Perceptron (MLP) to
tackle the classification task. At a different level of complexity, Wu et al. [14] proposed
the Dense-LSTM model, based on a densely connected LSTM network with a multi-task learning
strategy. It comprises an embedding layer to convert the input tweets into a sequence of dense
vectors and four concatenated Bi-LSTM layers with 200-dimensional hidden states to learn different
levels of information simultaneously. Furthermore, they concatenate and combine two different
pre-trained word embeddings.


3. Methodology
The IROSTEREO challenge consists of a binary classification task from an author profiling perspective.
The dataset proposed for this task is compiled from Twitter. The training dataset comprises a total of
420 different users, grouped into those who are irony and stereotype spreaders
(I) and those who are not (NI). For each user, 200 of their tweets written in English are provided
[15]. We separated a small subset of the training dataset to perform a custom validation. The
statistics of the dataset are depicted in Table 1.

Table 1
IROSTEREO dataset’s users
                                             train val total
                                     I         166   44   210
                                     NI        176   34   210
                                     TOTAL     342   78   420

   We followed a typical supervised classification pipeline to solve the proposed task. We
started by applying a pre-processing stage to the dataset. Then, we compiled the feature sets, trained
several machine learning models, and evaluated them using a custom validation split.
   The pre-processing stage consists in creating an alternative version of the documents by
encoding them in lowercase and removing mentions, hyperlinks, digits, punctuation, and expressive
lengthening. Besides, we expand texting language and fix misspellings. The alternative version
is used to extract the majority of the feature sets, namely those based on sentence embeddings
and linguistic features.
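The cleaning steps above can be sketched as follows. This is a minimal illustrative approximation: the exact rules and their order in the original pipeline are not published, and the texting-language expansion and misspelling correction steps are omitted here.

```python
import re

def preprocess(text: str) -> str:
    """Build the alternative document version: lowercase, then strip
    mentions, hyperlinks, digits, expressive lengthening and punctuation."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)          # mentions
    text = re.sub(r"https?://\S+", " ", text)  # hyperlinks
    text = re.sub(r"\d+", " ", text)           # digits
    # collapse expressive lengthening: "soooo" -> "soo" (at most two repeats)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace
```

For instance, `preprocess("Soooo #GOOD!!! visit https://t.co/x @user 123")` yields `"soo good visit"`.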
   The feature sets involved in our experimentation consist of linguistic features from UMUTextStats
(LF) [16, 17] and three types of sentence embeddings: non-contextual sentence embeddings
from fastText (SE) [18], and two contextual embeddings from BERT (BF) [19] and RoBERTa
(RF) [20]. These feature sets were used separately and combined using two approaches: one
based on knowledge integration, and another based on ensemble learning. For the ensemble
learning, we evaluate four strategies: i) soft voting, ii) hard voting, iii) average probabilities,
and iv) highest probability. The hard voting strategy is a weighted mode, with the
weights based on the F1-score results on the custom validation split.
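The four ensemble strategies over the per-model probabilities of the I label can be sketched as below. This is an illustrative reading: the text only specifies that hard voting weights the votes by validation F1-score, so the weighting of soft voting and the 0.5 decision threshold are assumptions.

```python
def hard_vote(probs, weights):
    """Weighted mode: each model votes for its own predicted label,
    and votes are weighted by the model's validation F1-score."""
    votes = {"I": 0.0, "NI": 0.0}
    for p, w in zip(probs, weights):
        votes["I" if p >= 0.5 else "NI"] += w
    return max(votes, key=votes.get)

def soft_vote(probs, weights):
    """F1-weighted average of the probabilities (weighting assumed)."""
    s = sum(p * w for p, w in zip(probs, weights)) / sum(weights)
    return "I" if s >= 0.5 else "NI"

def average_probabilities(probs):
    """Unweighted mean of the per-model probabilities."""
    return "I" if sum(probs) / len(probs) >= 0.5 else "NI"

def highest_probability(probs):
    """Trust the single most confident model (farthest from 0.5)."""
    most_confident = max(probs, key=lambda p: abs(p - 0.5))
    return "I" if most_confident >= 0.5 else "NI"
```

Note that the last strategy can be dominated by one over-confident model, which is consistent with its degenerate behavior reported in Table 11.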
   As we deal with author analysis, the results are reported at the author level. Nevertheless, some of
the described stages of our pipeline are performed at the document level. For example, the features
are compiled at the document level and then combined for each user to produce a unique vector
per user.
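The document-to-author combination can be sketched as mean pooling. The exact combination operator is not specified in the text, so the mean is an assumption here; it is a common choice for aggregating per-tweet vectors into a user profile.

```python
def user_vector(doc_vectors):
    """Combine the per-tweet feature vectors of one author into a single
    profile vector via mean pooling (the combination operator is assumed)."""
    n = len(doc_vectors)
    dim = len(doc_vectors[0])
    return [sum(v[i] for v in doc_vectors) / n for i in range(dim)]
```

For a user with 200 tweets, `doc_vectors` holds 200 rows and the result is one vector of the same dimensionality.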
   To extract the contextual sentence embeddings from BERT and RoBERTa, we fine-tuned
the models with the IROSTEREO dataset and then obtained the value of the [CLS] token
[21]. To find the best hyperparameters, we trained ten models for BERT and ten models
for RoBERTa. The hyperparameters are i) the weight decay, ii) the batch size, iii) the warm-up
speed, iv) the number of epochs, and v) the learning rate. This step is performed using the Tree of
Parzen Estimators (TPE) [22], a method for choosing hyperparameters based on
Bayesian reasoning and expected improvement.
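The search loop over the five hyperparameters above can be sketched as follows. As a stand-in for TPE (which is usually provided by an optimization library), this sketch samples configurations uniformly at random; TPE replaces that uniform sampling with a model that favors regions near previously good trials. The value ranges are illustrative, not the ones used in the experiments.

```python
import random

# Search space for the five hyper-parameters named above
# (illustrative ranges, not the original ones).
SPACE = {
    "weight_decay": [0.0, 0.01, 0.1],
    "batch_size": [8, 16, 32],
    "warmup_ratio": [0.0, 0.06, 0.1],
    "epochs": [2, 3, 4],
    "learning_rate": [1e-5, 2e-5, 5e-5],
}

def sample_config(rng):
    """Draw one configuration uniformly from the space."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

def search(objective, trials=10, seed=0):
    """Plain random search: evaluate `trials` sampled configurations and
    keep the one with the lowest objective (e.g. validation loss)."""
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(trials)]
    return min(configs, key=objective)
```

With ten trials per model, as in the text, each call to `search` returns the best of ten evaluated configurations.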
   Next, we trained several neural networks for each feature set and for the combination of all
feature sets using a knowledge integration strategy. The hyperparameters explored include the shape
of the network, the dropout mechanism, the learning rate, and the activation function. Table
2 depicts the best hyperparameters for this task. It can be observed that the majority of the best
results are obtained with shallow neural networks, with two hidden layers but a large number
of neurons. The only exception is SE, which achieved its best result with 7 hidden layers and
27 neurons in a long funnel shape. Besides, all experiments achieved better results with high
dropout ratios and a learning rate of 0.010 using no activation function (linear). The
exception, again, is SE, which uses a smaller learning rate, a smaller dropout ratio, and ELU
as the activation function.

Table 2
Best hyper-parameters for each feature set trained separately and combined using knowledge integration.
           Feature set shape          hidden layers neurons dropout           lr activation
           LF           brick                     2       128         .3   0.010   linear
           SE           long funnel               7        27         .1   0.001   elu
           BF           brick                     2       512         .3   0.010   linear
           RF           brick                     2       512         .3   0.010   linear
           KI           brick                     2       512         .3   0.010   linear
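The shape names in Table 2 describe the width profile of the hidden layers. The helper below encodes our reading of this terminology (it is not code from the original system): a "brick" keeps every hidden layer at the same width, while a "long funnel" narrows steadily towards the output.

```python
def layer_widths(shape, hidden_layers, neurons):
    """Hidden-layer widths implied by the shape names in Table 2
    (our interpretation of the terminology, for illustration only)."""
    if shape == "brick":
        # all hidden layers share the same width
        return [neurons] * hidden_layers
    if shape == "long funnel":
        # widths shrink steadily from the input side towards the output
        step = max(1, neurons // hidden_layers)
        return [max(1, neurons - i * step) for i in range(hidden_layers)]
    raise ValueError(f"unknown shape: {shape}")
```

For example, the BF row of Table 2 corresponds to `layer_widths("brick", 2, 512)`, i.e. two hidden layers of 512 neurons each.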




4. Results and analysis
First, we report the results achieved with our custom validation split. These results include the
label’s precision, recall, and F1-score, and the macro and weighted average of the whole task.
We report the results of the best feature set trained separately in Tables 3, 4, 5, and 6 for LF,
SE, BF and RF respectively. The results for the KI strategy in Table 7, and the results for the
four strategies using ensemble learning in Tables 8, 9, 10, and 11 for hard voting, soft voting,
averaging probabilities and highest probability, respectively.
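The macro and weighted averages in the tables derive from the per-label scores in the standard way, sketched below. The label supports used in the example are not stated in the tables; the values 37 and 47 are inferred from the reported recalls in Table 3 (e.g. 97.297 = 36/37).

```python
def macro_avg(scores):
    """Unweighted mean over labels."""
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    """Mean over labels weighted by the number of examples per label."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)
```

Applying these to the F1-scores of Table 3 (96.000 for I, 96.774 for NI, with inferred supports 37 and 47) reproduces the reported averages of 96.387 and 96.433.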
   From the results achieved with the custom validation split, which are reported at the user level,
we can assume that determining whether a user is an irony and stereotype spreader is a somewhat
trivial task. It is worth mentioning that results at the document level would be more limited.
Among the feature sets trained separately, the best results are achieved with BERT. However, the
limited results achieved with RoBERTa draw our attention (see Table 6). We observed that all the
incorrect predictions belong to the I label, for which the model reports the NI label. We also compared
the predictions of BF and RF and observed that the BF model outputs probabilities near
100%, whereas RF is less confident.
  We can observe that the features based on pure linguistics achieve results similar to those
obtained with state-of-the-art embeddings. The LF features include features related to


Table 3                                                       Table 4
Classification report for LF.                                 Classification report for SE.
                 precision      recall f1-score                                precision       recall f1-score
 I                  94.737      97.297   96.000                I                  100.000 94.595        97.222
 NI                 97.826      95.745   96.774                NI                  95.918 100.000       97.917
 macro avg          96.281      96.521   96.387                macro avg           97.959 97.297        97.569
 weighted avg       96.465      96.429   96.433                weighted avg        97.716 97.619        97.611

Table 5                                                       Table 6
Classification report for BF.                                 Classification report for RF.
                 precision      recall f1-score                                precision      recall f1-score
 I                  97.297      97.297   97.297                I                    64.286   48.649   55.385
 NI                 97.872      97.872   97.872                NI                   66.071   78.723   71.845
 macro avg          97.585      97.585   97.585                macro avg            65.179   63.686   63.615
 weighted avg       97.619      97.619   97.619                weighted avg         65.285   65.476   64.594

Table 7                                                       Table 8
Classification report for KI.                                 Classification report for EL (hard-voting).
                 precision      recall f1-score                                precision      recall f1-score
 I                  97.297      97.297   97.297                I                    97.297   97.297   97.297
 NI                 97.872      97.872   97.872                NI                   97.872   97.872   97.872
 macro avg          97.585      97.585   97.585                macro avg            97.585   97.585   97.585
 weighted avg       97.619      97.619   97.619                weighted avg         97.619   97.619   97.619

Table 9                                                       Table 10
Classification report for EL (soft-voting).                   Classification report for EL (avg. probabilities).
                 precision      recall f1-score
 I                  97.297      97.297   97.297                I                    97.222   94.595   95.890
 NI                 97.872      97.872   97.872                NI                   95.833   97.872   96.842
 macro avg          97.585      97.585   97.585                macro avg            96.528   96.233   96.366
 weighted avg       97.619      97.619   97.619                weighted avg         96.445   96.429   96.423

                                Table 11
Classification report for EL (highest probability).
                                                  precision    recall f1-score
                                 I                  44.048 100.000       61.157
                                 NI                  0.000   0.000        0.000
                                 macro avg          22.024 50.000        30.579
                                 weighted avg       19.402 44.048        26.938
stylometry, lexis, social media jargon, and Part-of-Speech features. To gain insights into
the interpretability of the features, we calculated the Information Gain of the linguistic
features and normalized the top ten features with the best coefficients for the I and NI labels
(see Figure 1). It can be observed that the majority of the most discerning features are related to
stylometry, including the number of words, the number of words per sentence, the usage of full
stops, and some readability formulas. Two linguistic features concern morphology:
the usage of interjections and the usage of words in the singular.
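The ranking above relies on the standard entropy-based definition of Information Gain, sketched below for a single discretized feature (a minimal self-contained version; the implementation details of the original analysis are not published).

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(feature, labels):
    """IG of one discretized feature with respect to the I/NI labels:
    H(labels) minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

A feature that perfectly separates I from NI scores 1 bit; one that is independent of the labels scores 0.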


[Figure 1 is a horizontal bar chart (0% to 100%) comparing, for the I and NI labels, the normalized information gain of ten features: (STY) word-count, (MOR) interjections, (STY) word-four-letter, (STY) fullstops, (STY) words per sentence, (STY) words in uppercase, (STY) inflesz, (MOR) number in singular, (STY) word-six-letter, and (LEX) others.]




Figure 1: Normalized information gain of the ten features with the highest information gain for the I and NI labels.


   Based on the results achieved, we submitted one run for this shared task, based
on the Knowledge Integration strategy, achieving the fifth-best result, with an accuracy of 96.67%,
out of a total of 65 participants. We selected the Knowledge Integration strategy over the two
ensemble learning strategies that achieved the same validation results (hard and soft voting) because
Knowledge Integration has reported better results in other shared tasks in the past. Table
12 contains the best results along with the baselines proposed by the organizers. It is worth
mentioning that these results were obtained through TIRA [23], an integrated research architecture
used by the IROSTEREO organizers to manage the execution of the participants' algorithms.
5. Stereotype Stance Detection subtask
The organizers of the IROSTEREO shared task proposed a subtask that consists in
determining whether stereotypes are used in favor of the target group or against it. For this,
they released a training dataset in which 94 authors were tagged as against and 46 authors were
tagged as in favor.
   To solve this challenge, we used the same pipeline described for the main challenge. Our
results with our custom validation split are promising. We report the Knowledge Integration
strategy and the four ensemble learning strategies in Table 13. We achieved a macro F1-score of
82.875% with the Knowledge Integration strategy and a macro F1-score of 78.571% with the
ensemble learning based on soft voting.
   However, our results on the official leaderboard were limited: we achieved a macro F1-score
of 53.12% (F1 of 25% for the In Favor label and 81.25% for the Against label).


6. Conclusions and future work
This working note describes the participation of the UMUTeam in the IROSTEREO shared task
concerning author profiling. This is a binary classification task in which the participants are


Table 12
Top results and baselines from the official leader board for the IROSTEREO 2022 shared task, ranked by
accuracy
                                POS Team                         Accuracy
                                   1   wentaoyu                     0.9944
                                   2   harshv                       0.9778
                                   3   edapal                       0.9722
                                   3   ikae                         0.9722
                                   5   UMUTEAM                     0.9667
                                   5   Enrub                        0.9667
                                       LDSE                         0.9389
                                       RF + char 2-ngrams           0.8610
                                       LR + word 1-ngrams           0.8490
                                       LSTM+Bert-encoding           0.6940


Table 13
Macro precision, recall and F1-score for the Stance detection subtask using the custom validation split.
KI stands for Knowledge Integration and EL for Ensemble Learning
                                                     precision    recall f1-score
                        KI                             93.478    78.571      82.875
                        EL - soft-voting               83.182    76.071      78.571
                        EL - hard-voting               91.667    71.429      75.455
                        EL - average probabilities     78.804    68.929      71.459
                        EL - highest probability       37.037    50.000      42.553
challenged to identify which Twitter profiles are spreaders of irony and stereotypes. Our
proposal is grounded on the combination of several feature sets based on linguistic features
and sentence embeddings. We achieved promising results with our custom validation split and
a final accuracy of 96.67% on the official leaderboard.
  One limitation of our work is the results achieved with RoBERTa (RF). Although
we searched for common errors in our pipeline, we could not identify the reason for the limited
results. To address this issue, we suggest combining document-level analysis with tools such
as SHAP [24] in order to find the cause of the wrong predictions. Besides, we obtained
wrong predictions with the highest probability strategy (see Table 11), as this ensemble
always outputs the I label (achieving 100% recall for I but never predicting NI). We suspect this issue
is related to an error in the code that generates the final report.
  As future work, we will incorporate cross-validation and data augmentation techniques into our
pipeline to increase our models' generalization.


Acknowledgments
This work is part of the research project LaTe4PSP (PID2019-107652RB-I00) funded by MCIN/
AEI/10.13039/501100011033. This work is also part of the research project PDC2021-121112-I00
funded by MCIN/AEI/10.13039/501100011033, by the European Union NextGenerationEU/PRTR,
and by “Programa para la Recualificación del Sistema Universitario Español 2021-2023”. In
addition, José Antonio García-Díaz is supported by Banco Santander and the University of
Murcia through the Doctorado Industrial programme.


References
 [1] K. Buschmeier, P. Cimiano, R. Klinger, An impact analysis of features in a classification
     approach to irony detection in product reviews, in: Proceedings of the 5th workshop on
     computational approaches to subjectivity, sentiment and social media analysis, 2014, pp.
     42–49.
 [2] J. A. García-Díaz, R. Valencia-García, Compilation and evaluation of the spanish saticorpus
     2021 for satire identification using linguistic features and transformers, Complex &
     Intelligent Systems (2022) 1–14.
 [3] J. Garmendia, Irony, Cambridge University Press, 2018.
 [4] J. Hunter, Evaluating the Circumstances, John P. Hunter III, 2014. URL: https://books.
     google.es/books?id=w7sYBAAAQBAJ.
 [5] V. P. Maiorana, Preparation for Critical Instruction: How to Explain Subject Matter While
     Teaching All Learners to Think, Read, and Write Critically, Rowman & Littlefield, 2016.
 [6] N. Schwarz, A Deep Learning Model for Detecting Sarcasm in Written Product Reviews,
     Master’s thesis, Interactive Media; FH Oberösterreich – Fakultät für informatik, Kommu-
     nikation und Medien, 4232 Hagenberg, Austria, 2019.
 [7] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl,
     R. Ortega-Bueno, P. Pęzik, M. Potthast, et al., Overview of pan 2022: Authorship ver-
     ification, profiling irony and stereotype spreaders, style change detection, and trigger
     detection, in: European Conference on Information Retrieval, Springer, 2022, pp. 331–338.
 [8] A. Chaudhary, S. A. Hayati, N. Otani, A. W. Black, What a sunny day: toward emoji
     sensitive irony detection, W-NUT 2019 (2019) 212.
 [9] H. Taslioglu, P. Karagoz, Irony detection on microposts with limited set of features, in:
     Proceedings of the Symposium on Applied Computing, 2017, pp. 1076–1081.
[10] B. C. Wallace, Computational irony: A survey and new perspectives, Artificial intelligence
     review 43 (2015) 467–483.
[11] J. Sánchez-Junquera, P. Rosso, M. Montes, B. Chulvi, et al., Masking and bert-based models
     for stereotype identification, Procesamiento del Lenguaje Natural 67 (2021) 83–94.
[12] F. Barbieri, H. Saggion, Modelling irony in twitter, in: Proceedings of the Student
     Research Workshop at the 14th Conference of the European Chapter of the Association
     for Computational Linguistics, 2014, pp. 56–64.
[13] R. T. Anchiêta, F. A. R. Neto, J. C. Marinho, K. V. do Nascimento, R. S. Moura, Piln IDPT
     2021: Irony detection in portuguese texts with superficial features and embeddings, in:
     Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the
     Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII
     International Conference of the Spanish Society for Natural Language Processing., Málaga,
     Spain, September, 2021, volume 2943 of CEUR Workshop Proceedings, CEUR-WS.org, 2021,
     pp. 917–924.
[14] C. Wu, F. Wu, S. Wu, J. Liu, Z. Yuan, Y. Huang, Thu_ngn at semeval-2018 task 3: Tweet
     irony detection with densely connected lstm and multi-task learning, in: Proceedings of
     The 12th International Workshop on Semantic Evaluation, 2018, pp. 51–56.
[15] R. Ortega-Bueno, B. Chulvi, F. Rangel, P. Rosso, E. Fersini, Profiling Irony and Stereotype
     Spreaders on Twitter (IROSTEREO) at PAN 2022, in: CLEF 2022 Labs and Workshops,
     Notebook Papers, CEUR-WS.org, 2022.
[16] J. A. García-Díaz, R. Colomo-Palacios, R. Valencia-García, Psychographic traits identifica-
     tion based on political ideology: An author analysis study on spanish politicians’ tweets
     posted in 2020, Future Generation Computer Systems 130 (2022) 59–74.
[17] J. A. García-Díaz, S. M. Jiménez-Zafra, M. A. García-Cumbreras, R. Valencia-García, Evalu-
     ating feature combination strategies for hate-speech detection in spanish using linguistic
     features and transformers, Complex & Intelligent Systems (2022) 1–22.
[18] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors
     for 157 languages, CoRR abs/1802.06893 (2018). URL: http://arxiv.org/abs/1802.06893.
     arXiv:1802.06893.
[19] J. D. M.-W. C. Kenton, L. K. Toutanova, Bert: Pre-training of deep bidirectional transformers
     for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
     anov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
     (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[21] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
     in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Process-
     ing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[22] J. Bergstra, D. Yamins, D. Cox, Making a science of model search: Hyperparameter opti-
     mization in hundreds of dimensions for vision architectures, in: International conference
     on machine learning, PMLR, 2013, pp. 115–123.
[23] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture,
     in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The
     Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/
     978-3-030-22948-1\_5.
[24] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances
     in neural information processing systems 30 (2017).