Identifying Author Profiles Containing Irony or Spreading Stereotypes with SBERT and Emojis
Notebook for PAN at CLEF 2022

Narjes Tahaei, Harsh Verma, Parsa Bagherzadeh, Farhood Farahnak, Nadia Sheikh and Sabine Bergler
Concordia University, Montreal

Abstract
The Profiling Irony and Stereotype Spreaders on Twitter Shared Task at CLEF 2022 asks to analyze a set of tweets from an author in order to determine whether the author spreads irony and stereotypes. Our approach is to feed all tweets from an author to SBERT as a single input batch and feed all output vectors from the model to an additive attention layer to produce a vector representation per author. This vector representation is the input to a linear binary classifier. We placed second among 40 participants with a test accuracy of 97.8%. We selected the best model using 5-fold cross-validation.

Keywords: SBERT, Tokenization, Emojis

1. Introduction
Irony is the expression of one's meaning by using language that normally signifies the opposite [1]. A stereotype is a fixed, over-generalized belief about a particular group that can propagate false biases regarding that group [2]. Twitter is a widely used communication platform with a high percentage of tweets using irony and stereotypes. The Profiling Irony and Stereotype Spreaders on Twitter Shared Task [3] is to classify an author as someone who spreads irony and stereotypes from so-called Profiles of 200 tweets for each of 420 authors.

Current best practice for NLP tasks is to feed sentence input to a BERT-like pre-trained language model [4] and use the CLS output token as input to a classifier. If this task were to classify individual tweets as containing ironic or stereotypical content, this would be a satisfactory approach and would be expected to perform well. But this task calls for the classification of Profiles with 200 tweets each, with each tweet providing important context. We thus batch the data with batch size 200 to model the Profiles. We fine-tune SBERT [5] to construct individual vector representations for each tweet from an author, followed by an additive attention layer that calculates a vector representation for each Profile. Profile vectors feed into a linear classifier.

We submitted two runs which differ only in the number of epochs for which the model ran: narcis ran for six epochs and obtained rank 10 with an accuracy of 0.93; harshv ran for seven epochs and obtained rank 2 with an accuracy of 0.98 [6]. The results were submitted to the TIRA platform [7].

2. Related Work
The authors of the top-ranked system on SemEval15 Task 11 on Sentiment Analysis of Figurative Language in Twitter (Task 11) [8, 9] did not make any adaptations to a sentiment analysis pipeline aside from training on the training data. This suggests that sentiment (which is often inverted in ironic language) co-occurs with other cues that are strong enough for a normal sentiment analysis system to succeed.
In [10], the authors applied supervised machine learning to a set of Twitter corpora that had previously been used for irony and sarcasm detection, in order to identify ironic tweets. Different groups of features were used. Structural features were used to detect common patterns of ironic tweets, including length, type of punctuation, and emoticons. Other features captured information about affect, such as semantic lexicons and dictionaries of affective terms. The affective features performed best at distinguishing between ironic and non-ironic tweets in all Twitter corpora.

The authors in [11] propose a model to detect hate spreaders on Twitter. BERTweet is used to embed each tweet into a vector representation. Two methods are used to construct an author's profile. First, a graph neural network is used to relate information from different posts to their authors. Second, a sequence of encoded tweets is given to an additive attention-based, fully connected neural network. The single vector resulting from adding all weighted vectors is used by a classifier. To classify whether an author is a hate spreader or not, a version of the Impostors method is introduced. In [12], the Impostors method is used to determine whether two documents are from the same author. The method checks whether X (from a hate spreader set) is closer to Y (from a non-hate spreader set) than to each one of the impostors, where closeness is measured by a similarity function. The system proved useful for the hate speech spreaders identification shared task at PAN 2021. The second method, a sequence-based Profile modeling, used a similar additive attention approach: the additive attention is first applied to obtain a vector representation for each tweet, and these representations are then added together to form a Profile representation. In contrast, we have all tweets of an author available for fine-tuning, and it is the additive attention layer that gives us a representation for a Profile.

3. Data
The training dataset consists of 420 XML files, corresponding to authors, and each XML file contains 200 tweets from its author. The dataset contains anonymized URLs, hashtags, and user mentions: they are replaced with tags like #URL#, which are frequently repeated at the beginning or end of tweets. These repetitions are retained. Tweets also frequently contain emojis; in fact, many tweets contain many different emojis and even repetitions of the same emoji. Emojis are retained.

4. Model
For comparison, we feed the data to two classifiers, an SVM classifier using authors' TF-IDF vectors and a fine-tuned SBERT model. For the SVM baseline, we construct TF-IDF vectors for all Profiles. Tokens that occur fewer than 30 times in the whole corpus are removed from the vocabulary. The final size of the vocabulary is 4128. The SVM baseline yields 86% accuracy on our evaluation set.
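To illustrate this baseline, the following is a minimal sketch, not our exact implementation: the directory layout, the truth-file format, and the use of scikit-learn's min_df as an approximation of the 30-occurrence cutoff are assumptions.

```python
# Sketch of a TF-IDF + SVM Profile baseline. Assumptions: one XML file per author whose
# <document> elements hold the tweets, and a truth file with "author_id:::label" lines;
# min_df only approximates the "fewer than 30 occurrences in the corpus" cutoff.
import glob
import os
import xml.etree.ElementTree as ET

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def load_profiles(data_dir):
    """Map author id -> one Profile string (all 200 tweets joined)."""
    profiles = {}
    for path in glob.glob(os.path.join(data_dir, "*.xml")):
        author = os.path.splitext(os.path.basename(path))[0]
        tweets = [d.text or "" for d in ET.parse(path).getroot().iter("document")]
        profiles[author] = " ".join(tweets)
    return profiles

def load_labels(truth_path):
    """Map author id -> 1 for irony/stereotype spreaders ("I"), 0 otherwise."""
    with open(truth_path, encoding="utf-8") as fh:
        pairs = (line.strip().split(":::") for line in fh if line.strip())
        return {author: int(label == "I") for author, label in pairs}

profiles = load_profiles("pan22-training/en")              # hypothetical paths
labels = load_labels("pan22-training/en/truth.txt")
authors = sorted(profiles)
X = [profiles[a] for a in authors]
y = [labels[a] for a in authors]

baseline = make_pipeline(TfidfVectorizer(min_df=30), LinearSVC())
print("5-fold accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
```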
Our main model is based on SBERT [5], a modification of BERT that is fine-tuned for sentence representations on NLI data to produce semantically meaningful sentence embeddings that can be compared using cosine similarity. SBERT adds a pooling layer on top of BERT's 12 transformer layers, which produces a fixed-size sentence embedding. For fine-tuning, siamese and triplet network structures are used to update the weights. During training on the task of finding the most similar sentence pairs, each sentence in a pair is input to the network, which outputs two sentence embeddings; the similarity of these embeddings is then computed as the cosine similarity between the vectors.

The resulting sentence embeddings can be used for different tasks. SBERT used mean pooling for the semantic similarity task [5]. We experimented with aggregating the CLS token outputs with mean pooling, max pooling, or additive attention. For mean pooling and max pooling, we compute the mean and maximum of all 200 CLS outputs for a Profile. The additive attention on the 200 CLS outputs outperformed the other two methods.

Each training sample corresponds to a Profile and comprises 200 tweets. We use SBERT and input tweets with a batch size of 200:

$H = [h_{CLS_1}; \ldots; h_{CLS_{200}}] = \mathrm{SBERT}(Tweet_1, \ldots, Tweet_{200})$   (1)

where $H = [h_{CLS_1}; \ldots; h_{CLS_{200}}]$ is the row-level concatenation of the [CLS] representations of the 200 Profile tweets. To obtain a single representation for the author's Profile we then use additive attention:

$h = \mathrm{softmax}(W_{att} H^T) H$   (2)

where $W_{att} \in \mathbb{R}^{1 \times 384}$ is a learnable parameter. The additive attention assigns importance weights to tweets and provides a weighted sum of the [CLS] representations. The Profile representation $h$ is then used for the final classification:

$p = \mathrm{softmax}(h W + b)$   (3)

where $W \in \mathbb{R}^{384 \times 2}$ is a linear transformation that characterizes the classifier and $p \in \mathbb{R}^2$ represents the class probabilities (note that the task is a binary classification problem). We calculate the loss using cross-entropy and optimize the network using the Adam [13] optimizer with $lr = 5e{-6}$.

4.1. Input representations
There are two ways to input the data into the model. One way is to assign the Profile's label to all 200 of its tweets and classify individual tweets. When we have assigned predictions to all tweets of a Profile, they have to be aggregated, for instance with a majority vote or an additive attention layer, to obtain Profile labels. We chose instead to represent Profiles in batches of size 200, putting all tweets from one Profile into a single batch. The Profile vector is obtained using an attention layer over the CLS tokens of all tweets in a batch. This high-level author encoding vector is used as input to the final layer for the classification. We use the batch method for our system. During training, the cost function and the gradients are calculated for Profiles only. This is intended to avoid the noise created by over-assigning the Profile label to tweets that do not contribute to its assignment (an irony spreader may have a majority of tweets that are neutral, factual, and objective), and it produces better gradient updates and finally a richer final representation for a Profile.

5. Model Architecture
The BERT tokenizer (https://huggingface.co/transformers/v3.3.1/main_classes/tokenizer.html) prepares the input for the model. It splits the input into sub-word token strings, converts tokens to their ids, adds new tokens to the vocabulary, manages special tokens, pads, and truncates vectors. Initially, we called the tokenizer together with the model for each input batch. It turns out that the computational overhead of the subword tokenizer prevents the model from running properly with the large batch size. This was addressed by moving the tokenizer outside the model. We used PyTorch (https://pytorch.org/) and the SBERT architecture implemented by Huggingface (https://huggingface.co/transformers/v3.3.1/index.html), and ran models on one GPU. The longest tweet has over 600 tokens, which made it impossible to give all Profile tweets as one batch to the SBERT model. Since only 25 tweets have between 200 and about 600 tokens, we truncated tweets to a maximal length of 200 tokens and set padding to True. Thus the model input has dimensions of 200 tweets × 200 tokens × 384 hidden units (SBERT's default hidden size).
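The following PyTorch sketch illustrates how Equations (1)-(3) can be put together; it is an illustration rather than our released code, and the specific 384-dimensional checkpoint name, the use of the first ([CLS]) output position, and the training-loop details are assumptions.

```python
# Sketch of the Profile classifier of Eqs. (1)-(3): SBERT CLS vectors for 200 tweets,
# additive attention over tweets, and a linear layer for the two classes.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

ENCODER = "sentence-transformers/all-MiniLM-L6-v2"   # assumed 384-dimensional SBERT encoder

class ProfileClassifier(nn.Module):
    def __init__(self, encoder_name=ENCODER, hidden=384):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.w_att = nn.Linear(hidden, 1, bias=False)   # W_att of Eq. (2)
        self.classifier = nn.Linear(hidden, 2)          # W, b of Eq. (3)

    def forward(self, input_ids, attention_mask):
        # One Profile per call: its 200 tweets form the batch dimension of the encoder.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0, :]                         # Eq. (1): (200, 384)
        alpha = torch.softmax(self.w_att(h_cls).squeeze(-1), dim=0)    # weights over tweets
        h = alpha @ h_cls                                              # Eq. (2): (384,)
        return self.classifier(h)                                      # Eq. (3): class logits

tokenizer = AutoTokenizer.from_pretrained(ENCODER)
model = ProfileClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)
loss_fn = nn.CrossEntropyLoss()   # applies the softmax of Eq. (3) internally

def train_step(tweets, label):
    """One gradient step on a single Profile (a list of 200 tweet strings, label 0 or 1)."""
    enc = tokenizer(tweets, padding=True, truncation=True, max_length=200, return_tensors="pt")
    logits = model(enc["input_ids"], enc["attention_mask"])
    loss = loss_fn(logits.unsqueeze(0), torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As in Section 4.1, gradients in this sketch only flow from the Profile-level loss, not from per-tweet labels.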
5.1. Feature Exploration
5.1.1. Emojis
In many texts, and specifically in the context of short texts like tweets, emojis serve as a proxy for the emotional content of the text [14]. When using irony, authors may add emojis to the text to ensure that the double entendre does not go unnoticed. Emojis have both text descriptions and UNICODE values (https://unicode.org/emoji/charts/full-emoji-list.html). We choose the UNICODE values, to keep emojis distinct from text. In order to use emojis in SBERT, we add their UNICODE values to the SBERT vocabulary with random weight vectors.
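With the Huggingface tokenizer this amounts to registering the emoji characters as new tokens and resizing the embedding matrix, whose new rows are randomly initialized. The sketch below is illustrative rather than our exact code; the checkpoint name and the emoji list are assumptions (in practice the list contains every emoji observed in the training tweets).

```python
# Sketch: add emoji characters (their UNICODE values, not text descriptions) to the vocabulary.
from transformers import AutoModel, AutoTokenizer

ENCODER = "sentence-transformers/all-MiniLM-L6-v2"     # assumed SBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(ENCODER)
model = AutoModel.from_pretrained(ENCODER)

emoji_tokens = ["😂", "🙄", "😭", "🔥"]                  # illustrative subset of observed emojis
num_added = tokenizer.add_tokens(emoji_tokens)          # extend the subword vocabulary
model.resize_token_embeddings(len(tokenizer))           # new rows get random weight vectors

print(num_added, tokenizer.tokenize("Sure, that went great 🙄"))   # the emoji is now one token
```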
Table 1
F1 score for the positive class and accuracy of predicting irony and stereotype Profiles on validation data using BERT and SBERT models. Maximum length of tokens is 100.

Model    accuracy   f1
BERT        87      87
SBERT       94      95

Table 2
Effect of adding emojis to the vocabulary: f1 score and accuracy of predicting irony and stereotype Profiles on validation data for SBERT models from 5-fold cross-validation at epoch 6.

Model                accuracy   f1     std
SBERT with emojis      92.3     93.2   2.99
SBERT                  91.6     91.6   3.49

6. Experiments
We used 5-fold cross-validation during training for fine-tuning and to select the best performing model. The splits were fixed by setting the random state to fixed integer values. Adding emojis to the SBERT vocabulary improved performance during development and was retained for the submitted system.

Table 1 shows the results of feeding all tweets from a user as a batch to BERT and SBERT. For this table, we truncated tokenized tweets to a maximum length of 100, due to the much higher hidden dimension of BERT, which is 768; we limited the maximum length of tokenized tweets in SBERT to 100 as well. The result confirms that SBERT outperforms the BERT model. Table 2 shows that adding emojis slightly increases accuracy and, most interestingly, reduces the standard deviation.

7. Results
Table 3 shows our two competition runs of the same model. narcis ran for six epochs and obtained rank 10 with an accuracy of 0.93; harshv ran for seven epochs and obtained rank 2 with an accuracy of 0.98.

Table 3
Competition results

submission   epoch   accuracy
narcis         6       0.93
harshv         7       0.98

7.1. Analysis
To explore the features of the most important tweets for detecting authors who use irony, we selected the top 15 tweets according to attention scores. We counted the number of hashtags in the 15 tweets with the highest attention scores across ironic and non-ironic Profiles. Although all hashtags were mapped to HASHTAG, they are still good indicators for ironic Profiles. There are authors in both groups who have multiple hashtags in their top-ranked tweets, but ironic authors used them more frequently than others. We also find that authors of ironic Profiles write more short tweets than non-ironic authors.

8. Conclusion
We focused on trying to preserve the context of an entire Profile and chose to encode Profiles as single batches of size 200 with additive attention. In experiments on the training data, this performed better than the alternative of classifying individual tweets. We also saw an improvement in performance when adding all emojis to the language model vocabulary. And finally, SBERT, trained for sentence similarity, yielded better performance than BERT on the training data. The high performance of our system run for 7 epochs (rank 2) indicates that the training data represents the test data well. The performance of our system run for 6 epochs (rank 10) shows that subtle parameters can be more influential than system design.

References
[1] K. Buschmeier, P. Cimiano, R. Klinger, An impact analysis of features in a classification approach to irony detection in product reviews, in: Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics, Baltimore, Maryland, 2014.
[2] R. Pujari, E. Oveson, P. Kulkarni, E. Nouri, Reinforcement guided multi-task learning framework for low-resource stereotype detection, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022.
[3] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl, R. Ortega-Bueno, P. Pezik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection, in: A. Barrón-Cedeño, G. Da San Martino, et al. (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science, Springer, 2022.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019.
[5] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019.
[6] R. Ortega-Bueno, B. Chulvi, F. Rangel, P. Rosso, E. Fersini, Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2022.
[7] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019.
[8] C. Özdemir, S. Bergler, A comparative study of different sentiment lexica for sentiment analysis of tweets, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2015), 2015.
[9] C. Özdemir, S. Bergler, CLaC-SentiPipe: SemEval2015 Subtasks 10 b,e, and Task 11, in: Proceedings of SemEval 2015 at NAACL/HLT, 2015.
[10] D. I. H. Farías, V. Patti, P. Rosso, Irony detection in twitter: The role of affective content, ACM Trans. Internet Technol. 16 (2016).
[11] R. Labadie Tamayo, D. Castro Castro, R. Ortega Bueno, Deep Modeling of Latent Representations for Twitter Profiles on Hate Speech Spreaders Identification Task—Notebook for PAN at CLEF 2021, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[12] S. Seidman, Authorship verification using the impostors method, in: Notebook for PAN at CLEF 2013, 2013.
[13] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2015).
[14] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, S. Lehmann, Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017.