CISUC at IDPT2021: Traditional and Deep Learning for Irony Detection in Portuguese

Hugo Gonçalo Oliveira[0000-0002-5779-8645], José Pereira[0000-0001-8649-9006], and Guilherme Cruz

CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
hroliv@dei.uc.pt, {jose,gjcruz}@student.dei.uc.pt

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. These notes describe the participation of the CISUC team in the IDPT 2021 shared task. Irony detection was tackled as a text classification task, where both traditional and transformer-based (BERT) approaches were explored. The former performed reasonably, but not everything went well: the results achieved by BERT were not evaluated, due to an issue with our official submissions. Nevertheless, we still discuss some of the options taken, identify important features, and present validation results in the training data.

Keywords: Irony Detection · Portuguese · Text Classification · Transformers · Logistic Regression

1 Introduction

Irony is a rhetorical device whose interpretation should not be literal [18], because the literal meaning diverges significantly from, and is often the opposite of [7], the intended meaning. Irony detection is a subtask of Natural Language Processing aiming at the automatic classification of texts as ironic or not, and is extremely relevant for tasks like Sentiment Analysis and Opinion Mining [14]. But irony detection can be challenging, even for humans, who often rely on non-textual clues, like facial expression or tone of voice [7], for recognising irony. This is especially true when irony is expressed through text only, despite studies on identifying textual clues for irony detection [1].

Irony detection has been tackled by several Natural Language Processing (NLP) researchers, who adopted different approaches. In 2018, there was a SemEval task on Irony Detection in English Tweets [18] that covered the binary classification of tweets as ironic or not. The best systems adopted a deep learning approach, e.g., a densely connected LSTM neural network, based on pre-trained static word embeddings, with syntactic and sentiment features [19]. But there were also more traditional approaches, e.g., an ensemble classifier with Logistic Regression (LR) and a Support Vector Machine (SVM), considering pretrained word and emoji embeddings, as well as handcrafted sentiment and word-based features [13]. Since then, as happened for other NLP tasks, pre-trained language models such as BERT [3], or variations like RoBERTa [8], were exploited for obtaining contextualized embeddings, which can be combined with a classifier, e.g., a recurrent Convolutional Neural Network with an LSTM layer [11].

This paper describes the participation of a team from the Center of Informatics and Systems of the University of Coimbra (CISUC) in the Irony Detection in Portuguese (IDPT) task [2], included in the 2021 edition of the Iberian Languages Evaluation Forum (IberLEF). This was the first time we tackled irony detection, but our interest follows previous work on the classification of Portuguese text, specifically emotions [4] and humour [6].

Given that annotated datasets were made available by IDPT's organisation, one with tweets and another with news, we tackled IDPT with a supervised machine learning approach.
Classifiers were learned from the training data and used for classifying the test data, which was then submitted to be evaluated. Yet, our first step was to look at the data, in order to become more familiar with this domain. In the process, we noted some patterns and learned about the sources of the training data, which led to a data cleaning process, described in Section 2. Following this, we decided to explore both traditional text classification approaches and a more recent deep learning approach. The former required us to set some parameters, including the number of features, but it also enabled us to analyse and learn about the most important features for irony, at least in the provided datasets. Section 3 provides some insights on this process, including the addition of lexicon-based features. The same section describes the deep learning approach, based on the popular transformer-based architecture BERT [3]. For both approaches, we present the results of validation in the training data. Before concluding, Section 4 briefly discusses the official results of the selected classifiers in IDPT. Unfortunately, due to our own mistake in the submission process, the classifications by the BERT classifiers were not properly evaluated, which made it impossible for us to know their real performance in the test dataset. On the other hand, the performance of the traditional approach, based on Logistic Regression, was good enough for an approach that could be seen as a baseline. This conclusion is mostly based on the results in the news dataset. From the performance in the tweets dataset it is hard to draw conclusions. Even though the majority of tweets were automatically classified as ironic, according to the evaluation metrics, more than half of them were not. Apparently, this issue was common to all participants.

2 Data

Our starting point was the data provided by IDPT's organisation, namely 15,213 tweets and 18,495 news documents, labelled as ironic (1) or non-ironic (0), which we used for training our models. Test data comprised 300 unlabelled tweets and 300 news documents. For evaluation purposes, test data had to be submitted with automatically-assigned labels.

While analysing the aforementioned datasets, we immediately noticed what could be a discrepancy between training and test data for tweets. Unlike the test data, the training data contained few or no emojis, hashtags (#), user mentions (@) or URLs, and no line breaks. Having in mind that this could have a negative impact on the classification task, and that some of those features could be relevant for irony detection, we tried to understand the differences. During this process, we learned about the criteria adopted in the creation of the training data, after reading some of the references provided by the organisation [5, 15]. Specifically, we found the dataset created in the scope of da Silva's BSc thesis [15], which seemed to cover most of the tweets of the training data. However, this dataset was available in a slightly different format (https://github.com/fabio-ricardo/deteccao-ironia), where some of the aforementioned missing items were either directly present in the textual content or could be recovered from additional properties. One such property was the tweet ID, which enabled us to retrieve most of the original tweets through Twitter's API.
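The exact tooling used for this retrieval is not detailed here; the sketch below illustrates one possible way of doing it with the tweepy library and the Twitter API v2, where the bearer token and the list of tweet IDs are hypothetical placeholders.

# Hypothetical sketch of retrieving the original tweets by ID; tweepy and the
# Twitter API v2 are assumed here, not prescribed by the paper.
import tweepy

BEARER_TOKEN = "..."        # placeholder: API credentials
tweet_ids = [1234567890]    # placeholder: IDs recovered from da Silva's dataset

client = tweepy.Client(bearer_token=BEARER_TOKEN)

original_tweets = {}
# get_tweets accepts up to 100 IDs per request, so IDs are processed in batches
for start in range(0, len(tweet_ids), 100):
    batch = tweet_ids[start:start + 100]
    response = client.get_tweets(ids=batch)
    for tweet in response.data or []:
        original_tweets[tweet.id] = tweet.text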
With this, we confirmed the hashtag-based criteria adopted for automatically labelling the dataset:

– Ironic tweets were those containing the hashtags #ironia or #sarcasmo;
– Non-ironic tweets were those containing #economia, #politica or #educação.

Based on da Silva's thesis, we made our own pre-processing of the dataset, which included the complete removal of all five hashtags above, and the normalisation of user mentions and URLs, respectively replaced by @user and @link (we later noticed that using the '@' character was not the best option, because some tokenizers split it from the following word; however, the impact for Portuguese should still be minimal, because these words are in English). Table 1 illustrates this with tweets in the training dataset, as provided by the organisation, the original tweets as published on Twitter, and the result after our pre-processing. Differences from the provided dataset were the inclusion of emojis, the complete removal of the hashtags used for non-ironic tweets (e.g., #economia), and the normalisations.

Training      Bravo
Original      Bravo @JuanManSantos @PoliciaColombia @ELTIEMPO @elheraldoco @RevistaSemana @NoticiasRCN @NoticiasCaracol #ironia
Pre-processed Bravo @user @user @user @user @user @user @user @link

Training      5 MOTIVOS PARA AMAR O BRASIL: Old Corrupção Brasil
Original      5 MOTIVOS PARA AMAR O BRASIL: http://ow.ly/9vZX307MOdv #Old #Corrupção #Ironia #Brasil
Pre-processed 5 MOTIVOS PARA AMAR O BRASIL: @link #Old #Corrupção #Brasil

Training      economia KKKKKKKKKKKKK to rindo mas de nervoso rs
Original      #economia KKKKKKKKKKKKK to rindo mas de nervoso rs
Pre-processed KKKKKKKKKKKKK to rindo mas de nervoso rs

Table 1. Examples of tweet pre-processing.

3 Approaches

This section describes both approaches adopted in our participation in the IDPT task: a traditional machine learning approach, which could be seen as a baseline, and a deep learning approach based on BERT. Moreover, for each approach, validation results are presented and, for the traditional approach, we take a look at the most important features considered for detecting irony.

3.1 Traditional Approach

Different traditional machine learning classifiers, implemented in the Python library scikit-learn [10], were trained and validated in different splits of the training data. For this purpose, documents were represented by TF-IDF vectors, also resorting to scikit-learn's TfidfTransformer. Portuguese stopwords in the NLTK list were ignored in this process, and different parameters were tested, namely the n-gram range, maximum document frequency, minimum document frequency, and maximum number of features. While experimenting in the training datasets, we decided on setting:

– The n-gram range to 1 (unigrams), as we saw no improvements with bigrams;
– The maximum document frequency to 0.5, meaning that tokens occurring in more than half of the documents in the collection were ignored, for not being discriminant enough;
– The minimum document frequency to 3, meaning that tokens occurring in only one or two documents were ignored, for not being frequent enough.

We also tested different values for the maximum number of features.
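As an illustration of this setup, the following minimal sketch (not our exact code) builds a TF-IDF plus Logistic Regression pipeline with the parameters above; scikit-learn's TfidfVectorizer is used here as a shortcut for the counting and TfidfTransformer steps, and the training texts and labels are assumed to be already loaded from the IDPT data.

# Minimal sketch of the TF-IDF + Logistic Regression setup described above.
# nltk.download("stopwords") may be required once before running.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

pt_stopwords = stopwords.words("portuguese")  # NLTK Portuguese stopword list

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=pt_stopwords,
                              ngram_range=(1, 1),   # unigrams only
                              max_df=0.5,           # ignore tokens in >50% of documents
                              min_df=3,             # ignore tokens in fewer than 3 documents
                              max_features=1500)),  # 500, 1,500 and 5,000 were tested
    ("clf", LogisticRegression(max_iter=1000)),
])

# train_texts (list of str) and train_labels (list of 0/1) are placeholders,
# loaded from the IDPT training data (loading not shown).
# 10-fold cross-validation, reporting the task's official metric (balanced accuracy)
# and weighted F1, as in Tables 2 and 3.
scores = cross_validate(pipeline, train_texts, train_labels, cv=10,
                        scoring=["balanced_accuracy", "f1_weighted"])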
Cross-Validation

Tables 2 and 3 report on the performance of three different classifiers in a 10-fold cross-validation, in the tweets and news training datasets respectively, using different numbers of features (500, 1,500 and 5,000). The three classifiers used were Logistic Regression (LR), Naive Bayes (NB) and Random Forest (RF), all white-box, and the metrics considered were: Balanced Accuracy (BAcc), for being the official measure of IDPT; Precision; Recall; and F1 score (F1).

The achieved performances are interesting for a baseline. As expected, performance is slightly higher for news, which should be more formal, than for tweets, where several writing conventions are broken. But top BAcc scores of 89% and 97% may actually suggest that irony detection is not that hard, especially in formal text.

Classifier  Features  BAcc       Precision  Recall     F1
LR          500       0.82±0.03  0.92±0.01  0.92±0.01  0.92±0.01
            1500      0.84±0.02  0.94±0.01  0.94±0.01  0.94±0.01
            5000      0.80±0.03  0.93±0.01  0.93±0.01  0.93±0.01
NB          500       0.84±0.03  0.93±0.01  0.93±0.01  0.93±0.01
            1500      0.88±0.02  0.95±0.01  0.95±0.01  0.95±0.01
            5000      0.89±0.02  0.95±0.01  0.95±0.01  0.95±0.01
RF          500       0.84±0.02  0.92±0.01  0.92±0.01  0.92±0.01
            1500      0.87±0.02  0.93±0.01  0.93±0.01  0.93±0.01
            5000      0.87±0.02  0.94±0.01  0.94±0.01  0.94±0.01

Table 2. Performance in 10-fold cross-validation in the Tweets training set.

Classifier  Features  BAcc       Precision  Recall     F1
LR          500       0.95±0.00  0.95±0.00  0.95±0.00  0.95±0.00
            1500      0.96±0.00  0.97±0.00  0.97±0.00  0.97±0.00
            5000      0.97±0.00  0.97±0.00  0.97±0.00  0.97±0.00
NB          500       0.92±0.01  0.92±0.01  0.92±0.01  0.92±0.01
            1500      0.94±0.01  0.94±0.01  0.94±0.01  0.94±0.01
            5000      0.96±0.00  0.96±0.00  0.96±0.00  0.96±0.00
RF          500       0.94±0.01  0.94±0.01  0.94±0.01  0.94±0.01
            1500      0.95±0.01  0.95±0.01  0.95±0.01  0.95±0.01
            5000      0.96±0.01  0.96±0.00  0.96±0.00  0.96±0.00

Table 3. Performance in 10-fold cross-validation in the News training set.

However, these performances are achieved with 5,000 features, which probably leads to models that are over-fitted to the training dataset. Therefore, having in mind that the documents in the test datasets were extracted from a different time period than the training data, and that there could be significant vocabulary differences (e.g., due to different trending topics), we decided to consider no more than 1,500 features for our submission. Validation performances suggest that a lower number of features has a bigger impact on NB and that, on the other hand, LR is less affected. In fact, for tweets, LR performs better with 1,500 than with 5,000 features. Adding to this the simplicity of LR and its best performance in the news dataset, we decided to use LR in our official IDPT runs.

Lexicon-based features

We further decided to explore additional features that we thought could be useful for irony detection, namely:

– Concreteness and imageability scores, obtained from the Minho Word Pool norms [16], where 3,800 Portuguese words have averages of such properties, from 1 to 7, assigned by several judges;
– Sentiment and emotion features, acquired from the NRC Emotion Lexicon [9], where such features (0 or 1) are assigned to 14,182 English words through crowdsourcing, then translated to other languages, including Portuguese.

This resulted in ten extra features, averaged for each document: Concreteness, Imageability, Positive, Negative, Anger, Anticipation, Disgust, Fear, Joy and Trust. Our intuition is that these could complement the TF-IDF features because, indirectly, they end up covering a larger vocabulary, more focused and independent of the training data, and may thus lead to less over-fitting. Since the entries in the previous lexicons are all lemmas, for computing these features, documents were first lemmatized, using the Portuguese models of the Stanza [12] package.
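The sketch below illustrates, under the assumption that the lexicons have been loaded into a dictionary keyed by lemma, how such document-level averages can be computed after lemmatization with Stanza; variable and function names are hypothetical.

# Sketch of the extra lexicon-based features: documents are lemmatized with Stanza
# and the lexicon scores of the lemmas found in each document are averaged,
# yielding one value per feature.
import stanza

stanza.download("pt")  # only needed once
nlp = stanza.Pipeline("pt", processors="tokenize,mwt,pos,lemma")

# The lexicons are assumed to be merged into one dictionary mapping a lemma to its
# scores, e.g. lexicon["adorar"] = {"Concreteness": 2.1, "Imageability": 3.0, "Joy": 1}
FEATURES = ["Concreteness", "Imageability", "Positive", "Negative",
            "Anger", "Anticipation", "Disgust", "Fear", "Joy", "Trust"]

def lexicon_features(text: str, lexicon: dict) -> list:
    """Average the lexicon scores of the lemmas in `text`, one value per feature."""
    doc = nlp(text)
    lemmas = [word.lemma.lower() for sent in doc.sentences for word in sent.words
              if word.lemma is not None]
    features = []
    for feat in FEATURES:
        scores = [lexicon[lemma][feat] for lemma in lemmas
                  if lemma in lexicon and feat in lexicon[lemma]]
        features.append(sum(scores) / len(scores) if scores else 0.0)
    return features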
Table 4 shows the performance of the LR classifier using only the extra features or adding them to the 1,500 TF-IDF features. When used alone, these features are of little use for the Tweets (BAcc of 0.50), but they seem to carry some signal for the News, where they achieve an F1 of 0.71. When added to the TF-IDF features, however, F1 drops by 1 point in comparison to using TF-IDF alone.

Data    Features      BAcc       Precision  Recall     F1
Tweets  Extra         0.50±0.00  0.82±0.00  0.82±0.00  0.82±0.00
        TF-IDF+Extra  0.82±0.02  0.92±0.01  0.92±0.01  0.92±0.01
News    Extra         0.68±0.01  0.71±0.01  0.71±0.01  0.71±0.01
        TF-IDF+Extra  0.95±0.00  0.96±0.00  0.96±0.00  0.96±0.00

Table 4. Performance of LR in 10-fold cross-validation in the training sets.

Our option for including these features anyway is further supported by an analysis of their importance coefficients in an LR classifier that learned from them only, and of their values in documents of different classes, especially in the News dataset. For instance, ironic news express slightly more joy, negativity, disgust and anticipation, and are also more imageable.

Feature Importance

After training an LR classifier, each feature has an importance coefficient, which can be useful for interpretation. Table 5 shows the most important features when the previous classifier is trained in the training datasets, as well as the number of documents where they occur and the proportion classified as ironic. Some interesting insights can be observed. For instance, most tweets with user mentions (normalised as '@user') are ironic, and so are more "extreme" tweets that use words like 'adoro', 'tudo' or 'nada'. As for the news, many relevant features for irony are names of politicians, suggesting that they are common targets of irony, or were during the time-span in which the data was collected. Other features include words that typically appear before a citation, namely 'disse' and 'explicou'.

Tweets                        News
Feature  Docs   Ironic        Feature   Docs   Ironic
user     3,118  91.4%         disse     3,522  87.7%
pra        617   1.1%         temer       936  95.3%
ironía     261  100%          dilma     1,334  95.6%
tá         277  99.6%         explicou    948  98.3%
adoro      157  100%          cunha       460  95.0%
tudo       231  97.4%         aécio       535  97.2%
nada       227  99.6%         agora     1,637  78.6%
pq         210  100%          hoje      1,887  78.7%

Table 5. Most relevant features in the Tweets (left) and News (right) training datasets.

3.2 Transformer-Based Approach

BERT [3] is a transformer-based model widely used in Natural Language Processing since its release by Google. It is pretrained in two general language tasks, masked language modelling and next sentence prediction, but can be fine-tuned for other tasks, including text classification, which is our case.

Fine-tuning

Our starting point was BERTimbau [17], i.e., BERT Base Portuguese Cased (BERT-PT), a model with 110M parameters, pretrained by Neuralmind exclusively for (Brazilian) Portuguese. In order to fine-tune this model for irony classification, we used the BertForSequenceClassification class of the Transformers library (see https://huggingface.co/transformers/model_doc/bert.html), which adds a classification head on top of BERT. Parameters for this model were empirically selected, namely: batch size of 16 for the tweets and 8 for the news, due to memory limitations; and the Adam optimizer (see https://huggingface.co/transformers/main_classes/optimizer_schedules.html), for being the common option, with lr=2e-5 and eps=1e-8.
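The following minimal sketch (not our exact code) illustrates this setup with BERTimbau and the Transformers library; data loading, batching and the training loop are heavily simplified, and the example texts and labels are placeholders.

# Sketch of fine-tuning BERTimbau for binary irony classification.
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

MODEL = "neuralmind/bert-base-portuguese-cased"  # BERTimbau (BERT-PT)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Placeholder examples; the real inputs come from the IDPT training data.
train_texts = ["Grande dia para a economia!", "Adoro acordar às 6h para trabalhar"]
train_labels = [0, 1]

# Texts longer than BERT's limit are truncated to 512 tokens
# ([CLS] + 510 word pieces + [SEP]); see the Text Size discussion below.
encodings = tokenizer(train_texts, truncation=True, max_length=512,
                      padding=True, return_tensors="pt")
labels = torch.tensor(train_labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)

# One illustrative training step; in practice, batches of 16 (tweets) or
# 8 (news) documents are iterated over for a few epochs.
model.train()
outputs = model(**encodings, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()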
Text Size

We quickly came across a limitation on the text size, i.e., some documents were longer than the maximum number of tokens that BERT can handle (510 word pieces, plus the initial [CLS] and the final [SEP] tokens). Figure 1 shows the distribution of documents according to the number of tokens in both training sets. As expected, exceeding this limit is much more frequent in the news dataset, as news articles tend to be longer than tweets. Still, after careful analysis and deliberation, we assumed that the proportion of documents exceeding the token limit was insignificant, and that their absence would not produce a noticeable change in the model's overall performance. This left us with two choices: remove the longer documents from the dataset or truncate them. We chose the latter for several reasons: the first 510 tokens of each document would still be relevant for irony detection and, this way, the classifier would learn from all data; moreover, documents in the test dataset could also exceed this limit and we could not simply remove them.

Fig. 1. Histograms with the number of tokens in the tweets (left) and news (right) training datasets.

Validation results

In order to select the aforementioned parameters, the BERT-based classifier was validated in the training dataset. For this, we used 60% of the data for training, 10% for validation and 30% for testing. Table 6 summarises the performance of the best models of each kind.

Data    BAcc  Precision  Recall  F1
Tweets  0.97  0.99       1.00    0.99
News    0.98  0.99       0.99    0.99

Table 6. Performance of BERT-based classifiers in the training sets.

Validation performances achieved with BERT are very high in both datasets and outperform the already high F1 of the best traditional approaches, confirming that BERT is a very powerful model. Additional experiments were performed with balanced versions of the datasets, obtained with undersampling, but they did not lead to further improvements. We should, nevertheless, recall that these results may be over-fitted to the training data.

4 Results

Despite all the experiments performed with BERT and our positive expectations regarding their high performance, due to a mistake in our submission, it is impossible for us to draw any conclusions on the real performance of the BERT-based models, or on their comparison to other approaches, including our traditional approach, at least until the labels of the test data are revealed. More precisely, our script had a switch for shuffling the data to label, which was on when the test data was classified, meaning that the submitted labels were not in the expected order for the official evaluation.

As for the traditional approach, official results are in Table 7. We recall that the model used is based on LR, with 1,500 token features plus 10 lexicon-based features. It achieved the sixth position overall in the tweets dataset, but its performance was not much different from that of other participants, nor from our BERT submissions where the labels were shuffled. In fact, when analysing our labels, we note that the majority of the tweets were classified as ironic, which results in a recall close to 100%. However, precision is lower than 50%, suggesting that about half of the tweets were not ironic and should not have been classified as such.

Until the test labels are revealed, it is not possible to perform an error analysis. However, we believe that performances in the tweet dataset were harmed by the criteria adopted in the creation of the data, which we suspect to have diverged between training and test.
For instance, in da Silva's thesis [15], one of the criteria for labelling tweets as ironic was the presence of the hashtag #ironia, i.e., all the tweets using this hashtag were considered to be positive examples of ironic tweets, and could be included in the training data as such. Yet, when we searched Twitter for the tweets of the IDPT test data, all of them used this hashtag, even if, according to the results, more than half of them were not labelled as ironic. This was probably the result of manual analysis, and should definitely be more accurate than da Silva's criteria, which would automatically label them as ironic. However, this also means that the provided training data was misleading and, as we have seen, classifiers trained on such data are not apt for correctly labelling IDPT's tweet test data.

Run       BAcc   Precision  Recall  F1
tweets 3  0.502  0.412      0.992   0.581
news 3    0.802  0.681      0.852   0.757

Table 7. Official results of the traditional approach in the test datasets.

On the news data, our traditional approach achieved the eighth position overall, with a BAcc that was 12 points below the best run. The name of the team that submitted the three best runs (TeamBERT4EVER) suggests that they used BERT, which confirms that irony detection is one more task where BERT is currently the way to achieve the state of the art. Nevertheless, we highlight that, in the traditional approach, a white-box model was used, which enabled us to learn a bit more about how irony is expressed in the datasets, e.g., through important features, most of which we would not immediately associate with irony.

5 Conclusion

We have described the participation of the CISUC team in the IDPT 2021 shared task. Despite our issues with the BERT models, the balance is still positive. This participation led to the application of known approaches to a new challenge, made us think about the relevance of irony, and taught us a little bit about the way it is expressed in Portuguese.

We tackled this challenge as a text classification task and explored both traditional and deep learning approaches. As expected, deep learning seems to be the best path to achieving top performances, and BERT is definitely a solid model for attempting state-of-the-art results. Still, sometimes, learning about language and how it works is at least as important as achieving the best performances, and white-box models are much more accessible for this purpose. Unfortunately, until the labels of the test data are revealed, we cannot compare the performance of the latter with our BERT-based approaches in a real scenario, and thus cannot analyse the trade-off. The same applies to the comparison with the runs of other teams. In the news test set, our LR-based approach was ranked eighth, and it will definitely be interesting to learn about the other approaches, once the proceedings of IDPT are published.

Now that we have had our first contact with this topic, there are plenty of ideas for future work. A possible direction would be studying to what extent it is possible to learn a general classifier of irony, not tied to a specific type of text or time-span. We did train a BERT-based model on both training datasets (tweets and news), but it was one of the corrupted submissions. A train-validation-test split on this combined dataset led to a surprisingly high performance, i.e., comparable to the performance of the type-specific models. But stronger conclusions can only be drawn once we actually test the model on the test data.
Still, more than learning from two (or more) different types of text, a general classifier would also have to be trained on texts published in different time-spans. The latter is relevant because classifiers learn from the vocabulary used and, during specific time periods, some entities (e.g., politicians, athletes, organisations) can be or become preferred targets of irony, thus skewing the model's assessment of the words associated with those entities.

Another interesting direction would be a deeper analysis of the actual impact of different features, not only tokens, but also n-grams, case (upper or lower), punctuation, emojis and lexicon-based features, among others. Besides possibly improving the traditional approaches, some of those features could also be appended to the inputs of the BERT-based classifiers.

Finally, one could exploit other available corpora for irony detection, possibly starting with a corpus of humour [6], which typically resorts to irony. More data could also be retrieved from Twitter, possibly relying on additional heuristics for self-labelling (e.g., specific emojis). However, given the specificities of irony, data quality is especially important. Therefore, any automatically-created dataset should be manually revised.

Acknowledgements

This work is partially funded by national funds through the FCT – Foundation for Science and Technology, I.P., within the scope of the project CISUC – UID/CEC/00326/2020 and by the European Social Fund, through the Regional Operational Program Centro 2020.

References

1. Carvalho, P., Sarmento, L., Silva, M.J., De Oliveira, E.: Clues for detecting irony in user-generated contents: oh...!! it's "so easy" ;-). In: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion. pp. 53–56 (2009)
2. Corrêa, U.B., dos Santos, L.P., Coelho, L., de Freitas, L.A.: Overview of the IDPT task on Irony Detection in Portuguese at IberLEF 2021. Procesamiento del Lenguaje Natural 67 (2021)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (Jun 2019)
4. Duarte, L., Macedo, L., Gonçalo Oliveira, H.: Exploring emojis for emotion recognition in Portuguese text. In: Proceedings of the 19th EPIA Conference on Artificial Intelligence, EPIA 2019, Vila Real, Portugal, September 3-6, 2019, Part I. LNCS/LNAI, vol. 11805, pp. 719–730. Springer (September 2019)
5. de Freitas, L.A., Vanin, A.A., Hogetop, D.N., Bochernitsan, M.N., Vieira, R.: Pathways for irony detection in tweets. In: Proceedings of the 29th Annual ACM Symposium on Applied Computing. pp. 628–633 (2014)
6. Gonçalo Oliveira, H., Clemêncio, A., Alves, A.: Corpora and baselines for humour recognition in Portuguese. In: Proceedings of the 12th International Conference on Language Resources and Evaluation. pp. 1278–1285. LREC 2020, ELRA, Marseille, France (2020)
7. Grice, H.P.: Logic and conversation. In: Speech Acts, pp. 41–58. Brill (1975)
8. Liu, B.: Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5(1), 1–167 (2012)
9. Mohammad, S.M., Turney, P.D.: Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29(3), 436–465 (2013)
10. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
11. Potamias, R.A., Siolas, G., Stafylopatis, A.G.: A transformer-based approach to irony and sarcasm detection. Neural Computing and Applications 32(23), 17309–17320 (2020)
12. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: A Python natural language processing toolkit for many human languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2020)
13. Rohanian, O., Taslimipoor, S., Evans, R., Mitkov, R.: WLV at SemEval-2018 task 3: Dissecting tweets in search of irony. In: Proceedings of the 12th International Workshop on Semantic Evaluation. pp. 553–559. Association for Computational Linguistics, New Orleans, Louisiana (2018)
14. Sarmento, L., Carvalho, P., Silva, M.J., De Oliveira, E.: Automatic creation of a reference corpus for political opinion mining in user-generated content. In: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion. pp. 29–36 (2009)
15. da Silva, F.R.A.: Detecção de ironia e sarcasmo em língua portuguesa: Uma abordagem utilizando deep learning. Tech. rep., Universidade Federal de Mato Grosso (2018)
16. Soares, A.P., Costa, A.S., Machado, J., Comesaña, M., Oliveira, H.M.: The Minho Word Pool: Norms for imageability, concreteness, and subjective frequency for 3,800 Portuguese words. Behavior Research Methods 49(3), 1065–1081 (2017)
17. Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: Pretrained BERT models for Brazilian Portuguese. In: Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS 2020). LNCS, vol. 12319, pp. 403–417. Springer, Cham (2020)
18. Van Hee, C., Lefever, E., Hoste, V.: SemEval-2018 task 3: Irony detection in English tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation. pp. 39–50 (2018)
19. Wu, C., Wu, F., Wu, S., Liu, J., Yuan, Z., Huang, Y.: THU NGN at SemEval-2018 task 3: Tweet irony detection with densely connected LSTM and multi-task learning. In: Proceedings of the 12th International Workshop on Semantic Evaluation. pp. 51–56. Association for Computational Linguistics, New Orleans, Louisiana (2018)