<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CISUC at IDPT2021: Traditional and Deep Learning for Irony Detection in Portuguese</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>CISUC, Department of Informatics Engineering, University of Coimbra</institution>
          ,
          <addr-line>Coimbra</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>These notes describe the participation of the CISUC team in the IDPT 2021 shared task. Irony detection was tackled as a text classification task, where both traditional and transformer-based (BERT) approaches were explored. The former performed reasonably well, but not everything went as planned: the results achieved by BERT were not evaluated, due to an issue with our official submissions. Nevertheless, we still discuss some of the options taken, identify important features, and present validation results on the training data.</p>
      </abstract>
      <kwd-group>
        <kwd>Irony Detection</kwd>
        <kwd>Portuguese</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Transformers</kwd>
        <kwd>Logistic Regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Irony is a rhetorical device where interpretation should not be literal [18], because
its meaning diverges significantly from, and is often the opposite [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] of, the
intended meaning. Irony detection is a subtask of Natural Language Processing
aiming at the automatic classification of texts as ironic or not, and is extremely
relevant for tasks like Sentiment Analysis and Opinion Mining [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. But irony
detection can be challenging, even for humans, who often rely on visual clues,
like facial expression or tone [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], for recognising irony. This is especially true
when irony is expressed through text only, despite studies on identifying textual
clues for irony detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Irony detection has been tackled by several Natural Language
Processing (NLP) researchers, who adopted different approaches. In 2018, there was
a SemEval task on Irony Detection in English Tweets [18] that covered the
binary classification of tweets as ironic or not. The best systems adopted a deep
learning approach, e.g., a densely connected LSTM neural network, based on pre-trained
static word embeddings, with syntactic and sentiment features [19]. But there
were also more traditional approaches, e.g., an ensemble classifier with Logistic</p>
      <p>
        Regression (LR) and a Support Vector Machine (SVM), considering pretrained
word and emoji embeddings, as well as handcrafted sentiment and word-based
features [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Since then, as it happened for other NLP tasks, pre-trained
language models such as BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or variations like RoBERTa [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], were exploited
for obtaining contextualized embeddings, which can be combined with a
classifier, e.g., a recurrent Convolutional Neural Network with an LSTM layer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        This paper describes the participation of a team from the Center of
Informatics and Systems of the University of Coimbra (CISUC) in the Irony Detection in
Portuguese (IDPT) task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], included in the 2021 edition of the Iberian Languages
Evaluation Forum (IberLEF). This was the first time we tackled irony detection,
but our interest follows previous work on text classification of Portuguese text,
specifically emotions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and humour [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Given that annotated datasets were made available by IDPT's organisation,
one with tweets and another with news, we tackled IDPT with a supervised
machine learning approach. Classifiers were learned from the training data and used
for classifying the test data, then submitted for evaluation. Yet, our first step
was to look at the data, in order to become familiar with this domain.
In the process, we noted some patterns and learned about the sources of the
training data, which led to a data cleaning process, described in Section 2.
Following this, we decided to explore both traditional text classification approaches
and a more recent deep learning approach. The former required us to set
some parameters, including the number of features, but it also enabled us to
analyse and learn about the most important features for irony, at least in the
provided datasets. For this approach, Section 3 provides some insights on that
process, including the addition of lexicon features. The same section
describes the deep learning approach, based on the popular transformer-based
architecture BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For both approaches, we present the results of validation
in the training data.
      </p>
      <p>Before concluding, Section 4 has a brief discussion on the official results of
the selected classifiers in IDPT. Unfortunately, due to our own mistake in
the submission process, classifications by the BERT classifiers were not properly
evaluated, which made it impossible for us to know their real performance on the
test dataset. On the other hand, the performance of the traditional approach,
based on Logistic Regression, was good enough for an approach that could be
seen as a baseline. This conclusion is mostly based on the results in the news
dataset. From the performance in the tweets dataset it is hard to draw
conclusions. Even though the majority of tweets was automatically classified as ironic,
according to the evaluation metrics, more than half were not. Apparently, this
issue was common to all participants.</p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>Our starting point was the data provided by IDPT's organisation, namely 15,213
tweets and 18,495 news documents, labelled as ironic (1) or non-ironic (0), which
we used for training our models. Test data comprised 300 unlabelled tweets and
300 news documents. For evaluation purposes, test data had to be submitted
with automatically-assigned labels.</p>
      <p>While analysing the aforementioned datasets, we immediately noticed what
could be a discrepancy between training and test data for tweets. Unlike the
test data, training data contained few to no emojis, hashtags (#), user
mentions (@), or URLs, and no line breaks. Having in mind that this could have a
negative impact on the classification task, and that some of those features could
be relevant for irony detection, we tried to understand the differences.</p>
      <p>
        During this process, we learned about the criteria adopted in the creation of
the training data, after reading some of the references provided by the
organisation [
        <xref ref-type="bibr" rid="ref15 ref5">5, 15</xref>
]. Specifically, we found the dataset created in the scope of da Silva's
BSc thesis [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which seemed to cover most of the tweets of the training data.
However, this dataset was available1 in a slightly different format, where some of
the aforementioned missing items were either directly present in the textual content
or could be recovered from additional properties.
      </p>
      <p>One such property was the tweet ID, which enabled us to retrieve most of
the original tweets through Twitter's API. With this, we confirmed the
hashtag-based criteria adopted for automatically labelling the dataset:
– Ironic tweets were those containing the hashtags #ironia or #sarcasmo;
– Non-Ironic tweets were those containing #economia, #politica or
#educação.</p>
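      <p>This self-labelling criterion can be sketched as a small function; this is our illustration of the described rule, not da Silva's actual code, and the exact token-level hashtag matching is our assumption:</p>

```python
IRONIC_TAGS = {"#ironia", "#sarcasmo"}
NON_IRONIC_TAGS = {"#economia", "#politica", "#educação"}

def hashtag_label(tweet):
    """Return 1 for tweets with an irony hashtag, 0 for tweets with one of
    the topic hashtags, and None when no labelling hashtag is present."""
    tokens = {t.lower() for t in tweet.split()}
    if tokens & IRONIC_TAGS:
        return 1
    if tokens & NON_IRONIC_TAGS:
        return 0
    return None
```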
      <p>Based on da Silva's thesis, we made our own pre-processing of the dataset, which
included: the complete removal of all five hashtags above; and the normalisation of
user mentions and URLs, respectively replaced by @user and @link2. Table 1
illustrates this with tweets in the training dataset, provided by the
organisation, the original tweets as published on Twitter, and the result after our
pre-processing. Differences from the provided dataset were the inclusion of emojis,
the complete removal of hashtags used for non-ironic tweets (e.g., #economia),
as well as the normalisations.</p>
    </sec>
    <sec id="sec-3">
      <title>Approaches</title>
      <p>This section describes both approaches adopted in our participation in the IDPT
task, a traditional machine learning approach, which could be seen as a baseline,
and a deep learning approach based on BERT. Moreover, for each approach,
validation results are presented and, for the traditional approach, we take a look
at important features considered for detecting irony.</p>
      <sec id="sec-3-1">
        <title>1 https://github.com/fabio-ricardo/deteccao-ironia</title>
        <p>
          2 We later noticed that using the '@' character was not the best option, because some
tokenizers split it from the following word. However, the impact for Portuguese
should still be minimal, because these words are in English.
Different traditional machine learning classifiers, implemented in the Python
library scikit-learn [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], were trained and validated in different splits of the
training data. For this purpose, documents were represented by TF-IDF
vectors, also resorting to scikit-learn's TfidfTransformer. Portuguese stopwords in
the NLTK list were ignored in this process and different parameters were tested,
namely the n-gram range, maximum document frequency, minimum document
frequency, and maximum number of features. While experimenting in the
training datasets, we decided on setting:
– N-gram range to 1 (unigrams), as we saw no improvements with bigrams;
– The maximum document frequency to 0.5, meaning that tokens occurring
in more than half of the documents in the collection were ignored, for not
being discriminant enough;
– The minimum document frequency to 3, meaning that tokens occurring in
only one or two documents were ignored, for not being frequent enough.
We also tested different values for the maximum number of features.
Cross-Validation Tables 2 and 3 report on the performance of three different
classifiers in a 10-fold cross-validation, respectively in the tweets and news
training datasets, using different numbers of features (500, 1,500 and 5,000).
The three classifiers used were Logistic Regression (LR), Naive Bayes (NB), and
Random Forest (RF), all white-box, and the metrics considered were: Balanced
Accuracy (BAcc), for being the official measure of IDPT; Precision, Recall, and
F1 score (F1).
        </p>
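        <p>The setup above can be sketched with scikit-learn. This is a minimal illustration under our own assumptions: the helper name and the five-word stop-word list are stand-ins (the paper uses the NLTK Portuguese list), and TfidfVectorizer bundles token counting with the TfidfTransformer step mentioned above:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in for NLTK's Portuguese stop-word list.
STOPWORDS = ["de", "a", "o", "que", "e"]

def build_irony_classifier(max_features=1500):
    """TF-IDF + Logistic Regression pipeline with the parameters chosen
    in the paper: unigrams only, max_df=0.5, min_df=3, capped features."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 1),        # unigrams only
        max_df=0.5,                # drop tokens in over half the documents
        min_df=3,                  # drop tokens in fewer than 3 documents
        max_features=max_features,
        stop_words=STOPWORDS,
    )
    return make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
```

The same pipeline can be passed to scikit-learn's cross_val_score with scoring="balanced_accuracy" to reproduce the 10-fold validation setting.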
        <p>
          Achieved performances are interesting for a baseline. As expected,
performance is slightly higher for news, which should be more formal, than for tweets,
where several conventions are broken. But top F1 scores of 89% and 97% may
actually suggest that irony detection is not that hard, especially in formal text.
However, these performances are achieved with 5,000 features, which probably
leads to models that are overfitted to the training dataset. Therefore, having
in mind that the documents in the test datasets were extracted from a different
time period than the training data, and there could be significant vocabulary
differences (e.g., due to different trending topics), we decided to consider not more
than 1,500 features for our submission. Validation performances suggest that a
lower number of features has a bigger impact for NB and that, on the other
hand, LR is less affected. In fact, for tweets, LR performs better with 1,500 than
with 5,000 features. Adding to the simplicity of LR and to its best performance
in the news dataset, we decided to use LR in our official IDPT runs.
Lexicon-based features We further decided to explore additional features
that we thought could be useful for irony detection, namely:
– Concreteness and imageability scores, obtained from the Minho Word Pool
norms [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], where 3,800 Portuguese words have averages of such properties,
from 1 to 7, assigned by several judges;
– Sentiment and emotion features, acquired from the NRC Emotion lexicon [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ],
where such features (0 or 1) are assigned to 14,182 English words through
crowdsourcing, then translated to other languages, including Portuguese.
This resulted in ten extra features, averaged for each document: Concreteness,
Imageability, Positive, Negative, Anger, Anticipation, Disgust, Fear, Joy, Trust.
Our intuition is that these could complement the TF-IDF features, because,
indirectly, they end up covering a larger vocabulary, more focused and independent
of the training data, and may thus lead to less overfitting. Since the entries in
the previous lexicons are all lemmas, for computing these features, documents
were first lemmatized, using the Portuguese models of the Stanza [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] package.
        </p>
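        <p>A minimal sketch of how such averaged features can be computed, assuming lemmatization has already been done (the paper uses Stanza for this) and choosing, as one simple option not stated in the paper, to count out-of-lexicon lemmas as 0:</p>

```python
def lexicon_features(lemmas, lexicon):
    """Average each lexicon dimension over the lemmas of one document.

    `lexicon` maps a lemma to a dict of scores (e.g. concreteness,
    joy, ...); lemmas missing from the lexicon contribute 0.
    """
    dims = sorted({d for scores in lexicon.values() for d in scores})
    feats = {}
    for d in dims:
        vals = [lexicon.get(lemma, {}).get(d, 0.0) for lemma in lemmas]
        feats[d] = sum(vals) / len(vals) if vals else 0.0
    return feats
```

In the full pipeline, the resulting ten averages are simply appended to the TF-IDF vector of each document.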
        <p>Table 4 shows the performance of the LR classifier using only the extra
features or adding them to the 1,500 TF-IDF features. When used alone, their
impact is irrelevant for the Tweets, but they seem to make a difference for the
News. Alone, they achieve an F1 of 0.71, but when combined with TF-IDF, F1
drops by 1 point.</p>
        <p>Our option for including these features anyway is further supported by an
analysis of their importance coefficients in an LR classifier that learned from them
only, and of their values in documents of different classes, especially in the News
dataset. For instance, ironic news express slightly more joy, negativity, disgust
and anticipation, and also score higher on imageability.</p>
        <p>Feature Importance After training an LR classifier, each feature has an
importance coefficient, which can be useful for interpretation. Table 5 shows the
most important features when the previous classifier is trained on the training
datasets, as well as the number of documents where they occur and the
proportion classified as ironic. Some interesting insights can be observed. For instance,
most tweets with user mentions (normalised as '@user') are ironic, and so are
more "extreme" tweets that use words like 'adoro', 'tudo' or 'nada'. As for the
news, many relevant features for irony are names of politicians, suggesting that
they are common targets of irony, or were during the time-span the data was
collected. Other features include words that typically appear before a citation,
namely 'disse' and 'explicou'.</p>
        <p>
          Transformer-Based Approach
BERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is a transformer-based model widely used in Natural Language
Processing since its release by Google. It is pretrained on two general language tasks,
masked language modelling and next sentence prediction, but can be fine-tuned
for other tasks, including text classification, which is our case.
Fine-tuning Our starting point was BERTimbau [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], i.e., BERT Base
Portuguese Cased (BERT-PT), a model with 110M parameters, pretrained by
Neuralmind exclusively for (Brazilian) Portuguese. In order to fine-tune this model
for irony classification, we used the BertForSequenceClassification class of the
Transformers library3, which adds a classification head on top of BERT.
Parameters for this model were empirically selected, namely: batch size of 16 for the
tweets and 8 for the news, due to memory limitations; and the Adam optimizer4, for
being the common option, with lr=2e-5 and eps=1e-8.
        </p>
        <p>Text Size We quickly came across a limitation on the text size, i.e., some
documents were longer than the maximum number of tokens that BERT can
handle (510 word pieces, plus the initial [CLS] and the final [SEP] tokens).
Figure 1 shows the distribution of documents according to the number of tokens in
both training sets. As expected, exceeding the limit is much more frequent in the news dataset,
as news articles tend to be longer than tweets. Still, after careful analysis and
deliberation, we assumed that the proportion of documents that exceeded the
limit of tokens was insignificant and deemed that their absence would not
produce a noticeable change in the model's overall performance. This left us with
two choices: remove the longer documents from the dataset or truncate them.
We chose the latter for several reasons. The first 510 tokens of each document
would still be relevant for irony detection and, this way, the classifier would learn
from all the data. Moreover, documents in the test dataset could also exceed this
limit and we could not simply remove them.</p>
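        <p>The truncation step can be sketched as follows, assuming the input is already a list of word pieces; in practice a BERT tokenizer handles the special tokens internally, so this is only an illustration of the limit:</p>

```python
def truncate_for_bert(tokens, max_tokens=510):
    """Keep only the first 510 word pieces so that, together with the
    [CLS] and [SEP] special tokens, the sequence fits BERT's 512 limit."""
    return ["[CLS]"] + tokens[:max_tokens] + ["[SEP]"]
```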
        <p>Validation results In order to select the aforementioned parameters, the
BERT-based classifier was validated on the training dataset. For this, we used
60% of the data for training, 10% for validation and 30% for testing. Table 6
summarises the performance of the best models of each kind.</p>
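        <p>The 60/10/30 split can be obtained with two calls to scikit-learn's train_test_split; this is a sketch, and the seed and stratification are our own choices, not stated in the paper:</p>

```python
from sklearn.model_selection import train_test_split

def split_60_10_30(texts, labels, seed=42):
    """60% train / 10% validation / 30% test, as used for validating
    the BERT-based classifier."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.4, random_state=seed, stratify=labels)
    # The remaining 40% is split 10/30, i.e., 25% validation / 75% test.
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.75, random_state=seed, stratify=y_rest)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```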
        <p>Validation performances achieved with BERT are very high in both datasets
and outperform the already high F1 of the best traditional approaches,
confirming that BERT is a very powerful model. Additional experiments were performed</p>
      </sec>
      <sec id="sec-3-2">
        <title>3 See https://huggingface.co/transformers/model_doc/bert.html 4 See https://huggingface.co/transformers/main_classes/optimizer_schedules.html</title>
        <p>with balanced versions of the datasets, obtained with undersampling, but they
did not lead to further improvements. We should, nevertheless, recall that these
results may be overfitted to the training data.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Despite all the experiments performed with BERT and our positive
expectations regarding their high performance, due to a mistake in our submission5,
it is impossible for us to draw any conclusion on the real performance of the
BERT-based models and on their comparison to other approaches, including our
traditional approach, at least until the labels of the test data are revealed.</p>
      <p>As for the traditional approach, official results are in Table 7. We recall that
the model used is based on LR, with 1,500 token features plus 10 lexicon-based
features. It achieved sixth position overall in the tweets dataset, but it was not
much different from other participants, nor from our BERT submissions where
the labels were shuffled. In fact, when analysing our labels, we note that the
majority of the tweets were classified as ironic, which results in a recall close to
100%. However, precision is lower than 50%, suggesting that about half of the
tweets were not ironic and should not have been classified as such.</p>
      <p>
        Until the test labels are revealed, it is not possible to make an error
analysis. However, we believe that performances in the tweet dataset were harmed
5 More precisely, our script had a switch for shuffling the data to label, which was on
when the test data was classified, meaning that the submitted labels were not in the
expected order for the official evaluation.
by the criteria adopted in the creation of the data, which we suspect to have
diverged between training and test. For instance, in da Silva's thesis [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], one of the
criteria for labelling tweets as ironic was the presence of the hashtag #ironia,
i.e., all the tweets using this hashtag were considered to be positive examples of
ironic tweets, and could be included in the training data as such. Yet, when we
search Twitter for the tweets of the IDPT test data, all of them use the previous
hashtag, even if, according to the results, more than half were not labelled as
ironic. This was probably the result of manual analysis, and should definitely be
more accurate than da Silva's criteria, which would automatically label them as
ironic. However, this also means that the provided training data was misleading
and, as we have seen, classifiers trained on such data are not apt for correctly
labelling IDPT's tweet test data.
      </p>
      <p>On the news data, our traditional approach achieved the eighth position
overall, with a BAcc that was 12 points below the best run. The name of the
team that submitted the three best runs (TeamBERT4EVER) suggests that they
used BERT, which confirms that irony detection is one more task where BERT
is currently the way to achieve the state of the art. Nevertheless, we highlight
that, in the traditional approach, a white-box model was used, which enabled us
to learn a bit more about how irony is expressed in the datasets, e.g., important
features, most of which we would not immediately associate with irony.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have described the participation of the CISUC team in the IDPT 2021 shared
task. Despite our issues with the BERT models, the balance is still positive. This
participation led to the application of known approaches to a new challenge, it
made us think about the relevance of irony, and taught us a little bit about the
way it is expressed in Portuguese.</p>
      <p>We tackled this challenge as a text classification task and explored both
traditional and deep learning approaches. As expected, deep learning seems to be the
best path to achieve top performances, and BERT is definitely a solid model for
attempting state-of-the-art results. Still, sometimes, learning about language
and how it works is at least as important as achieving the best performances,
and white-box models are much more accessible for this purpose. Unfortunately,
until the labels of the test data are revealed, we cannot compare the
performance of the latter with our BERT-based approaches in a real scenario, and
thus cannot analyse the trade-off. The same happens to the comparison with the
runs of other teams. In the news test set, our LR-based approach was ranked
eighth, and it will definitely be interesting to learn about the other approaches,
once the proceedings of IDPT are published.</p>
      <p>Now that we have had our first contact with this topic, there are plenty of ideas for
future work. A possible direction would be studying to what extent it is possible
to learn a general classifier of irony, not suited to a specific type of text or
time-span. We did train a BERT-based model on both training datasets (tweets
and news), but it was one of the corrupt submissions. A train-validation-test run on
the previous dataset led to a surprisingly high performance, i.e., comparable
to the performance of the type-specific models. But stronger conclusions can
only be taken once we actually run the model on the test data. Still, more than
learning from two (or more) different types of text, a general classifier would
also have to be trained on texts published in different time-spans. The latter
are relevant, because classifiers will learn from the used vocabulary and, during
specific time periods, some entities (e.g., politicians, athletes, organisations) can
be or become a preferred target of irony, thus skewing the model's evaluation of
the words associated with these entities.</p>
      <p>Another interesting direction would be a deeper analysis of the actual impact
of different features, not only tokens, but also n-grams, case (upper or lower),
punctuation, emojis and lexicon-based features, among others. Besides
possibly improving the traditional approaches, some of those features could also be
appended to the inputs of the BERT-based classifiers.</p>
      <p>
        Finally, one could exploit other available corpora for irony detection, possibly
starting with a corpus of humour [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which typically resorts to irony. More data
could also be retrieved from Twitter, possibly relying on additional heuristics
for self-labelling (e.g., specific emojis). However, given the specificities of irony,
the quality of the data is especially important. Therefore, any automatically-created
dataset should be manually revised.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work is partially funded by national funds through the FCT –
Foundation for Science and Technology, I.P., within the scope of the project CISUC
– UID/CEC/00326/2020 and by the European Social Fund, through the Regional
Operational Program Centro 2020.
18. Van Hee, C., Lefever, E., Hoste, V.: SemEval-2018 task 3: Irony detection in
English tweets. In: Proceedings of The 12th International Workshop on Semantic
Evaluation. pp. 39–50 (2018)
19. Wu, C., Wu, F., Wu, S., Liu, J., Yuan, Z., Huang, Y.: THU NGN at SemEval-2018
task 3: Tweet irony detection with densely connected LSTM and multi-task
learning. In: Proceedings of 12th International Workshop on Semantic Evaluation. pp.
51–56. Association for Computational Linguistics, New Orleans, Louisiana (2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Carvalho</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarmento</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Oliveira</surname>
          </string-name>
          , E.:
          <article-title>Clues for detecting irony in user-generated contents: oh</article-title>
          ...!
          <article-title>! it's "so easy" ;-)</article-title>
          .
          <source>In: Proceedings of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion</source>
          . pp.
          <volume>53</volume>
          –
          <issue>56</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Corrêa, U.B.,
          <string-name>
            <surname>dos Santos</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coelho</surname>
          </string-name>
          , L.,
          <string-name>
            <surname>de Freitas</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          :
          <article-title>Overview of the IDPT task on Irony Detection in Portuguese at IberLEF 2021</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          . In: Procs.
          <article-title>of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</article-title>
          . pp.
          <volume>4171</volume>
          –
          <fpage>4186</fpage>
          . Association for Computational Linguistics (
          <year>Jun 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Duarte</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macedo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalo</surname>
            <given-names>Oliveira</given-names>
          </string-name>
          , H.:
          <article-title>Exploring emojis for emotion recognition in Portuguese text</article-title>
          .
          <source>In: Proceedings of 19th EPIA Conference on Artificial Intelligence</source>
          ,
          <source>EPIA</source>
          <year>2019</year>
          ,
          <string-name>
            <surname>Vila</surname>
            <given-names>Real</given-names>
          </string-name>
          ,
          <source>Portugal, September 3-6</source>
          ,
          <year>2019</year>
          , Proceedings,
          <string-name>
            <surname>Part I. LNCS</surname>
          </string-name>
          /LNAI, vol.
          <volume>11805</volume>
          , pp.
          <volume>719</volume>
          –
          <fpage>730</fpage>
          . Springer (
          <year>September 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. de Freitas,
          <string-name>
            <given-names>L.A.</given-names>
            ,
            <surname>Vanin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            ,
            <surname>Hogetop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.N.</given-names>
            ,
            <surname>Bochernitsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.N.</given-names>
            ,
            <surname>Vieira</surname>
          </string-name>
          , R.:
          <article-title>Pathways for irony detection in tweets</article-title>
          .
          <source>In: Proceedings of the 29th Annual ACM Symposium on Applied Computing</source>
          . pp.
          <volume>628</volume>
          –
          <issue>633</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Gonçalo</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.</surname>
          </string-name>
          , Clemêncio,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Corpora and baselines for humour recognition in Portuguese</article-title>
          .
          <source>In: Proceedings of 12th International Conference on Language Resources and Evaluation</source>
          . pp.
          <fpage>1278</fpage>
          -
          <lpage>{</lpage>
          1285.
          <source>LREC</source>
          <year>2020</year>
          , ELRA, Marseille, France (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Grice</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          :
          <article-title>Logic and conversation</article-title>
          . In: Speech acts, pp.
          <fpage>41</fpage>
          –
          <lpage>58</lpage>
          . Brill (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Sentiment analysis and opinion mining</article-title>
          .
          <source>Synthesis Lectures on Human Language Technologies</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>167</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing a word-emotion association lexicon</article-title>
          .
          <source>Computational Intelligence</source>
          <volume>29</volume>
          (
          <issue>3</issue>
          ),
          <fpage>436</fpage>
          –
          <lpage>465</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Potamias</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siolas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stafylopatis</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          :
          <article-title>A transformer-based approach to irony and sarcasm detection</article-title>
          .
          <source>Neural Computing and Applications</source>
          <volume>32</volume>
          (
          <issue>23</issue>
          ),
          <fpage>17309</fpage>
          –
          <lpage>17320</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolton</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Stanza: A Python natural language processing toolkit for many human languages</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Rohanian</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taslimipoor</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitkov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>WLV at SemEval-2018 task 3: Dissecting tweets in search of irony</article-title>
          .
          <source>In: Proceedings of 12th International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>553</fpage>
          –
          <lpage>559</lpage>
          . Association for Computational Linguistics, New Orleans, Louisiana (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sarmento</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carvalho</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Oliveira</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Automatic creation of a reference corpus for political opinion mining in user-generated content</article-title>
          .
          <source>In: Proceedings of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion</source>
          . pp.
          <fpage>29</fpage>
          –
          <lpage>36</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>da Silva</surname>
            ,
            <given-names>F.R.A.</given-names>
          </string-name>
          :
          <article-title>Detecção de ironia e sarcasmo em língua portuguesa: Uma abordagem utilizando deep learning</article-title>
          .
          <source>Tech. rep.</source>
          , Universidade Federal de Mato Grosso (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Costa</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Machado</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Comesaña</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          :
          <article-title>The Minho Word Pool: Norms for imageability, concreteness, and subjective frequency for 3,800 Portuguese words</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>49</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1065</fpage>
          –
          <lpage>1081</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nogueira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotufo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>BERTimbau: Pretrained BERT models for Brazilian Portuguese</article-title>
          .
          <source>In: Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS 2020)</source>
          . LNCS, vol.
          <volume>12319</volume>
          , pp.
          <fpage>403</fpage>
          –
          <lpage>417</lpage>
          . Springer, Cham (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>