=Paper=
{{Paper
|id=Vol-3180/paper-195
|storemode=property
|title=Irony and Stereotype Spreading Author Profiling on Twitter using Machine Learning: A BERT-TFIDF based Approach
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-195.pdf
|volume=Vol-3180
|authors=Amit Das,Nilanjana Raychawdhary,Gerry Dozier,Cheryl D. Seals
|dblpUrl=https://dblp.org/rec/conf/clef/DasRDS22
}}
==Irony and Stereotype Spreading Author Profiling on Twitter using Machine Learning: A BERT-TFIDF based Approach==
Amit Das, Nilanjana Raychawdhary, Gerry Dozier and Cheryl D. Seals
Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
azd0123@auburn.edu (A. Das); nzr0044@auburn.edu (N. Raychawdhary); doziegv@auburn.edu (G. Dozier); sealscd@auburn.edu (C. D. Seals)

Abstract

In this paper we introduce our system for the PAN 2022 IROSTEREO task of determining whether an author spreads irony and stereotypes in English tweets. For the irony-spreading author classification task, 600 authors with 200 tweets each were used. What makes the task distinctive is that it is not a classification of ironic versus non-ironic tweets, but a classification of irony-spreading versus non-irony-spreading authors. The task also contains a subtask that addresses stereotype stance detection. In previous years, several representation methods, such as character/word n-grams, were used to represent tweets, but it was unclear whether combining them with other representations would help. To this end, we introduce a BERT representation combined with TFIDF to address this problem, and then use a Logistic Regression classifier for the classification task. The BERT representation combined with TFIDF showed very promising results.

Keywords

Irony detection, Author profiling, Natural language processing, Twitter data

1. Introduction

Irony is a deeply pragmatic and diverse linguistic phenomenon that has been thoroughly explored in numerous domains [1]. Irony detection has recently gained a lot of interest in the machine learning and NLP communities due to the high frequency of sarcastic expressions in social media [2]. In the context of sentiment analysis, ironic expressions tend to flip polarity, making machine-based irony identification important [3] [4]. The goal of irony detection is to develop computational algorithms that automatically recognize this phenomenon in written language [5] [6] [7]. Several researchers have attempted the irony detection problem, according to the literature [8] [9]. Many of these efforts have been devoted to the examination of textual representations and features [5] [6] [10].

PAN 2022's irony detection task focuses on characterizing irony and stereotype spreaders on Twitter. The work centers on profiling ironic authors, with special attention to authors who use irony to disseminate prejudices, for example about women or the LGBT community [11] [12]. The task's purpose is to categorize authors as ironic or not based on how many of their tweets have ironic content [13]. A subgroup of those authors, who use irony to express stereotypes, is studied in order to see whether state-of-the-art models can differentiate these cases as well. As a result, given a list of Twitter users and their tweets, the purpose is to identify those who can be classified as ironic.

A subtask of the task deals with stereotype stance detection. Ironic authors have used stereotypes to harm the target (such as immigrants) or to support it in some way.
This subtask's objective is to determine whether sarcastic authors are using stereotypes to support or undermine the target, i.e. to identify their overall stance given the subset of sarcastic authors who used stereotypes in some of their tweets. Participants used the TIRA platform to assess their approaches; this platform can deploy and test applications automatically [14]. The algorithms are assessed using a common test dataset, the same metrics, and the amount of time required to generate the response.

Irony is a profoundly pragmatic and versatile linguistic phenomenon. Because its foundations usually lie beyond explicit linguistic patterns, in reconstructing contextual dependencies and latent meaning such as shared or common knowledge [2], automatically detecting it remains a challenging task in natural language processing. In this paper, we use two representations to address this issue: 1) BERT and 2) Term Frequency Inverse Document Frequency (TFIDF) combined with BERT. The classification task was then implemented using a Logistic Regression classifier.

2. Related Work

Irony detection is a very challenging task that has seen substantial development over the years. Some recent research works that contribute to the problem are summarized below.

Identifying the important components for recognizing irony in English customer evaluations has been the focus of Reyes and Rosso [15]. To reflect irony, they used six categories in their model: n-grams, POS n-grams, funny profiling, positive/negative profiling, affective profiling, and pleasantness profiling. Customers' online reviews were chosen as part of the dataset [16]. They employed three distinct classifiers, which showed very competitive performance.

The automatic detection of irony was framed as a classification problem by Barbieri and Saggion [17]. They created a model that could detect irony on the social network Twitter using linguistic variables such as frequency, written/spoken contrasts, attitudes, ambiguity, intensity, synonymy, and structure. Nayel et al. [16] picked tweets with the hashtag #irony and a few other subjects to generate a linguistically motivated set of features. According to their findings, their model outperformed the bag-of-words technique across domains.

Teh et al.'s [18] investigation focused on the use of coarse language for the detection of hate speech. Based on the use of profanity, the authors divided 500 YouTube comments into 8 different categories of hate speech. Numerous other studies of a similar nature focused on identifying hate speech [18] [19], social media abuse [20] [21], fake news on Twitter [22], and cyberbullying [19]. On profiling authors based on their tweets, numerous publications and shared tasks are available [22] [23] [24].

A model for irony detection in Twitter, emotIDM [25], was developed by formulating the task as a classification problem. It was evaluated on a set of representative Twitter corpora that included samples of ironic and non-ironic messages differing along various dimensions such as size, balanced vs. imbalanced distribution, and collection methodology and criteria [16]. Results showed good classification performance. KLUEnicorn [26] offered a system that used a Naive Bayes classifier and built word embeddings using several adverb categories and named entities, as well as semantic and lexical data.
Various supervised classification techniques, such as the Randomizable Filtered Classifier (RFC), Bayesian Network (BayesNet), IBk, and others, were reviewed and compared in another comprehensive review [27].

To improve stereotype stance detection, Mohammad et al. [28] looked into the significance of utilizing the sentiment expressed in a text. Without taking the target into account, the overall sentiment expressed in each instance was annotated in the SemEval-2016 Task 6 dataset. They used n-grams, char-grams, and sentiment features from several lexica, including the Hu and Liu lexicon [29], EmoLex [30], and the MPQA Subjectivity Lexicon [31]. Additionally, they took into account the presence or absence of the target of interest in the tweet, as well as the frequency of part-of-speech tags, emoticons, hashtags, uppercase letters, lengthened phrases, and punctuation. They were able to outperform the competition by combining these features with a support vector machine classifier.

To predict authors' ages using a Maximum Entropy classifier and LASSO regression, Hong et al. [32] combined numerous datasets, including the Fisher English transcripts and the Blog Authorship corpus, to create a dataset with a variety of stylistic and content-based variables. With the exception of higher age limits, both models produced good results.

To detect users with different perspectives with regard to stance detection in tweets, Rajadesingan and Liu [33] employed a semi-supervised framework in conjunction with a supervised classifier. The authors took advantage of retweet-based label propagation, which rests on the observation that if many users retweet a specific pair of tweets within a reasonable amount of time, it is quite likely that the two tweets are related in some way. Based on how closely a tweet aligns with the labels surrounding it, they categorized it as "for" or "against".

A label propagation technique was employed for community discovery in the work of Raghavan et al. [34]. Their method was exceptionally straightforward and effective: in their iterative procedure, each node adopts the label that the majority of its immediate neighbors currently have, and it appeared to perform exceptionally well in unsupervised settings.

There are newer methods for fine-tuning language models to obtain meaningful sentence embeddings [35] [36]. Using the universal sentence encoder, TFIDF, and a support vector machine for the case law retrieval challenge in the last COLIEE edition, Rabelo et al. [37] outperformed many of the other models. We therefore expect that TFIDF in conjunction with a BERT representation could also be effective for the task of identifying ironic authors. The next section describes the datasets used in this research.

3. Dataset

For the irony detection task, the dataset contained tweets of 600 authors, each with 200 tweets. It was split into two parts: 1) a validation dataset containing the tweets of 420 authors and 2) a test dataset containing the tweets of 180 authors. The validation dataset is balanced (50% irony and 50% non-irony authors) and contains 84,000 tweets in total (420 authors with 200 tweets each); this dataset is used for training. For testing, 180 authors with 200 tweets each are used. The training set is balanced, i.e. out of 420 authors, 210 are irony spreaders and 210 are not. The details of the dataset are shown in Table 1.
Table 1: Irony and non-irony spreading author dataset

Data split | No. of authors | Tweets per author
Training   | 420            | 200
Testing    | 180            | 200

For the stereotype stance detection subtask, the dataset contained tweets of 200 authors, each with 200 tweets. It was again split into two parts: 1) a validation dataset containing the tweets of 140 authors and 2) a test dataset containing the tweets of 60 authors. The validation dataset is imbalanced and contains 28,000 tweets (140 authors with 200 tweets each); this dataset is used for training. For testing, 60 authors with 200 tweets each are used. The goal of this subtask is to detect the stance with which stereotypes are used by ironic authors, i.e. in favour of or against the target. Table 2 shows the details of the dataset. The training set is imbalanced, i.e. out of 140 authors, 94 are AGAINST and 46 are INFAVOR.

Table 2: Stereotype spreading author dataset

Data split | No. of authors | Tweets per author
Training   | 140            | 200
Testing    | 60             | 200

4. Methods

In this work we implement the following method for tweet representation: BERT combined with TFIDF. We detail each of the feature spaces below.

4.1. BERT

In this section we describe BERT and how we use it. The design of neural encoders for natural language sequences was changed by the Transformer [38], a sequence transduction model based on attention mechanisms; the Transformer architecture allows sequential data to be learned. To improve on largely unidirectional language model training, Devlin et al. [39] developed Bidirectional Encoder Representations from Transformers (BERT). BERT makes deep bidirectional language encoding achievable by employing the masked language modeling (MLM) loss [40]. BERT also employs next-sentence prediction (NSP), an additional pre-training loss that aims to learn high-level linguistic coherence by predicting whether or not two text segments follow each other in the original text [40].

A sentence has to be tokenized before its embeddings can be created. Note that BERT can only handle sequences of up to 512 tokens. BERT's authors advise using the BERT Base Uncased model in most cases, unless it is clear that a case-sensitive model will benefit the task [41]. BERT is trained on, and expects, sentence pairs, using 1s and 0s to distinguish the two sentences [41]. That is, we must indicate whether each token in the tokenized text belongs to sentence 0 (a series of 0s) or sentence 1 (a series of 1s). Since single-sentence inputs only require a series of 1s, we constructed a vector of 1s for each token in our input sentence [41]. We then converted our data to torch tensors and called the BERT model. The complete set of hidden states for this model is indexed by the layer number (13 layers), the batch number (1 sentence), the word/token number (e.g. 22 tokens in our example sentence), and the hidden unit/feature number (768 features). The layer number is 13 because the first element holds the input embeddings and the remaining elements are the outputs of each of BERT's 12 layers [41]. The batch size, the second dimension, is used when sending several sentences to the model at once; here there is one batch in total. For each token of our input, we thus had 13 separate vectors, each of length 768. We concatenated the final four layers to produce a word vector of length 3072 (4 × 768 = 3072) per token. We also calculated the average of the second-to-last hidden layer over all tokens to produce a single vector of length 768 for the complete text [41].
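The extraction just described follows McCormick and Ryan's tutorial [41]. Below is a minimal sketch of that procedure, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; all function and variable names are ours for illustration, not the authors' code.

```python
# A minimal sketch of the BERT feature extraction described above, assuming
# the Hugging Face `transformers` library and bert-base-uncased.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def bert_embedding(text: str) -> torch.Tensor:
    """Return one 768-dim vector: the token-average of the
    second-to-last hidden layer, as described in the text."""
    # BERT accepts at most 512 tokens, so longer inputs are truncated.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: tuple of 13 tensors (input embeddings + 12
    # layers), each of shape (batch=1, num_tokens, 768).
    second_to_last = outputs.hidden_states[-2].squeeze(0)  # (num_tokens, 768)
    return second_to_last.mean(dim=0)                      # (768,)

def token_vectors_concat4(text: str) -> torch.Tensor:
    """The 3072-dim per-token variant: concatenate the last four layers."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    last_four = outputs.hidden_states[-4:]            # 4 x (1, num_tokens, 768)
    return torch.cat(last_four, dim=-1).squeeze(0)    # (num_tokens, 3072)
```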
4.2. TFIDF

We apply the well-known Term Frequency Inverse Document Frequency (TFIDF) weighting scheme in our methodology to extract traditional features. TFIDF is a combination of two terms: Term Frequency (TF) and Inverse Document Frequency (IDF) [42]. TF measures the frequency of a term in a document [43]. The term frequency for a term $t$ and a document $d$ is defined by

$tf_{d,t} = \frac{n_{d,t}}{|d|}$   (1)

where $n_{d,t}$ is the number of occurrences of the term $t$ in the document $d$. The term frequency $tf_{d,t}$ is thus the number of occurrences of the term $t$ in document $d$ divided by the total number of tokens in the document. The inverse document frequency of a term $t$ in the whole collection is

$idf_t = \log \frac{|D|}{|\{d : t \in d\}|}$   (2)

where $|D|$ is the number of documents in the collection and $|\{d : t \in d\}|$ is the number of documents in which the term $t$ appears. When using term frequency alone, all keywords are weighted equally, regardless of whether they are stop words or not, which is inappropriate because keywords vary in relevance [43]. The inverse document frequency gives less weight to frequently occurring words and more weight to infrequently occurring terms [43]. Mathematically, TFIDF is the product of term frequency (TF) and inverse document frequency (IDF). The TFIDF of a term $t$ in document $d$ is computed as

$tfidf_{d,t} = tf_{d,t} \cdot idf_t = \frac{n_{d,t}}{|d|} \cdot \log \frac{|D|}{|\{d : t \in d\}|}$   (3)

TFIDF's purpose is to lessen the impact of less informative tokens that appear frequently in a data corpus [44]. We used the TfidfVectorizer feature extractor from scikit-learn to perform the TFIDF task [45]. Table 3 shows the TFIDF parameter values used for the tasks.

Table 3: TFIDF parameters

Task                                  | TFIDF_max_df | TFIDF_min_df
Irony spreading author profiling      | 0.70         | 1
Stereotype spreading author profiling | 0.95         | 1

4.3. BERT combined with TFIDF

Sentence-BERT, which surpasses previous embedding techniques and is considered effective for numerous downstream applications, was introduced by Reimers et al. [36]. TFIDF evaluates how relevant a word is to a document in a collection of documents, and the TFIDF scores can be combined with the BERT representation to improve predictive performance. To produce a deeper and more insightful quantitative representation of the data, we combined TFIDF with the BERT word embeddings, restricting the TFIDF vocabulary to fewer than 1,000 words. The idea is to keep the grammatical regularities in each document intact.

4.4. Classifier

A Logistic Regression classifier is used to classify the irony and stereotype spreading authors. Logistic Regression uses the logistic function to model a binary dependent variable:

$P = \frac{e^{a+bx}}{1 + e^{a+bx}}$   (4)

We used the LogisticRegression class from the scikit-learn library to implement the Logistic Regression model [46]. Figure 1 shows the architecture of our proposed model. After splitting the dataset into training and testing sets, the texts are first tokenized and encoded using the BERT representation. The TFIDF features are then combined with it to make the training representation richer. Lastly, training and prediction are done with the Logistic Regression classifier. For the vectorization, every word is assigned a unique number, and each text is transformed into an N-dimensional vector, where N is the number of words in the text.

Figure 1: Architecture of the proposed method.
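Before moving to the results, here is a minimal sketch of the pipeline in Sections 4.2 to 4.4, assuming scikit-learn. How the two representations are "combined" is not spelled out in the text; simple feature concatenation is one plausible reading and is what this sketch does. `bert_embedding` is the helper sketched in Section 4.1; all other names are illustrative.

```python
# A minimal sketch of the BERT-TFIDF + Logistic Regression pipeline
# (Sections 4.2-4.4), assuming scikit-learn. Feature concatenation is our
# reading of "combined", not a confirmed detail of the authors' code.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Table 3 values for the irony task; max_features=1000 reflects the
# <1000-word vocabulary threshold mentioned in Section 4.3 (our reading).
tfidf = TfidfVectorizer(max_df=0.70, min_df=1, max_features=1000)

def featurize(author_texts, fit=False):
    """Concatenate each author's TFIDF vector with their 768-dim BERT vector."""
    if fit:
        x_tfidf = tfidf.fit_transform(author_texts).toarray()
    else:
        x_tfidf = tfidf.transform(author_texts).toarray()
    x_bert = np.vstack([bert_embedding(t).numpy() for t in author_texts])
    return np.hstack([x_tfidf, x_bert])

# train_texts / test_texts: one string per author (all 200 tweets joined);
# y_train: 0/1 irony labels, from the PAN 2022 data. Usage outline:
# X_train = featurize(train_texts, fit=True)
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# predictions = clf.predict(featurize(test_texts))
```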
In the following section we explain the evaluation results of all the models on the validation dataset and the test dataset.

5. Results & Discussion

To understand the efficiency of our models, we first split the validation data into a training and a testing set: out of 420 authors, 336 were used for training and the remaining 84 for testing. The split was balanced, i.e. in both the training and testing data, 50% of the authors were irony spreaders and 50% were not. We concatenated all 200 tweets of each author and treated them as a single string, which produced 420 strings for the 420 authors, each string containing that author's 200 tweets. Initially we were only concerned with the accuracy of different machine learning models on the BERT representation alone. The strings were converted to numeric vectors using BERT. We implemented the following five machine learning algorithms: KNN, SVM, Decision Tree, Naive Bayes and Logistic Regression; Logistic Regression proved to be the best in terms of efficiency and accuracy. To measure the accuracy of an algorithm, we used the formula in Equation 5:

$Accuracy = \frac{TrueLabel}{TrueLabel + FalseLabel}$   (5)

where TrueLabel refers to correct predictions and FalseLabel refers to incorrect ones. The classification results obtained from the five algorithms on the validation dataset are given in Table 4.

Table 4: ML algorithms implemented on the ironic author profiling validation dataset

Algorithm           | Accuracy (%)
KNN                 | 89.2
SVM                 | 88
Decision Tree       | 77.3
Naive Bayes         | 91.6
Logistic Regression | 92.8

The BERT method was then applied to the test dataset (tweets of 180 authors) after training on the validation dataset (tweets of 420 authors). The classification results were not very promising. However, after combining the BERT representation with TFIDF, accuracy improved from 38% to 67%, a relative increase of around 76%. The classification results on the test dataset are shown in Table 5.

Table 5: Accuracy on the ironic author profiling test dataset

Representation | Classifier          | Accuracy (%)
BERT           | Logistic Regression | 38
BERT-TFIDF     | Logistic Regression | 67

A similar method was used to address the subtask of stereotype stance detection, i.e. whether ironic authors use stereotypes in favour of or against the target. Unlike the dataset used for ironic author classification, the dataset used for stereotype stance detection was smaller and also imbalanced. The classification result of the stereotype stance detection problem using our model is shown in Table 6. We obtained an overall macro F1 score of 0.45 and an F1 score of 0.19 for the INFAVOR class.

Table 6: Accuracy on the stereotype stance detection test dataset

Representation | Classifier          | Accuracy (%)
BERT-TFIDF     | Logistic Regression | 58

The BERT representation alone was not sufficient to achieve high classification results. TFIDF down-weights frequent but uninformative tokens, thereby increasing the relative impact of the more informative ones; when it was combined with the BERT representation, the accuracy of the model improved significantly.
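For reference, the five-model comparison of Table 4 can be reproduced in outline as follows; this is a minimal sketch using scikit-learn defaults, since the paper does not report hyperparameters, with random placeholder features standing in for the real BERT author vectors.

```python
# A minimal sketch of the five-classifier comparison on the validation split
# (336 training / 84 testing authors). Defaults are assumptions; the random
# features are placeholders for the BERT vectors of Section 4.1.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(420, 768))    # placeholder: one BERT vector per author
y = rng.integers(0, 2, size=420)   # placeholder: irony / non-irony labels

# An 80/20 split of 420 authors yields the 336/84 partition described above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    # GaussianNB is an assumption: the paper does not name the Naive Bayes
    # variant, and multinomial NB would require non-negative features.
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name}: {accuracy_score(y_te, clf.predict(X_te)):.3f}")
```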
When we compare the classification accuracies of the two tasks, the accuracy on the irony detection problem was higher than on the stereotype stance detection problem, probably because of the larger training and testing datasets. It was very interesting to see the usefulness of this model on both balanced and imbalanced datasets.

6. Conclusion

In this paper we presented our method for the PAN 2022 Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) task, which addresses the problem of detecting irony-spreading authors in Twitter data. The task also contained a subtask addressing stereotype stance detection, i.e. detecting the stance with which stereotypes are used by ironic authors, in favour of or against the target. To address both tasks, we first implemented the BERT method and then BERT combined with TFIDF to represent the tweets, and then used a Logistic Regression classifier to classify irony-spreading and non-irony-spreading authors. The BERT method converted the text data into equivalent numeric data, and the TFIDF emphasized the more important tokens. Combining the BERT representation with TFIDF significantly improved the results. To conclude, we have shown some useful techniques for classifying irony and stereotype spreaders. How this model behaves on different types of datasets is a future direction to explore.

References

[1] E. Marrese-Taylor, S. Ilic, J. A. Balazs, Y. Matsuo, H. Prendinger, IIIDYT at SemEval-2018 Task 3: Irony detection in English tweets, arXiv preprint arXiv:1804.08094 (2018).
[2] A. Joshi, P. Bhattacharyya, M. J. Carman, Automatic sarcasm detection: A survey, ACM Computing Surveys (CSUR) 50 (2017) 1–22.
[3] S. Poria, E. Cambria, D. Hazarika, P. Vij, A deeper look into sarcastic tweets using deep convolutional neural networks, arXiv preprint arXiv:1610.08815 (2016).
[4] C. Van Hee, E. Lefever, V. Hoste, Guidelines for annotating irony in social media text, version 2.0, LT3 Technical Report Series (2016).
[5] D. Davidov, O. Tsur, A. Rappoport, Semi-supervised recognition of sarcasm in Twitter and Amazon, in: Proceedings of the Fourteenth Conference on Computational Natural Language Learning, 2010, pp. 107–116.
[6] T. Veale, Y. Hao, Detecting ironic intent in creative comparisons, in: ECAI 2010, IOS Press, 2010, pp. 765–770.
[7] R. González-Ibánez, S. Muresan, N. Wacholder, Identifying sarcasm in Twitter: a closer look, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 581–586.
[8] A. Reyes, P. Rosso, On the difficulty of automatically detecting irony: beyond a simple case of negation, Knowledge and Information Systems 40 (2014) 595–614.
[9] A. Reyes, P. Rosso, T. Veale, A multidimensional approach for detecting irony in Twitter, Language Resources and Evaluation 47 (2013) 239–268.
[10] A. Ghosh, G. Li, T. Veale, P. Rosso, E. Shutova, J. Barnden, A. Reyes, SemEval-2015 Task 11: Sentiment analysis of figurative language in Twitter, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015, pp. 470–478.
[11] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl, R. Ortega-Bueno, P. Pęzik, M. Potthast, et al., Overview of PAN 2022: Authorship verification, profiling irony and stereotype spreaders, style change detection, and trigger detection, in: European Conference on Information Retrieval, Springer, 2022, pp. 331–338.
[12] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl, R. Ortega-Bueno, P. Pezik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection, in: A. Barrón-Cedeño, G. Da San Martino, et al. (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science, Springer, 2022.
[13] O.-B. Reynier, C. Berta, R. Francisco, R. Paolo, F. Elisabetta, Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2022.
[14] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[15] A. Reyes, P. Rosso, Mining subjective knowledge from customer reviews: A specific case of irony detection, in: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011), 2011, pp. 118–124.
[16] H. A. Nayel, W. Medhat, M. Rashad, Benha@IDAT: Improving irony detection in Arabic tweets using ensemble approach, in: FIRE (Working Notes), 2019, pp. 401–408.
[17] F. Barbieri, H. Saggion, Modelling irony in Twitter, in: Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 56–64.
[18] P. L. Teh, C.-B. Cheng, W. M. Chee, Identifying and categorising profane words in hate speech, in: Proceedings of the 2nd International Conference on Compute and Data Analysis, 2018, pp. 65–69.
[19] Y. Chen, Detecting offensive language in social medias for protection of adolescent online safety (2011).
[20] E. Chandrasekharan, U. Pavalanathan, A. Srinivasan, A. Glynn, J. Eisenstein, E. Gilbert, You can't stay here: The efficacy of Reddit's 2015 ban examined through hate speech, Proceedings of the ACM on Human-Computer Interaction 1 (2017) 1–22.
[21] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive language detection in online user content, in: Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 145–153.
[22] F. Rangel, A. Giachanou, B. H. H. Ghanem, P. Rosso, Overview of the 8th author profiling task at PAN 2020: Profiling fake news spreaders on Twitter, in: CEUR Workshop Proceedings, volume 2696, Sun SITE Central Europe, 2020, pp. 1–18.
[23] F. Rangel, P. Rosso, M. Montes-y Gómez, M. Potthast, B. Stein, Overview of the 6th author profiling task at PAN 2018: multimodal gender identification in Twitter, Working Notes Papers of the CLEF (2018) 1–38.
[24] B. G. Patra, K. G. Das, D. Das, Multimodal author profiling for Twitter, Notebook for PAN at CLEF (2018).
[25] D. I. Hernandez Farias, V. Patti, P. Rosso, Irony detection in Twitter: The role of affective content, ACM Transactions on Internet Technology 16 (2016).
[26] L. Dürlich, KLUEnicorn at SemEval-2018 Task 3: A naive approach to irony detection, in: Proceedings of The 12th International Workshop on Semantic Evaluation, 2018, pp. 607–612.
[27] U. B. Baloglu, B. Alatas, H. Bingol, Assessment of supervised learning algorithms for irony detection in online social media, in: 2019 1st International Informatics and Software Engineering Conference (UBMYK), IEEE, 2019, pp. 1–5.
[28] S. M. Mohammad, P. Sobhani, S. Kiritchenko, Stance and sentiment in tweets, ACM Transactions on Internet Technology (TOIT) 17 (2017) 1–23.
[29] M. Hu, B. Liu, Mining and summarizing customer reviews, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168–177.
[30] S. M. Mohammad, P. D. Turney, Crowdsourcing a word–emotion association lexicon, Computational Intelligence 29 (2013) 436–465.
[31] T. Wilson, J. Wiebe, P. Hoffmann, Recognizing contextual polarity in phrase-level sentiment analysis, in: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005, pp. 347–354.
[32] J. Hong, C. A. Mattmann, P. Ramirez, Ensemble maximum entropy classification and linear regression for author age prediction, in: 2017 IEEE International Conference on Information Reuse and Integration (IRI), IEEE, 2017, pp. 509–516.
[33] A. Rajadesingan, H. Liu, Identifying users with opposing opinions in Twitter debates, in: International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, Springer, 2014, pp. 153–160.
[34] U. N. Raghavan, R. Albert, S. Kumara, Near linear time algorithm to detect community structures in large-scale networks, Physical Review E 76 (2007) 036106.
[35] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al., Universal sentence encoder for English, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 169–174.
[36] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019).
[37] J. Rabelo, M.-Y. Kim, R. Goebel, M. Yoshioka, Y. Kano, K. Satoh, COLIEE 2020: methods for legal document retrieval and entailment, in: JSAI International Symposium on Artificial Intelligence, Springer, 2020, pp. 196–210.
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[39] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[40] H. Choi, J. Kim, S. Joe, Y. Gwon, Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 5482–5487.
[41] C. McCormick, N. Ryan, BERT word embeddings tutorial, URL: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial (2019).
[42] S. Qaiser, R. Ali, Text mining: use of TF-IDF to examine the relevance of words to documents, International Journal of Computer Applications 181 (2018) 25–29.
[43] A. A. Hakim, A. Erwin, K. I. Eng, M. Galinium, W. Muliady, Automated document classification for news article in Bahasa Indonesia based on term frequency inverse document frequency (TF-IDF) approach, in: 2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), IEEE, 2014, pp. 1–4.
[44] A. Gaydhani, V. Doma, S. Kendre, L. Bhagwat, Detecting hate speech and offensive language on Twitter using machine learning: An n-gram and TFIDF based approach, arXiv preprint arXiv:1809.08651 (2018).
[45] F. Pedregosa, et al., sklearn.feature_extraction.text.TfidfVectorizer (2013).
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.