=Paper=
{{Paper
|id=Vol-2421/IroSvA_overview
|storemode=property
|title=Overview of the Task on Irony Detection in Spanish Variants
|pdfUrl=https://ceur-ws.org/Vol-2421/IroSvA_overview.pdf
|volume=Vol-2421
|authors=Reynier Ortega-Bueno,Francisco Rangel,Delia Irazú Hernández Farías,Paolo Rosso,Manuel Montes-y-Gómez,José Medina-Pagola
|dblpUrl=https://dblp.org/rec/conf/sepln/BuenoRFRMM19
}}
==Overview of the Task on Irony Detection in Spanish Variants==
Overview of the Task on Irony Detection in Spanish Variants

Reynier Ortega-Bueno (1), Francisco Rangel (2,3), Delia Irazú Hernández Farías (4), Paolo Rosso (3), Manuel Montes-y-Gómez (4), and José E. Medina-Pagola (5)

1 Center for Pattern Recognition and Data Mining, University of Oriente, Cuba. reynier.ortega@cerpamid.co.cu
2 Autoritas Consulting, S.A., Spain. francisco.rangel@autoritas.es
3 PRHLT Research Center, Universitat Politècnica de València, Spain. prosso@dsic.upv.es
4 Laboratorio de Tecnologías del Lenguaje, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico. dirazuherfa@inaoep.mx, mmontesg@inaoep.mx
5 University of Informatics Science, Havana, Cuba. jmedinap@uci.cu

Abstract. This paper introduces IroSvA, the first shared task fully dedicated to identifying the presence of irony in short messages (tweets and news comments) written in three different variants of Spanish. Given a message, automatic systems must recognize whether it is ironic or not. Moreover, unlike previous tasks on irony detection, the messages are not considered as isolated texts but together with a given context (e.g. a headline or a topic). The task comprises three subtasks: i) irony detection in tweets from Spain, ii) irony detection in Mexican tweets, and iii) irony detection in news comments from Cuba. These subtasks aim at studying the way irony changes across distinct Spanish variants. We received 14 submissions from 12 teams. Participating systems were evaluated against the test datasets using macro-averaged F1. The highest classification scores obtained for the three subtasks are F1=0.7167, F1=0.6803, and F1=0.6596, respectively.

Keywords: Irony Detection · Spanish Variants · Spanish datasets · Cross-variant

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.

1 Introduction

From its origins in Ancient Greece to the present day, irony has been a complex, controversial, and intriguing issue. It has been studied from many disciplines such as philosophy, psychology, rhetoric, pragmatics, and semantics. However, irony is not confined to specialized theoretical discussion; the phenomenon appears in everyday conversation. As human beings, we appeal to irony to express effectively something different from what we literally utter. Thus, understanding ironic speech requires a more complex set of cognitive and linguistic abilities than direct, literal speech [1]. Although the term seems familiar to all of us, the mechanics underlying ironic communication remain a challenging issue. The benefits of detecting and understanding irony computationally have allowed irony to overstep its theoretical and philosophical perspective and attract the attention of both artificial intelligence researchers and practitioners [64]. Although a well-established definition of irony is still lacking in the literature, many authors appear to agree on two points: i) when using irony, the author does not intend to communicate what she appears to be putting forward; the real meaning is evoked implicitly and differs from what she utters; and ii) irony is closely connected with the expression of a feeling, emotion, attitude, or evaluation [20,27,57].
Due to its nature, irony has important implications for natural language processing tasks, particularly those that require semantic processing. A representative case is the well-known task of sentiment analysis, which aims at automatically assessing the underlying sentiments expressed in a text [42,50]. Interesting evidence about the impact of irony on sentiment analysis has been widely discussed in [7,26,30,44,58]. Systems dedicated to sentiment analysis struggle when facing ironic texts because the intended meaning is expressed implicitly. Taking into account words and statistical information derived from the text is not enough to deal with the sentiment expressed when ironic devices are used for communication purposes. Therefore, systems need to draw on contextual, commonsense, and world knowledge to disentangle the right meaning. Indeed, in sentiment analysis irony plays the role of an "implicit valence shifter", and ignoring it causes an abrupt drop in systems' accuracy [58].

Automatic irony detection has gained popularity and importance in the research community, with special attention paid to social media content in English. Several shared tasks have been proposed to tackle this issue, such as SemEval 2018 Task 3 [63], SemEval 2015 Task 11 [21], and the PAKDD 2016 contest (https://pakdd16.wordpress.fos.auckland.ac.nz/technical-program/contests/). Parallel tasks have also been proposed for addressing irony in Italian: the SentiPOLC tasks at EVALITA in 2014 [5] and 2016 [3], and the IronITA task at EVALITA 2018 [15]. However, for Spanish the availability of datasets is scarce, which limits the amount of research done for this language.

In this sense, we propose a new task, IroSvA (Irony Detection in Spanish Variants), which aims at investigating whether a short message, written in Spanish, is ironic or not with respect to a given context. In particular, we aim at studying the way irony changes across distinct Spanish variants.

1.1 Task Description

The task consists in automatically classifying short messages from Twitter and news comments for irony. It is structured in three independent subtasks:

- Subtask A: Irony detection in Spanish tweets from Spain.
- Subtask B: Irony detection in Spanish tweets from Mexico.
- Subtask C: Irony detection in Spanish news comments from Cuba.

The three subtasks are centered on the same objective: systems should determine whether a message is ironic or not according to a specified context (by assigning a binary value, 1 or 0). The following examples present an ironic and a non-ironic tweet from Spanish users, respectively.

Given the context: The politician of the Podemos party, Pablo Iglesias, appears on the El Hormiguero TV program teaching Spanish people to change baby diapers (PañalesIglesias).

1) (Sp.) @europapress Pues resulta que @Pablo Iglesias es el primer papá que cambia pañales
(En.) @europapress It seems that @Pablo Iglesias is the first daddy that changes baby diapers.

2) (Sp.) Como autónomo, sin haber disfrutado prácticamente de días de baja cuando nacieron mis hijos, y habiendo cambiado muchos más pañales que tú, te digo: eres tonto.
(En.) As a self-employed person, having enjoyed practically no days off when my children were born, and having changed many more diapers than you, I tell you: you are stupid.
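To make the input/output contract concrete, here is a toy sketch of the interface a system must implement; the field names are illustrative, not the official corpus format:

```python
# Toy illustration of the task interface; field names are our own
# assumption, not the official corpus format.
example = {
    "context": "PañalesIglesias",  # the topic/headline given as context
    "message": "Pues resulta que @Pablo Iglesias es el primer papá "
               "que cambia pañales",
}

def predict_irony(instance: dict) -> int:
    """Placeholder system: a real classifier would use both fields."""
    return 1  # 1 = ironic, 0 = non-ironic

print(predict_irony(example))
```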
The main difference with respect to previous tasks on irony detection, such as SemEval 2018 Task 3 and IronITA 2018, is that messages are not considered as isolated texts but together with a given context (e.g. a headline or a topic). In fact, context is often mandatory for understanding the underlying meaning of ironic texts. This task provided a first dataset manually annotated for irony in Spanish social media and news comments.

Additionally, and unofficially, participants were asked to evaluate their systems in a cross-variant setting, that is, to test each trained model on the test datasets of the other two variants: for example, to train a model on the Mexican dataset and validate it on the Spanish and Cuban datasets (and so on for the rest). The participants were allowed to submit one run for each subtask (exceptionally, two runs). No distinction between constrained and unconstrained systems was made, but the participants were asked to report which additional resources and corpora they used for each submitted run.

2 Automatic Irony Detection

With the increasing use of social media, user-generated content on those platforms has been considered an interesting source of data for studying the use of irony. Data coming from different platforms, such as Amazon reviews [17], comments from debate sites such as 4forums.com [43], Reddit [66], and Twitter (without doubt the most exploited one [35]), have been considered in order to detect irony. This interesting and challenging task has usually been tackled as a binary classification problem.

Automatic irony detection has been addressed from different perspectives. Exploiting textual features from the text on its own (such as n-grams, punctuation marks, and part-of-speech labels, among others) has been widely used for irony detection [12,16,25,41,53]. Irony is strongly related to subjective aspects, so some approaches have been proposed to take advantage of affective information [4,27,29,57]. In a similar fashion, in [67] the authors proposed a transfer learning approach that takes advantage of sentiment analysis resources. Information regarding the context surrounding a given comment has been exploited in order to determine whether or not it has an ironic intention [2,40,65]. There are also deep learning-based approaches to irony detection: word embeddings and convolutional neural networks have been exploited for capturing the presence of irony in social media texts [22,23,31,36,49,52].

As in other natural language processing tasks, most of the research carried out on irony detection has been done in English. Notwithstanding, there have been some efforts to investigate this figurative language device in other languages, such as Chinese [62], Czech [53], Dutch [41], French [38], Italian [9], Portuguese [12], Spanish [33], and Arabic [39].

The strong relation between irony detection and sentiment analysis has led to the emergence of some evaluation campaigns focused on sentiment analysis in which the presence of ironic content was considered when assessing the performance of the participating systems. The 2014 [5] and 2016 [3] editions of SENTIPOLC (SENTIment POLarity Classification), in the framework of EVALITA, included a set of ironic tweets written in Italian.
A drop in the performance of the systems was observed when ironic instances were involved, confirming the important role of irony in sentiment analysis. In 2015, the first shared task dedicated to sentiment analysis of figurative language in Twitter [21] was organized. The first shared task considering the presence of ironic content in sentiment analysis of Twitter data written in French was organized in 2017 [6]. The participating systems proposed supervised methods to address the task by taking advantage of standard classifiers together with n-grams and word embeddings, as well as lexical resources. In a similar fashion, in [28] the authors proposed a pipeline approach that incorporates two modules: one for irony detection and the other for polarity assignment.

In addition, some shared tasks fully dedicated to irony detection have been organized. In 2018, in the framework of SemEval-2018, the first shared task aimed at detecting irony in Twitter was organized (SemEval-2018 Task 3: Irony Detection in English Tweets) [63]. The task was composed of two subtasks: i) determining whether a tweet is ironic or not (Task A), and ii) identifying which type of irony is expressed (Task B). The participating systems used a wide range of features (such as n-grams, syntactic features, sentiment-based features, punctuation marks, and word embeddings, among others) together with different classification approaches: ensemble-based classifiers, Logistic Regression (LR), Support Vector Machines (SVMs), as well as Long Short-Term Memory networks (LSTMs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). A shared task on irony detection in Italian tweets, denoted IronITA, was organized in the framework of the EVALITA 2018 evaluation campaign [15]. Two subtasks were proposed: i) determining the presence of irony, and ii) identifying different types of irony, with special attention to recognizing instances expressing sarcasm (from a computational linguistics perspective, irony is often considered an umbrella term covering sarcasm; however, there are theoretical foundations for the separation of both concepts: sarcasm involves a negative evaluation towards a particular target with the intention to offend [1]). Traditional classifiers (such as SVM and Naive Bayes (NB)) as well as deep learning techniques were used for addressing irony detection. Word embeddings, n-grams, different lexical resources, as well as stylistic and structural features were exploited to characterize the presence of ironic intention. At the time of writing, a new shared task on irony detection in Arabic tweets (IDAT 2019, https://www.irit.fr/IDAT2019/) has been organized. The aim of the competition is to determine whether or not an Arabic tweet is ironic. The IDAT task provides a useful evaluation framework for comparing the performance of Arabic irony detection methods against the results reported in recent shared tasks.

Analyzing the differences among diverse types of ironic devices has also been investigated. Both SemEval-2018 Task 3 and IronITA 2018 included subtasks aimed at identifying ironic instances in a finer-grained way. In [4] the authors attempted to distinguish between ironic and sarcastic tweets. An analysis of the multi-faceted affective information expressed in tweets labeled with ironic hashtags (#irony, #sarcasm, and #not) was carried out in [61], where the authors identified some interesting differences among such figurative linguistic devices. However, it has been recognized that such a challenging task is still very difficult [15,63].
3 Datasets Description

In this section we describe the datasets proposed for evaluation, how they were collected, the labeling process, and the inter-annotator agreement (IAA).

3.1 Annotation Guidelines

For creating our multi-variant dataset for irony detection in short messages written in Spanish, we decided not to use any kind of standard guideline in the annotation process. However, two important aspects were considered: i) the annotators for each variant must be native speakers, and they do not annotate messages in Spanish variants different from their own; for instance, Mexican annotators do not label messages from Cuba or Spain. This constraint was defined because there are significant biases in irony and sarcasm labeling when cultural and social knowledge are required to understand the underlying meaning of messages [34]. ii) We asked annotators to label each message as ironic/non-ironic, given a specific "context", based only on their own concept of irony; they made use of their own world knowledge and linguistic skills. Also, no differentiation among types of irony (situational, dramatic, or verbal) was made; in the case of sarcasm, the annotators assumed that it is a special case of irony.

3.2 Cuban Variant

In Cuba, the popularity of social platforms (Twitter, Facebook, WhatsApp, etc.) is now increasing due to the technological advances in the communication sector; however, the number of people who actually access them continues to be limited with respect to other countries such as Mexico or Spain. For this reason, it is difficult to retrieve many tweets posted by Cuban users. As an alternative, we explored other sources with similar characteristics. In particular, we envisaged news comments as an interesting textual genre that shares characteristics with tweets. To collect the news comments, three popular Cuban news sites were identified (Cubadebate, http://www.cubadebate.cu/; OnCuba, https://oncubanews.com/; CubaSí, http://cubasi.cu/).

In concordance with the idea presented in [56], we had the intuition that some topics or headlines are more controversial than others and generate larger discussion threads. In this scenario, readers spontaneously express their judgments, opinions, and emotions about the discussed news. This makes it possible to obtain diverse points of view about the same topic, where the irony device is often used. In this way, we manually chose 113 polemic headlines about social, economic, and political issues concerning Cuban people. We noted that a fast and large increase in the number of comments on a news item correlates with controversial topics; this observation helped us to speed up the selection process. Afterwards, the 113 headlines were grouped manually into 10 coarse topics, which can be considered as contexts:

- Digital Television, TV Decoders, Cuban Television and Audiovisuals (DigitalTV).
- Sports Scandals, Cuban National Baseball League and Football (Sports).
- ETECSA, Quality and Service (E-Quality).
- ETECSA, Internet, and Mobile Data (E-Mobile).
- Transport, Bus Drivers, Taxi Drivers, Buses and Itineraries (Transport).
- Advanced Technologies and Computerization of Society (TechSociety).
- Intra-Cuban Trade, Prices, Shops and Markets (IC-Trade).
- Economy, Hotels and Tourism (Economy).
- Science, Education and Culture (Science).
- Others.

Once we defined both the topics and the headlines, we extracted and filtered all the comments. News comments do not have any restriction on the maximum number of characters as imposed by Twitter; with the purpose of providing a dataset of short, tweet-like messages, we filtered out texts with more than 300 characters. A final dataset composed of 5507 comments was obtained.

The annotation process over the dataset was performed by three annotators simultaneously, all of them holding a degree in Linguistics. In a first stage, only 100 instances were labeled by the three annotators. Based on these, an initial IAA was computed in terms of Cohen's Kappa (κ) between pairs of annotators; the averaged value was κ = 0.39. All cases of disagreement were discussed in order to establish a consensus for the annotation process. Later, a second annotation stage was carried out in which all instances, including the previous ones, were labeled by the annotators. This time, an averaged κ = 0.67 was reached. This value reflects good agreement and is close to the results achieved in [63] for the English language. Finally, we considered as "ironic"/"non-ironic" those instances on which at least two annotators agreed. With this criterion we obtained a corpus with 1291 "ironic" and 4216 "non-ironic" comments.

The official dataset provided for evaluation purposes consists of 3000 news comments distributed across 9 distinct topics. We do not consider the topic "Others" because it is very broad and no "context" was provided for it. The data were then divided into two partitions, with 80% for training and the rest for testing. Table 1 shows the distribution of comments for each topic in the training and test data.

Table 1: Training and test partition distribution for the Cuban variant data.

              Training            Test
Topic         Ironic  Non-ironic  Ironic  Non-ironic
DigitalTV     137     275         32      65
Sports        108     219         28      55
E-Quality     100     201         25      51
E-Mobile      92      185         23      47
Transport     91      184         23      46
TechSociety   85      172         22      44
IC-Trade      74      150         19      38
Economy       57      103         14      26
Science       56      111         14      28
Total         800     1600        200     400
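The agreement and aggregation steps described above can be sketched as follows; the labels are hypothetical toy data, and we assume scikit-learn's implementation of Cohen's kappa (the organizers' actual scripts are not published with the paper):

```python
# Sketch of the IAA computation and the "at least two annotators agree"
# aggregation rule described above (toy labels).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# One label vector per annotator: 1 = ironic, 0 = non-ironic.
annotations = {
    "ann1": [1, 0, 0, 1, 0],
    "ann2": [1, 0, 1, 1, 0],
    "ann3": [0, 0, 1, 1, 0],
}

# Averaged pairwise Cohen's kappa, as reported for the Cuban data.
pairs = list(combinations(annotations, 2))
avg_kappa = sum(
    cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs
) / len(pairs)
print(f"averaged pairwise kappa = {avg_kappa:.2f}")

# Final label: the class chosen by at least two of the three annotators.
final_labels = [int(sum(votes) >= 2) for votes in zip(*annotations.values())]
print(final_labels)
```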
3.3 Mexican Variant

In a first attempt to build a dataset of tweets written in Mexico, we tried to collect ironic data from Twitter by applying a well-known strategy, i.e., relying on users' intent to self-annotate their tweets with specific hashtags: "#ironia" and "#sarcasmo" ("#irony" and "#sarcasm", respectively). However, we were able to retrieve only a few tweets with this methodology; it seems that those labels are not commonly used by Twitter users in Mexico to self-annotate their intention of being ironic. Thus, an alternative approach was followed. Our intuition was that, in replies to controversial tweets posted by accounts with a solid reputation for information disclosure, Twitter users express their opinions about a certain topic. In this way, it is possible to capture different points of view (including, of course, ironic statements) around the same topic. In other words, we establish a "context" in which a set of tweets is generated.

First, we selected a set of Twitter accounts belonging to well-known journalists, newspapers, news media, and the like. In a second step, we defined nine controversial topics in Mexico to be considered as "contexts":

- Divorce of the Former President of Mexico Enrique Peña Nieto (DivorceEPN).
- The "Roma" movie during the Academy Awards 2019 (RomeMovie).
- Process of selection of the head of Mexico's Energy Regulatory Commission (CRE).
- Fuel shortage in Mexico in January 2019 (F-Shortage).
- Funding cuts for children's day-care centers (Ch-Centers).
- Issues related to the new government in Mexico (GovMexico).
- Issues related to the new government in Mexico City (GovCDMX).
- Issues related to the National Council of Science and Technology (CONACYT).
- Issues related to the Venezuelan government (Venezuela).

Once both the accounts and the topics were defined, we manually collected a set of tweets on the aforementioned topics posted by the selected accounts. A total of 54 tweets (denoted tweetsForContext) were used as a starting point for retrieving the data. Then, for each tweet in tweetsForContext, we retrieved the tweets posted as answers to it. The final step consisted in filtering out instances composed of fewer than four words, as well as those containing only emojis, links, hashtags, or mentions. Additionally, with the intention of having a topic in common with the contexts considered in the data collected in Spain, we also considered one more topic: "People supporting the Flat Earth Theory" (FlatEarth). The data belonging to this theme were retrieved according to two criteria: i) by exploiting specific terms to perform Twitter queries, namely "tierraplanistas" and "tierra plana"; and ii) by verifying that the geo-localization of the tweets corresponds to a place in Mexico. The final set of collected tweets is composed of 5442 instances.
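The length and content filters described above (fewer than four words, or only emojis, links, hashtags, or mentions) might be implemented roughly as follows; the patterns are our own approximation, not the organizers' actual code:

```python
# Rough sketch of the filtering step described above (assumed patterns).
import re

TOKEN = re.compile(r"(https?://\S+|[@#]\w+)")

def keep(tweet: str) -> bool:
    residual = TOKEN.sub("", tweet)            # strip links, hashtags, mentions
    words = re.findall(r"[^\W\d_]+", residual)  # emoji-only texts yield no words
    return len(words) >= 4

print(keep("@user #tag https://t.co/x"))           # False: nothing left
print(keep("pues resulta que es el primer papá"))  # True: enough real words
```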
We performed an annotation process over the retrieved data involving three people. We did not provide any kind of annotation guideline; instead, we asked the annotators to rely on their own definition of what irony is. In the first stage, the data were annotated by two independent annotators. Then, only for those instances where a disagreement existed, we asked for a third annotation. The inter-annotator agreement in terms of Cohen's kappa between the first two annotators is κ = 0.1837 (a value reflecting slight agreement). The obtained IAA value confirms the inherent complexity involved in the annotation of ironic data [54]. After the first annotation, we obtained a total of 771 ironic tweets. The second stage of the annotation process involved 2015 instances that were labeled by the third annotator. Finally, 1180 tweets were annotated as ironic and 4262 as non-ironic.

The final dataset provided for evaluation purposes consists of 3000 tweets distributed across the 10 different topics. The data were then divided into two partitions, with 80% for training and the rest for testing. Table 2 shows the distribution of tweets for each topic in the corresponding data partitions.

Table 2: Training and test partition distribution for the Mexican variant data.

              Training            Test
Topic         Ironic  Non-ironic  Ironic  Non-ironic
DivorceEPN    46      90          11      23
RomeMovie     78      161         19      41
CRE           123     146         31      37
F-Shortage    111     128         28      33
Ch-Centers    80      159         21      40
GovMexico     114     125         29      32
GovCDMX       54      156         14      39
CONACYT       139     210         33      49
Venezuela     20      220         5       55
FlatEarth     35      205         8       52
Total         800     1600        199     401

3.4 Spanish Variant

For building the Spanish dataset, a process similar to that of the Cuban and Mexican variants was adopted, guided by the idea that controversial and broadly discussed topics are a potential source of spontaneous content in which several points of view on a particular topic are exposed, making this scenario an attractive way of capturing figurative language uses such as irony. Firstly, a set of 10 controversial topics for Spanish users was identified. For each topic, several queries were defined with the purpose of retrieving messages from Twitter about that topic. Table 3 shows the topics and the query terms defined.

After that, all tweets were manually labeled by two annotators. In this case, the annotators labeled tweets until 1000 ironic and 2000 non-ironic tweets were reached. For this dataset Cohen's Kappa was not computed, because a label was assigned only to those tweets on which both annotators agreed.

Table 3: Topics and query terms defined for the Spanish dataset variant.

Topic            Description / query terms
Tardà            Declaration of the Catalan politician in the procès trial. (joan tardá)
Relator          Relator (teller or rapporteur) figure to mediate in negotiations between the Spanish government and Catalonia. (relator)
LibroSánchez     Launch of the book "I will resist" (Resistiré) written by President Pedro Sánchez. (Pedro Sánchez & libro), (@sanchezcastejon & libro)
Franco           Exhumation process of the dictator Franco from the Valle de los Caídos. (exhumación & franco)
Grezzi           Valencian politician for Mobility. (Grezzi)
SemáforosA5      Start-up of traffic lights on the A5 motorway entering Madrid. (#semaforosA5)
TierraPlanistas  Referring to the current tendency of freethinkers in favor of the Earth being flat. (tierraplanistas & tierra plana)
VenAcenar        Reality show where a group of people take turns hosting dinner, episode 289. (#VenACenar289)
YoconAlbert      Hashtag of the political campaign of Albert Rivera, member of the Citizens party, running for the presidency. (#yoconalbert)
PañalesIglesias  The politician Pablo Iglesias of the Podemos party appears on the El Hormiguero TV program teaching Spaniards to change diapers. (@Pablo Iglesias & pañales)

The official dataset provided for evaluation purposes consists of 3000 tweets distributed across the 10 distinct topics. The data were then divided into two partitions, with 80% for training and the rest for testing. Table 4 shows the distribution of tweets for each topic in the training and test data.
Table 4: Training and test partition distribution for the Spain variant data.

                 Training            Test
Topic            Ironic  Non-ironic  Ironic  Non-ironic
Tardà            32      240         8       64
Relator          112     75          19      15
LibroSánchez     162     90          19      12
Franco           52      240         10      86
Grezzi           54      182         20      36
SemáforosA5      48      215         12      54
TierraPlanistas  86      191         31      40
VenAcenar        91      113         19      29
YoconAlbert      55      150         12      38
PañalesIglesias  108     104         50      26
Total            800     1600        200     400

4 Evaluation Measures and Baselines

Since we consider irony detection as a binary classification problem, we used the standard metrics for evaluating classifier performance. For the three subtasks, participating systems were evaluated using precision, recall, and F1, calculated as follows:

\[ \mathrm{Precision}_{class} = \frac{\#\,\text{correct classified}}{\#\,\text{total classified}} \qquad (1) \]

\[ \mathrm{Recall}_{class} = \frac{\#\,\text{correct classified}}{\#\,\text{total instances}} \qquad (2) \]

\[ F1_{class} = 2 \times \frac{\mathrm{Precision}_{class} \times \mathrm{Recall}_{class}}{\mathrm{Precision}_{class} + \mathrm{Recall}_{class}} \qquad (3) \]

The metrics are calculated per class label and macro-averaged. The submissions were ranked according to F1-Macro. This overall metric gives all class labels equal weight in the final score, which is especially relevant for imbalanced datasets. Participating teams were restricted to submitting only one run for each subtask.
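A minimal sketch of the official scoring, assuming scikit-learn; Eqs. (1)-(3) correspond to the per-class values that are then macro-averaged:

```python
# Minimal sketch of the official macro-F1 scoring (toy labels).
from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = [1, 0, 0, 1, 1, 0]  # gold labels: 1 = ironic, 0 = non-ironic
y_pred = [1, 0, 1, 1, 0, 0]  # hypothetical system output

# Per-class precision, recall and F1 as in Eqs. (1)-(3).
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print("per-class F1:", dict(zip(["non-ironic", "ironic"], f1)))

# Submissions were ranked by macro-averaged F1 (both classes weighted equally).
print("F1-Macro:", f1_score(y_true, y_pred, average="macro"))
```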
In order to assess the complexity of the task per language variant and the performance of the participants' approaches, we propose the following baselines:

- BASELINE-majority. A statistical baseline that always predicts the majority class in the training set. In case of balanced classes, it predicts one of them.
- BASELINE-word n-grams, with values of n from 1 to 10, selecting the 100, 200, 500, 1000, 2000, 5000, and 10000 most frequent ones.
- BASELINE-W2V [46,47]. Texts are represented with two word embedding models: i) Continuous Bag of Words (CBOW); and ii) Skip-Grams.
- BASELINE-LDSE [55]. This method represents documents on the basis of the probability distribution of occurrence of their words in the different classes. The key concept of LDSE is a weight representing the probability of a term belonging to one of the categories (here, ironic/non-ironic). The distribution of weights for a given document should be closer to the weights of its corresponding category. LDSE takes advantage of the whole vocabulary. A rough sketch of this idea follows the list.
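As announced above, here is a rough sketch of the LDSE idea; it is our own approximation, not the authors' exact formulation [55]: each term is weighted by the probability that one of its occurrences belongs to a class, and a document is described by summary statistics over its terms' weights.

```python
# Rough sketch of the LDSE idea (not the authors' exact formulation).
from collections import Counter
import statistics

train = [("what a great service again", 1), ("the bus arrived on time", 0)]

freq = {0: Counter(), 1: Counter()}
for text, label in train:
    freq[label].update(text.split())

def weight(term, c):
    # Probability that an occurrence of `term` belongs to class `c`.
    total = freq[0][term] + freq[1][term]
    return freq[c][term] / total if total else 0.0

def features(text):
    ws = [weight(t, 1) for t in text.split()]
    # A few of the summary statistics LDSE-style representations use.
    return [min(ws), max(ws), statistics.mean(ws)]

print(features("great service on time"))  # low-dimensional document vector
```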
For all the baseline methods we experimented with several machine learning algorithms (listed below) and report the best-performing one in each case. For each method we used the default parameter settings provided by the Weka tool (https://www.cs.waikato.ac.nz/ml/index.html):

- Bayesian methods: Naive Bayes, Naive Bayes Multinomial, Naive Bayes Multinomial Text, Naive Bayes Multinomial Updateable, and BayesNet.
- Logistic methods: Logistic Regression and Simple Logistic.
- Neural networks: Multilayer Perceptron and Voted Perceptron.
- Support Vector Machine.
- Rule-based method: Decision Table.
- Trees: Decision Stump, Hoeffding Tree, J48, LMT, Random Forest, Random Tree, and REP Tree.
- Lazy method: KStar.
- Meta-classifiers: Bagging, Classification via Regression, Multiclass Classifier, Multiclass Classifier Updateable, and Iterative Classifier Optimizer.

Finally, we used the following configurations:

- BASELINE-word n-grams:
  - CU: 10000 word 1-grams + SVM
  - ES: 200 word 1-grams + BayesNet
  - MX: 2000 word 1-grams + SVM
- BASELINE-W2V:
  - CU: FastText-Wikipedia + Logistic Regression
  - ES: FastText-Wikipedia + Voted Perceptron
  - MX: FastText-Wikipedia + BayesNet
- BASELINE-LDSE:
  - CU: LDSE.v1 (MinFreq=10, MinSize=1) + Random Forest
  - ES: LDSE.v2 (MinFreq=5, MinSize=2) + SVM
  - MX: LDSE.v1 (MinFreq=2, MinSize=2) + BayesNet

5 Participating Systems

A total of 12 teams participated in the three subtasks (A, B, and C) on binary irony classification. Table 5 shows each team's name, institution, and country. As can be observed in the table, teams from five countries were motivated by the challenge: 4 teams from Spain, 3 from Mexico, 3 from Italy, one from Cuba, and one from Brazil.

Table 5: Participating teams.

Team name    Institution                                                    Country
ELiRF-UPV    Universitat Politècnica de València (UPV)                      Spain
CIMAT        Mathematics Research Center (CIMAT)                            Mexico
JZaragoza    Universitat Politècnica de València (UPV)                      Spain
ATC          Università degli Studi di Torino (UniTO)                       Italy
CICLiku      Computing Research Center, National Polytechnic
             Institute (CIC-IPN)                                            Mexico
LabGeoCi     Mathematics Research Center (CIMAT); Center for Research
             in Geospatial Information Sciences A.C. (CentroGeo)            Mexico
SCoMoDI      Università degli Studi di Torino (UniTO)                       Italy
LaSTUS/TALN  Universitat Pompeu Fabra (UPF)                                 Spain
VRAIN        Universitat Politècnica de València (UPV)                      Spain
Aspie96      Università degli Studi di Torino (UniTO)                       Italy
UO           Center for Pattern Recognition and Data Mining (CERPAMID)      Cuba
UFPelRules   Universidade Federal de Pelotas (UFPel)                        Brazil

In general, the participating systems employed machine learning-based approaches ranging from traditional classifiers (the SVM being the most popular one) to complex neural network architectures [24,59]; only one approach [13] addressed the challenge using a pattern matching strategy, and one more exploited the impostor method [60]. Regarding the features used, we identified word embeddings (different models were employed, such as Word2Vec, FastText, Doc2Vec, ELMo, and BERT) [19] as well as n-grams (in terms of words and characters) [32,48]. Only a few approaches took advantage of affective and stylistic features [10,18]. It is worth noticing the use of features extracted from universal syntactic dependencies [14], which proved to be useful for detecting irony.

Although we suggested considering the given context for identifying irony, only three approaches took it into account [10,18,32]. In general, no strong evidence emerged about the impact of context on understanding irony in short Spanish messages. We are aware that modeling the context is still really difficult. Moreover, comparing constrained to unconstrained systems, we noted that only two systems included additional data.

Table 6 shows the performance of each participant in terms of F1 on each subtask and the F1-Average (AVG) over all subtasks. Systems were ranked according to this last global score.

Table 6: Macro F-measure per language variant and global ranking as the average per variant.

Ranking  Team                   CU       ES       MX       AVG
1        ELiRF-UPV              0.6527   0.7167   0.6803   0.6832
2        CIMAT                  0.6596   0.6449   0.6709   0.6585
*        BASELINE-LDSE          0.6335   0.6795   0.6608   0.6579
3        JZaragoza              0.6163   0.6605   0.6703   0.6490
*        BASELINE-W2V           0.6033   0.6823   0.6271   0.6376
4        ATC                    0.5941   0.6512   0.6454   0.6302
5        CICLiku                0.5621   0.6875   0.6410   0.6302
6        LabGeoCi               0.6396   0.6251   0.6121   0.6256
*        BASELINE-word n-grams  0.5684   0.6696   0.6196   0.6192
7        SCoMoDI                0.6338   0.6652   0.5574   0.6188
8        LASTUS-UPF method1     0.6017   0.6606   0.5933   0.6185
9        VRAIN                  0.5204   0.6842   0.6476   0.6174
10       LASTUS-UPF method2     0.5737   0.6493   0.6218   0.6149
11       Aspie96                0.5388   0.5935   0.5747   0.5690
12       UO run2                0.5930   0.5445   0.5353   0.5576
13       UFPelRules             0.5620   0.5088   0.5464   0.5391
14       UO                     0.4996   0.5110   0.4890   0.4999
*        BASELINE-majority      0.4000   0.4000   0.4000   0.4000

         Min                    0.4996   0.5088   0.4890   0.4999
         Q1                     0.5620   0.6014   0.5617   0.5805
         Median                 0.5936   0.6502   0.6169   0.6187
         Mean                   0.5891   0.6288   0.6061   0.6080
         SDev                   0.0492   0.0653   0.0584   0.0496
         Q3                     0.6294   0.6641   0.6471   0.6302
         Max                    0.6596   0.7167   0.6803   0.6832
         Skewness               -0.2438  -0.8078  -0.4794  -0.7328
         Kurtosis               2.0663   2.4494   2.1610   2.8608
         Normality (p-value)    0.9119   0.0343   0.5211   0.0984

As can be observed in Table 6, all systems outperform the majority-class baseline, six surpass the word n-gram baseline, three achieve better results than the Word2Vec baseline, and only two outperform the LDSE baseline. These last two baselines clearly perform well in the three subtasks and can generally be considered strong.

Below we discuss the five best-performing teams, all of which built a constrained (i.e., only the provided training data were used) and supervised system. The best system, developed by [24], achieved an AVG = 0.6832. Their proposal computes vector representations combining the encoder part of a Transformer model and word embeddings extracted from a skip-gram model trained on 87 million tweets using the Word2Vec tool [45]. The messages were represented in a d-dimensional fixed embedding layer, which was initialized with the weights of the word embedding vectors. After that, transformer encoders are applied, relying on multi-head scaled dot-product attention. A global average pooling mechanism was applied to the output of the last encoder, which is used as input to a feed-forward neural network with only one hidden layer, whose output layer computes a probability distribution over the two classes of each subtask.

Among the top five systems we also find the teams CIMAT [48] (AVG=0.6585), JZaragoza (AVG=0.6490), ATC [14] (AVG=0.6302), and CICLiku [10] (AVG=0.6302). The CIMAT system builds vectors by concatenating features from three distinct representations: i) word embeddings learned by Word2Vec on a huge corpus, ii) a deep representation learned by LSTM neural networks, and iii) n-grams at the character and word levels. The first representation uses traditional pre-trained Word2Vec embeddings and averages the word vectors of the tokens contained in each document. The second considers only the last hidden state of an LSTM with 256 units. The third is a set of 2-, 3-, and 4-grams at the character and word levels, of which the top 5000 are selected using the chi-square metric implemented in the sklearn tool (https://scikit-learn.org/). All representations were concatenated and fed into an SVM with a linear kernel. The third-best system, presented by the team JZaragoza, addressed the challenge using a character- and word-based n-gram representation and an SVM classifier with a radial kernel.
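The n-gram channel that both CIMAT and JZaragoza rely on can be sketched with scikit-learn as follows; the data are toys and the parameters illustrative (the paper reports selecting the top 5000 n-grams):

```python
# Sketch of a character n-gram channel with chi-square selection and a
# linear SVM (toy data, assumed parameters; word level is analogous).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["pues resulta que es el primer papá",
         "eres tonto",
         "qué maravilla de servicio, otra vez sin conexión",
         "el tren llegó puntual"]
labels = [1, 0, 1, 0]

k = 20  # the paper uses the top 5000; capped here for the toy vocabulary
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # 2-3-4-grams
    SelectKBest(chi2, k=k),  # keep the n-grams scoring highest on chi-square
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["qué gran servicio"]))
```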
The team ATC ranked fourth and faced irony detection with a shallow machine learning approach. Its most salient and novel contribution is representing the messages by morphological and dependency-based features. It is also worth noting that the proposed model trained an SVM on the three datasets altogether (7,200 texts) and tested the same model on the three different test sets, regardless of the variant of Spanish. The fifth-best system was presented by the CICLiku team. The proposed model is based on FastText embeddings [8] trained on the Spanish Billion Words corpus [11] and on emotion levels as features, with AdaBoost M1 over Random Forest as classifier. Considering the role of affective information in irony detection, in this work the messages were represented by the six main emotions (love, joy, surprise, sadness, anger, and fear), with the particularity of taking into account the intensities of such emotions as learned from the text. The emotion-based representation (with only six features) achieved competitive results compared with the embedding-based representation.

The remaining systems obtained results very close to the word n-gram baseline. All of them except one (UFPelRules) tackled the irony detection task using supervised approaches; however, the nature and complexity of their architectures and features vary significantly. The LabGeoCi proposal [19] uses a distributed representation of the texts, namely the deep contextualized word representations ELMo (Embeddings from Language Models) [51]. The SCoMoDI system [18] uses an SVM with a radial kernel and stylistic, semantic, emotional, affective, and lexical features. In [59] the LaSTUS/TALN team trained the models for the different variants simultaneously and considered data from other IberLEF 2019 shared tasks as a data augmentation technique; it uses word embeddings (FastText) built with external data from the other IberLEF 2019 shared tasks, together with a neural model based on a simple bidirectional LSTM (biLSTM) network. The VRAIN system [32] uses vectors of word n-gram counts and an ensemble of SVM and Gradient Tree Boosting models. The Aspie96 team addressed the task with a character-level neural network, representing each character as an array of binary flags; the network is composed of several convolutional layers followed by a bidirectional GRU layer (BiGRU). The UO team [13] uses an adaptation of the impostors method with bag-of-words, punctuation marks, and stylistic features for building the vector representation. They submitted the results of two runs, the first considering as features the tokens extracted by the FreeLing NLP tokenizer, the second considering the lemmas extracted by the FreeLing tool (http://nlp.lsi.upc.edu/freeling/node/1). It is worth noticing that the UO team tackled the problem from a one-class classification perspective (commonly used for verification tasks). Finally, the last-ranked system, UFPelRules [37], the single unsupervised system, uses several linguistic patterns to build its models: syntactic rules, static expressions, lists of laughter expressions, specific scores, and symbolic language.
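As an illustration of the character-level design used by Aspie96, here is a minimal Keras sketch; the sizes and layer counts are our assumptions, not the team's exact architecture:

```python
# Illustrative sketch of a character-level model in the spirit of Aspie96:
# characters as binary flag vectors, convolutional layers, then a BiGRU.
# Sizes and layer counts are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

max_len, n_flags = 280, 70  # tweet length x per-character binary flags
model = tf.keras.Sequential([
    layers.Conv1D(64, 5, activation="relu", padding="same",
                  input_shape=(max_len, n_flags)),
    layers.Conv1D(64, 5, activation="relu", padding="same"),
    layers.Bidirectional(layers.GRU(64)),
    layers.Dense(1, activation="sigmoid"),  # ironic vs. non-ironic
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```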
6 Evaluation and Discussion of the Results

In this section we present and discuss the results obtained by the participants. Firstly, we show the final ranking. Then, we analyse the errors in the different variants. Finally, a cross-variant analysis is presented.

6.1 Global Ranking

A total of 12 teams participated in the shared task, submitting a total of 14 runs. Table 6 shows the overall performance per language variant and the ranking. The highest results were obtained for the Spanish variant (0.7167), followed by the Mexican (0.6803) and the Cuban (0.6527) ones. The best results for the Cuban variant were obtained by [48]; the best results for the other variants, as well as the best average results, were achieved by [24]. On average, the systems obtained better results for the Spanish variant (0.6288) than for the Mexican (0.6061) and Cuban (0.5891) ones. In the case of the Spanish variant, the distribution is also narrower than for the other variants (see Figure 1). This is reflected in the inter-quartile ranges (ES: 0.0627; MX: 0.0854; CU: 0.0674), although the standard deviation in the case of Spanish (0.0653) is higher than for the other variants (CU: 0.0492; MX: 0.0584). This is due to some systems with high performance (far from the average, albeit not enough to be considered outliers) that stretch the median up with respect to the mean (ES: 0.6502 vs. 0.6288; CU: 0.5936 vs. 0.5891; MX: 0.6169 vs. 0.6061). It can be observed in Figure 1 that the Spanish variant has two peaks, the highest one around 0.68 and the other around 0.52. This is reflected in the ranking by two groups of systems, with F-measures between 0.6251 and 0.7167, and between 0.5088 and 0.5445, respectively. Furthermore, the lowest p-value for this variant (0.0343) indicates a departure from the normal distribution.

Fig. 1: Distribution and density of the results in the different variants.

6.2 Results per Topic in each Variant

In this section we analyse the results achieved per topic. We aggregated all the systems' predictions, except the baselines, and calculated the F-measure per topic in each variant. The distributions of F-measures are plotted in Figures 2, 3, and 4 for Cuba, Spain, and Mexico, respectively.

Regarding Cuba, it can be observed that the topic where systems performed best is "Economy", although with a median similar to "E-Quality" and "DigitalTV". On the contrary, there are several topics where the systems performed worse, although with different behaviours. For example, the median value is similar for "Sports" and "TechSociety"; nevertheless, the sparsity is much higher in the latter case, with even an outlier system that failed in most cases.

Fig. 2: Distribution of results per topic in the Cuban variant.
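The per-topic figures are obtained by pooling all systems' predictions and scoring each topic separately; a minimal sketch, assuming a simple (topic, gold, predicted) record format of our own:

```python
# Sketch of the per-topic analysis: group pooled predictions by topic
# and compute macro-F1 per group (toy records, assumed format).
from collections import defaultdict
from sklearn.metrics import f1_score

rows = [  # (topic, gold label, predicted label)
    ("Economy", 1, 1), ("Economy", 0, 0), ("Sports", 1, 0), ("Sports", 0, 0),
]

by_topic = defaultdict(lambda: ([], []))
for topic, gold, pred in rows:
    by_topic[topic][0].append(gold)
    by_topic[topic][1].append(pred)

for topic, (gold, pred) in by_topic.items():
    print(topic, f1_score(gold, pred, average="macro"))
```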
Regarding Spain, the topic where the systems performed best was "El Relator" (The Relator), with a high median and a not very large inter-quartile range (sparsity). Furthermore, this is the topic with the highest F-measure, with a median of about 0.75. The topic with the worst performance is "VenACenar" (the reality show), where there are also two outliers with F-measures close to 0.45. There are two topics with similar maximum, minimum, and inter-quartile range, but with inverted medians: "Franco" and "YoconAlbert". We can also highlight the results obtained for the "Tierraplanistas" (Flat-earthers) topic due to their low sparsity: most systems behaved similarly, albeit the overall performance was not very high, contrary to what could be expected given the topic.

Fig. 3: Distribution of results per topic in the Spanish variant.

Regarding Mexico, the topics with the highest performance are "Funding cuts for children day-care centers" and "CRE", the second with the lowest sparsity. The topic with the lowest performance is "Venezuela", with average values around 0.50. Similar to the Spanish variant, the topic with the lowest sparsity is "FlatEarth", although the performance of the systems is higher on average (0.60 vs. 0.55), probably meaning that irony is easier to identify in Mexico for this particular topic.

Fig. 4: Distribution of results per topic in the Mexican variant.

6.3 Error Analysis

We aggregated all the participants' predictions for the different variants, except the baselines, and plotted the respective confusion matrices in Figures 5, 6, and 7 for Cuba, Spain, and Mexico, respectively. In all the variants, the highest confusion is from Ironic to Non-Ironic texts (0.5338, 0.4963, and 0.5263, respectively, for Cuba, Spain, and Mexico). As can be seen, this error is similar in the three variants, ranging from 0.4963 to 0.5338, a difference of 0.0375. Regarding the confusion from Non-Ironic to Ironic texts, the difference among variants is also similar (0.2761, 0.2357, and 0.2579), although with a slightly larger range of 0.0404. As a consequence, the highest per-class results are obtained for Non-Ironic texts (0.7239, 0.7643, and 0.7421, respectively, for Cuba, Spain, and Mexico), whereas they are significantly lower for Ironic texts (0.4662, 0.5037, and 0.4737). As can be seen, in the case of Cuba and Mexico the accuracy on Ironic texts is below 50%.

Fig. 5: Aggregated confusion matrix for the Cuban variant.
Fig. 6: Aggregated confusion matrix for the Spanish variant.
Fig. 7: Aggregated confusion matrix for the Mexican variant.
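The aggregated matrices can be reproduced by pooling predictions and row-normalizing, e.g. with scikit-learn; the labels below are toys:

```python
# Sketch of an aggregated, row-normalized confusion matrix (toy labels).
from sklearn.metrics import confusion_matrix

gold = [1, 1, 0, 0, 1, 0]          # pooled gold labels (1 = ironic)
pooled_preds = [1, 0, 0, 1, 1, 0]  # pooled predictions from all systems

cm = confusion_matrix(gold, pooled_preds, labels=[1, 0], normalize="true")
print(cm)  # rows: true Ironic / Non-Ironic; columns: predicted
```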
6.4 Cross-Variant Evaluation

In this section we analyse the performance of the systems when trained on one variant and tested on a different one. Looking at Table 7, we can see that the highest performance was achieved by CIMAT when their system was trained on the Cuban variant and tested on the Spanish one (0.6106). Nevertheless, the average performance is very similar in all cases (see Figure 8), ranging from 0.5078 in the case of Spain to Cuba, to 0.5451 in the case of Cuba to Mexico. Similarly, the median ranges from 0.5145 in the case of Mexico to Cuba, to 0.5511, also in the case of Cuba to Mexico.

Table 7: Cross-variant ranking.

Team                 MX->ES   CU->ES   ES->MX   CU->MX   ES->CU   MX->CU   AVG
JZaragoza            0.4904   0.5846   0.5734   0.5741   0.5216   0.5263   0.5451
ELiRF                0.5359   0.5442   0.5595   0.5733   0.4978   0.5585   0.5449
CIMAT                0.5070   0.6106   0.4944   0.5632   0.5187   0.5593   0.5422
LabGeoCi             0.5328   0.4825   0.5464   0.5663   0.5218   0.5648   0.5358
LASTUS-UPF method1   0.5350   0.5183   0.5329   0.5404   0.5225   0.4842   0.5222
CICLiku              0.5238   0.5551   0.5100   0.5502   0.4841   0.5028   0.5210
SCoMoDI              0.4677   0.5333   0.5599   0.5519   0.5062   0.4866   0.5176
VRAIN                0.5198   0.5086   0.5422   0.5034   0.4683   0.5485   0.5151
LASTUS-UPF method2   0.5176   0.4523   0.5516   0.5478   0.5207   0.4712   0.5102
UO                   0.4626   0.3574   0.4891   0.4806   0.5166   0.4965   0.4671

Min                  0.4626   0.3574   0.4891   0.4806   0.4683   0.4712   0.4671
Q1                   0.4945   0.4890   0.5157   0.5423   0.4999   0.4891   0.5157
Median               0.5187   0.5258   0.5443   0.5511   0.5177   0.5145   0.5216
Mean                 0.5093   0.5147   0.5359   0.5451   0.5078   0.5199   0.5221
SDev                 0.0270   0.0720   0.0289   0.0306   0.0188   0.0358   0.0233
Q3                   0.5305   0.5524   0.5575   0.5655   0.5214   0.5560   0.5406
Max                  0.5359   0.6106   0.5575   0.5741   0.5225   0.5648   0.5451
Skewness             -0.7474  -0.8930  -0.5195  -1.1419  -1.1158  0.0464   -1.2446
Kurtosis             2.1136   3.4012   1.9277   3.1084   2.8864   1.3756   4.1749
Normality (p-value)  0.1146   0.5322   0.2960   0.0517   0.0160   0.1798   0.1008

Looking at Figure 8, we can highlight the similar inter-quartile ranges (sparsity) in the case of Cuba to Spain (from 0.4890 to 0.5524) and Mexico to Cuba (from 0.4891 to 0.5560), even with a small difference in their medians (0.5258 vs. 0.5145). In Figure 9, the distribution of the results in the cross-variant scenario is shown without outliers. This reshapes the figures and highlights some insights. For example, systems tested on Spanish from Spain have a similar median (0.5187 with the Mexican variant as training set; 0.5258 with the Cuban one); however, the inter-quartile range is much higher in the second case (0.0634 vs. 0.0360). With Mexico as the test variant, the systems performed better when trained on the Cuban variant than on the Spanish one (0.5451 vs. 0.5359 on average; 0.5511 vs. 0.5443 in median), and the sparsity is also lower (0.0232 vs. 0.0418 in terms of inter-quartile range). Finally, with Cuba as the test variant, the results are better with the Mexican variant as training in terms of maximum (0.5648 vs. 0.5225), Q3 (0.5560 vs. 0.5214), and mean (0.5199 vs. 0.5078); however, with the Spain variant as training, the sparsity is lower (0.0215 vs. 0.0669) and the median (0.5177 vs. 0.5145) is slightly higher.

Fig. 8: Distribution of results cross-variant.
Fig. 9: Distribution of results cross-variant (without outliers).
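Operationally, the cross-variant setting amounts to the following loop; the data are toy stand-ins and a deliberately simple TF-IDF + linear SVM pipeline stands in for the participants' actual systems:

```python
# Sketch of the cross-variant protocol: train on one variant's training
# split, test on another variant's test split (toy data).
from itertools import permutations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

variants = {
    "CU": {"train": (["qué maravilla de servicio", "el ómnibus llegó a tiempo"], [1, 0]),
           "test":  (["otra subida de precios, genial", "buen artículo"], [1, 0])},
    "ES": {"train": (["menudo fichaje, enhorabuena", "hoy llueve en Madrid"], [1, 0]),
           "test":  (["vaya nivel de debate", "el tren sale a las ocho"], [1, 0])},
    "MX": {"train": (["súper plan de gobierno, aplausos", "mañana hay partido"], [1, 0]),
           "test":  (["qué gran noticia, otra vez sin luz", "la película estuvo bien"], [1, 0])},
}

for src, tgt in permutations(variants, 2):
    X_tr, y_tr = variants[src]["train"]
    X_te, y_te = variants[tgt]["test"]
    model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(X_tr, y_tr)
    print(f"{src}->{tgt}:", f1_score(y_te, model.predict(X_te), average="macro"))
```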
6.5 Intra-Variant vs. Cross-Variant

In this section the intra-variant results are compared with the results in the cross-variant scenario. As can be seen in Table 8, there is a considerable decrease in performance for all the statistical variables, especially in the case of the best-performing system, where the F-measure decreases from 0.6832 to 0.5451 (a drop of 0.1381).

Table 8: Statistics intra-variant vs. cross-variant.

Statistics           Intra-Variant  Cross-Variant  Diff
Min                  0.4999         0.4671         0.0328
Q1                   0.5805         0.5157         0.0648
Median               0.6187         0.5216         0.0971
Mean                 0.6080         0.5221         0.0859
SDev                 0.0496         0.0233         0.0263
Q3                   0.6302         0.5406         0.0896
Max                  0.6832         0.5451         0.1381
Skewness             -0.7328        -1.2446        0.5118
Kurtosis             2.8608         4.1749         -1.3141
Normality (p-value)  0.0984         0.1008         -0.0024

As can be seen in Figure 10, the intra-variant results are closer to a normal distribution, with the average performance around 0.6080, whereas the cross-variant results contain two clear peaks, one around the median value of 0.5216 and the other around the minimum value of 0.4671. Nevertheless, the systems' behavior in the cross-variant scenario is more homogeneous: most of them obtained results around the mean, and their inter-quartile range is half as large (0.0249 vs. 0.0497).

Fig. 10: Distribution and density of results intra-variant vs. cross-variant.

7 Conclusions

This paper describes IroSvA (Irony Detection in Spanish Variants), the first shared task fully dedicated to irony detection in short messages written in Spanish. The task was composed of three subtasks aiming to identify irony in user-generated content written by Spanish-speaking users from Spain, Mexico, and Cuba. Unlike related competitions, the participating systems in this task were asked to determine the presence of ironic content considering not only isolated texts but also the "context" to which each text belongs. Datasets for each variant were developed considering diverse contexts drawn from controversial topics in each country. Aiming to investigate their performance in a cross-variant setting, the participating systems were asked to train their models on a given variant and evaluate them on the two remaining ones.

A total of 12 teams participated in the shared task. Several approaches were proposed by the participants, ranging from traditional strategies exploiting n-grams (at both word and character levels) and stylistic and syntactic features, to deep learning models using different word embedding representations (such as Word2Vec, FastText, and ELMo), convolutional layers, autoencoders, and LSTMs. The performance of the systems was ranked considering as evaluation metric the F1-Average (which takes into account the F1 score obtained in each subtask). Overall, participating systems achieved a higher performance in terms of F1 for the Spanish variant. The best-ranked team, ELiRF-UPV, achieved an F1-Average of 0.6832 by exploiting a deep learning-based approach. Regarding the cross-variant evaluation, the best result (0.6106 in terms of F1) was obtained by CIMAT when their system was trained on the Cuban variant and then applied to the one coming from Spain. It is important to highlight that the results achieved by the participating systems are similar to those obtained in other shared tasks on irony detection focused on different languages.

Broadly speaking, IroSvA serves to establish a common framework for the evaluation of Spanish irony detection models. Furthermore, the datasets developed for this task could serve to foster research on irony detection when the instances are related to a defined context.

Acknowledgments

The work of the fourth author was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).
The third and fifth authors were partially supported by CONACYT-Mexico project FC-2410.

References

1. Attardo, S.: Irony as Relevant Inappropriateness. Journal of Pragmatics 32(6), 793–826 (2000). https://doi.org/10.1016/S0378-2166(99)00070-3
2. Bamman, D., Smith, N.A.: Contextualized Sarcasm Detection on Twitter. In: Proceedings of the Ninth International Conference on Web and Social Media, ICWSM 2015. pp. 574–577. AAAI, Oxford, UK (2015)
3. Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., Patti, V.: Overview of the Evalita 2016 SENTIment POLarity Classification Task. In: Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016. CEUR Workshop Proceedings, vol. 1749. CEUR-WS.org (2016)
4. Barbieri, F., Saggion, H., Ronzano, F.: Modelling Sarcasm in Twitter, a Novel Approach. In: Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. pp. 50–58. Association for Computational Linguistics, Baltimore, Maryland, USA (June 2014)
5. Basile, V., Bolioli, A., Nissim, M., Patti, V., Rosso, P.: Overview of the Evalita 2014 SENTIment POLarity Classification Task. In: Proceedings of the First Italian Conference on Computational Linguistics (CLiC-it 2014) & the Fourth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian EVALITA 2014. pp. 50–57 (2014)
6. Benamara, F., Grouin, C., Karoui, J., Moriceau, V., Robba, I.: Analyse d'Opinion et Langage Figuratif dans des Tweets : Présentation et Résultats du Défi Fouille de Textes DEFT2017. In: Actes de l'atelier DEFT2017 Associé à la Conférence TALN. Orléans, France (June 2017)
7. Bharti, S.K., Vachha, B., Pradhan, R.K., Babu, K.S., Jena, S.K.: Sarcastic Sentiment Detection in Tweets Streamed in Real Time: a Big Data Approach. Digital Communications and Networks (2016). https://doi.org/10.1016/j.dcan.2016.06.002
8. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. Transactions of the ACL 5, 135–146 (2017)
9. Bosco, C., Patti, V., Bolioli, A.: Developing Corpora for Sentiment Analysis: The Case of Irony and Senti-TUT. IEEE Intelligent Systems 28(2), 55–63 (2013)
10. Calvo, H., Juárez-Gambino, O.: Emotion-Based Cross-Variety Irony Detection. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019). CEUR Workshop Proceedings. CEUR-WS.org, Bilbao, Spain (2019)
11. Cardellino, C.: Spanish Billion Words Corpus and Embeddings (2016). http://crscardellino.me/SBWCE/
12. Carvalho, P., Sarmento, L., Silva, M.J., de Oliveira, E.: Clues for Detecting Irony in User-generated Contents: Oh...!! it's "so easy" ;-). In: Proceedings of the 1st International Conference on Information Knowledge Management Workshop on Topic-Sentiment Analysis for Mass Opinion. pp. 53–56 (2009)
CEUR-WS.org, Bilbao, Spain (2019) 14. Cignarella, A.T., Bosco, C.: ATC at IroSvA 2019: Shallow Syntactic Dependency- based Features for Irony Detection in Spanish Variants. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with 34th Con- ference of the Spanish Society for Natural Language Processing (SEPLN 2019). CEUR Workshop Proceedings. CEUR-WS.org, Bilbao, Spain (2019) 15. Cignarella, A.T., Frenda, S., Basile, V., Bosco, C., Patti, V., Rosso, P., et al.: Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets (IronITA). In: Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018). vol. 2263, pp. 1–6. CEUR-WS (2018) 16. Davidov, D., Tsur, O., Rappoport, A.: Semi-supervised Recognition of Sarcastic Sentences in Twitter and Amazon. In: Proceedings of the Fourteenth Conference on Computational Natural Language Learning. pp. 107–116. CoNLL ’10, Association for Computational Linguistics, Uppsala, Sweden (2010) 17. Filatova, E.: Irony and Sarcasm: Corpus Generation and Analysis Using Crowd- sourcing. In: Proceedings of the Eighth International Conference on Language Re- sources and Evaluation (LREC-2012). pp. 392–398. European Language Resources Association (ELRA), Istanbul (May 2012) 18. Frenda, S., Patti, V.: SCoMoDI: Computational Models for Irony Detection in three Spanish Variants. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with 34th Conference of the Spanish Society for Nat- ural Language Processing (SEPLN 2019). CEUR Workshop Proceedings. CEUR- WS.org, Bilbao, Spain (2019) 19. Garcı́a, L., Moctezuma, D., Muñiz, V.: A Contextualized Word Representation Approach for Irony Detection. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with 34th Conference of the Spanish Society 252 Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019) for Natural Language Processing (SEPLN 2019). CEUR Workshop Proceedings. CEUR-WS.org, Bilbao, Spain (2019) 20. Garmendia, J.: Irony. Cambridge University Press, New York, USA, first edn. (2018). https://doi.org/10.1017/9781316136218 21. Ghosh, A., Li, G., Veale, T., Rosso, P., Shutova, E., Barnden, J., Reyes, A.: SemEval-2015 Task 11: Sentiment Analysis of Figurative Language in Twitter. In: Proceedings of the 9th International Workshop on Semantic Evaluation. pp. 470–478. Association for Computational Linguistics, Denver, Colorado (2015) 22. Ghosh, A., Veale, T.: Fracking Sarcasm using Neural Network. In: Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. pp. 161–169. Association for Computational Linguistics, San Diego, California (June 2016), http://www.aclweb.org/anthology/W16-0425 23. Ghosh, D., Richard Fabbri, A., Muresan, S.: The Role of Conversation Con- text for Sarcasm Detection in Online Interactions. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. pp. 186–196. As- sociation for Computational Linguistics, Saarbrücken, Germany (Aug 2017). https://doi.org/”10.18653/v1/W17-5523” 24. González, J.A., Hurtado, L.F., Pla, F.: ELiRF-UPV at IroSvA: Transformer En- coders for Spanish Irony Detection. In: Proceedings of the Iberian Languages Eval- uation Forum (IberLEF 2019), co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019). CEUR Workshop Pro- ceedings. CEUR-WS.org, Bilbao, Spain (2019) 25. 
González-Ibáñez, R., Muresan, S., Wacholder, N.: Identifying Sarcasm in Twitter: A Closer Look. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 581–586. HLT ’11, Association for Computational Linguistics, Portland, Oregon (2011) 26. Gupta, R.K., Yang, Y.: CrystalNest at SemEval-2017 Task 4: Using Sarcasm De- tection for Enhancing Sentiment Classification and Quantification. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 626–633. Association for Computational Linguistics, Vancouver, Canada (2017). https://doi.org/10.18653/v1/S17-2103 27. Hernández Farı́as, D.I., Benedı́, J.M., Rosso, P.: Applying Basic Features from Sentiment Analysis for Automatic Irony Detection. In: Paredes, R., Cardoso, J.S., Pardo, X.M. (eds.) Pattern Recognition and Image Analysis, Lecture Notes in Com- puter Science, vol. 9117, pp. 337–344. Springer International Publishing, Santiago de Compostela, Spain (2015). https://doi.org/10.1007/978-3-319-19390-8 38 28. Hernández Farı́as, D.I., Bosco, C., Patti, V., Rosso, P.: Sentiment Polarity Classifi- cation of Figurative Language: Exploring the Role of Irony-Aware and Multifaceted Affect Features. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. pp. 46–57. Springer International Publishing, Cham (2018) 29. Hernández Farı́as, D.I., Patti, V., Rosso, P.: Irony Detection in Twitter: The Role of Affective Content. ACM Trans. Internet Technol. 16(3), 19:1–19:24 (2016). https://doi.org/10.1145/2930663 30. Hernández Farı́as, D.I., Rosso, P.: Irony, Sarcasm, and Sentiment Analysis. In: Pozzi, F.A., Fersini, E., Messina, E., Liu, B. (eds.) Sentiment Analy- sis in Social Networks, pp. 113–128. Elsevier Science and Technology (2016), http://dx.doi.org/10.1016/B978-0-12-804412-4.00007-3 31. Huang, Y.H., Huang, H.H., Chen, H.H.: Irony Detection with Attentive Recurrent Neural Networks. In: Jose, J.M., Hauff, C., Altıngovde, I.S., Song, D., Albakour, D., Watt, S., Tait, J. (eds.) Advances in Information Retrieval. pp. 534–540. Springer International Publishing, Cham (2017) 253 Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019) 32. Iranzo-Sánchez, J., Ruiz-Dolz, R.: VRAIN at IroSvA 2019: Exploring Classical and Transfer Learning Approaches to Short Message Irony Detection. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019). CEUR Workshop Proceedings. CEUR-WS.org, Bilbao, Spain (2019) 33. Jasso López, G., Meza Ruiz, I.: Character and Word Baselines Systems for Irony Detection in Spanish Short Texts. Procesamiento del Lenguaje Natural 56, 41–48 (2016), http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/5285 34. Joshi, A., Bhattacharyya, P., Carman, M.J.: Investigations in Computational Sar- casm. Springer Nature, Singapore (2018) 35. Joshi, A., Bhattacharyya, P., Carman, M.J.: Automatic Sarcasm De- tection: A Survey. ACM Comput. Surv. 50(5), 73:1–73:22 (Sep 2017). https://doi.org/10.1145/3124420 36. Joshi, A., Tripathi, V., Patel, K., Bhattacharyya, P., Carman, M.J.: Are Word Embedding-based Features Useful for Sarcasm Detection? In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November, 2016. pp. 1006–1011 (2016) 37. Justin Deon, D., de Freitas, L.A.: UFPelRules to Irony Detection in Spanish Vari- ants. 
38. Karoui, J., Benamara, F., Moriceau, V., Aussenac-Gilles, N., Hadrich-Belguith, L.: Towards a Contextual Pragmatic Model to Detect Irony in Tweets. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pp. 644–650. Association for Computational Linguistics (July 2015)
39. Karoui, J., Benamara Zitoune, F., Moriceau, V.: SOUKHRIA: Towards an Irony Detection System for Arabic in Social Media. In: 3rd International Conference on Arabic Computational Linguistics, ACLing 2017. pp. 161–168. Association for Computational Linguistics (ACL), Dubai, United Arab Emirates (2017)
40. Khattri, A., Joshi, A., Bhattacharyya, P., Carman, M.: Your Sentiment Precedes You: Using an Author’s Historical Tweets to Predict Sarcasm. In: Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. pp. 25–30. Association for Computational Linguistics, Lisboa, Portugal (September 2015)
41. Kunneman, F., Liebrecht, C., van Mulken, M., van den Bosch, A.: Signaling Sarcasm: From Hyperbole to Hashtag. Information Processing & Management 51(4), 500–509 (2015)
42. Liu, B.: Sentiment Analysis and Opinion Mining, vol. 5. Morgan & Claypool Publishers (2012). https://doi.org/10.2200/S00416ED1V01Y201204HLT016
43. Lukin, S., Walker, M.: Really? Well. Apparently Bootstrapping Improves the Performance of Sarcasm and Nastiness Classifiers for Online Dialogue. In: Proceedings of the Workshop on Language Analysis in Social Media. pp. 30–40. Association for Computational Linguistics, Atlanta, Georgia (June 2013)
44. Maynard, D., Greenwood, M.A.: Who Cares about Sarcastic Tweets? Investigating the Impact of Sarcasm on Sentiment Analysis. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (2014)
45. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. NIPS. pp. 1–9 (2013). https://doi.org/10.1162/jmlr.2003.3.4-5.951
46. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. In: Proceedings of Workshop at International Conference on Learning Representations (ICLR’13) (2013)
47. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
48. Miranda-Belmonte, H.U., López-Monroy, A.P.: Early Fusion of Traditional and Deep Features for Irony Detection in Twitter. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019). CEUR Workshop Proceedings. CEUR-WS.org, Bilbao, Spain (2019)
49. Nozza, D., Fersini, E., Messina, E.: Unsupervised Irony Detection: A Probabilistic Model with Word Embeddings. In: Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management. pp. 68–76 (2016). https://doi.org/10.5220/0006052000680076
50. Pang, B., Lee, L.: Opinion Mining and Sentiment Analysis. Foundations and Trends® in Information Retrieval 2(1–2), 1–135 (2008). https://doi.org/10.1561/1500000011
51. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep Contextualized Word Representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana (2018). https://doi.org/10.18653/v1/N18-1202
52. Poria, S., Cambria, E., Hazarika, D., Vij, P.: A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 1601–1612. Association for Computational Linguistics, Osaka, Japan (Dec 2016), https://www.aclweb.org/anthology/C16-1151
53. Ptáček, T., Habernal, I., Hong, J.: Sarcasm Detection on Czech and English Twitter. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics. pp. 213–223. Dublin City University and Association for Computational Linguistics, Dublin, Ireland (August 2014)
54. Rangel, F., Hernández Farías, D.I., Rosso, P.: Emotions and Irony per Gender in Facebook. In: Proc. Workshop on Emotion, Social Signals, Sentiment & Linked Open Data (ES3LOD), LREC-2014. pp. 68–73. Reykjavík, Iceland (2014)
55. Rangel, F., Rosso, P., Franco-Salvador, M.: A Low Dimensionality Representation for Language Variety Identification. In: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing’16. Springer-Verlag, LNCS(9624), pp. 156–169 (2018)
56. Reyes, A., Rosso, P.: Mining Subjective Knowledge from Customer Reviews: A Specific Case of Irony Detection. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis. pp. 118–124. WASSA ’11, Association for Computational Linguistics, Stroudsburg, PA, USA (2011), http://dl.acm.org/citation.cfm?id=2107653.2107668
57. Reyes, A., Rosso, P., Veale, T.: A Multidimensional Approach for Detecting Irony in Twitter. Language Resources and Evaluation 47(1), 239–268 (2013)
58. Rosenthal, S., Ritter, A., Nakov, P., Stoyanov, V.: SemEval-2014 Task 9: Sentiment Analysis in Twitter. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). pp. 73–80 (2014)
59. Seda Mut Altin, L., Bravo, A., Saggion, H.: LaSTUS/TALN at IroSvA: Irony Detection in Spanish Variants. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019). CEUR Workshop Proceedings. CEUR-WS.org, Bilbao, Spain (2019)
60. Seidman, S.: Authorship Verification Using the Impostors Method. Notebook for PAN at CLEF 2013. In: Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013 (2013), http://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-Seidman2013.pdf
61. Sulis, E., Hernández Farías, D.I., Rosso, P., Patti, V., Ruffo, G.: Figurative Messages and Affect in Twitter: Differences between #irony, #sarcasm and #not. Knowledge-Based Systems 108, 132–143 (2016). https://doi.org/10.1016/j.knosys.2016.05.035
62. Tang, Y.J., Chen, H.H.: Chinese Irony Corpus Construction and Ironic Structure Analysis. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics. pp. 1269–1278. Association for Computational Linguistics, Dublin, Ireland (2014)
63. Van Hee, C., Lefever, E., Hoste, V.: SemEval-2018 Task 3: Irony Detection in English Tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation. SemEval-2018, Association for Computational Linguistics, New Orleans, LA, USA (June 2018)
64. Wallace, B.C.: Computational Irony: A Survey and New Perspectives. Artificial Intelligence Review 43(4), 467–483 (2015)
65. Wallace, B.C., Choe, D.K., Charniak, E.: Sparse, Contextually Informed Models for Irony Detection: Exploiting User Communities, Entities and Sentiment. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1035–1044. Association for Computational Linguistics, Beijing, China (July 2015)
66. Wallace, B.C., Choe, D.K., Kertz, L., Charniak, E.: Humans Require Context to Infer Ironic Intent (so Computers Probably do, too). In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 512–516. Association for Computational Linguistics, Baltimore, Maryland (June 2014)
67. Zhang, S., Zhang, X., Chan, J., Rosso, P.: Irony Detection via Sentiment-based Transfer Learning. Information Processing & Management 56(5), 1633–1644 (2019). https://doi.org/10.1016/j.ipm.2019.04.006