VRAIN at IroSvA 2019: Exploring Classical and Transfer Learning Approaches to Short Message Irony Detection

                   Javier Iranzo-Sánchez[0000−0002−4035−3295] and
                       Ramon Ruiz-Dolz[0000−0002−3059−8520]

            Valencian Research Institute for Artificial Intelligence (VRAIN)
                     Camino de Vera s/n. 46022 Valencia (Spain)
                         {jairsan,raruidol}@vrain.upv.es



        Abstract. This paper describes VRAIN’s participation at IroSvA 2019:
        Irony Detection in Spanish Variants task of the Iberian Languages Eval-
        uation Forum (IberLEF 2019). We describe the entire pre-processing, fea-
        ture extraction, model selection and hyperparameter optimization car-
        ried out for our submissions to the shared task. A central part of our
        work is to provide an in-depth comparison of the performance of differ-
        ent classical machine learning techniques, as well as some recent transfer
        learning proposals for Natural Language Processing (NLP) classification
        problems.

        Keywords: Natural Language Processing · Irony Detection · Transfer
        Learning


1     Introduction
From a linguistic point of view, irony is a very interesting property of language.
As defined in [9], irony is the ability to express a specific meaning through terms
and words that, by themselves, carry the completely opposite meaning. From a
computational viewpoint, on the other hand, irony is a major headache when performing
natural language analysis tasks. For example, in sentiment analysis, some of the most
important features used to determine the polarity of a text are inferred from the
words appearing in it (e.g. negation, n-grams, POS tags, etc.) [11]. With an ironic
text, some of these features need to be adjusted for sentiment analysis to be
performed correctly. That irony is a problem for sentiment analysis has been directly
observed in past editions of the International Workshop on Semantic Evaluation
(SemEval). In 2015 [17], two different datasets were considered, one without sarcastic
tweets and another containing them; system performance was considerably lower on the
sarcastic dataset. In fact, irony detection
was proposed as a task in 2015 [7] and tackled for English language tweets in
2018 [18].
    This paper describes VRAIN's participation at the IroSvA 2019: Irony Detection
in Spanish Variants [1] task of the Iberian Languages Evaluation Forum (IberLEF
2019) [13]. The task consists in identifying ironic texts written in three Spanish
variants (from Spain, Mexico and Cuba). For the Spain and Mexico subtasks we must
detect ironic tweets, whereas for the Cuba subtask we must detect ironic comments
from a news website.
    We worked on each subtask in isolation (e.g. only the tweets from Mexico were
used for the Mexico subtask), but used the same approach and pipeline in all three
subtasks. Model selection and hyperparameter optimisation were carried out
individually for each subtask.
    The rest of the paper is structured as follows. In Section 2 we explain the
feature extraction process carried out for this specific task. In Section 3 we briefly
describe the system and present all the different approaches taken into account
to perform our comparison. In Section 4 we present the evaluation made in order
to compare the behaviour of the different models considered in this work. We
also compare the results obtained by our approach with all the different baselines
provided by the organisation. Finally, in Section 5 we summarise the conclusions
and the most important features of our work.


2    Dataset and Feature Extraction

We first describe the structure of the competition's dataset. Table 1 contains
statistics of the training and test sets. Both the Spain (es) and Mexico (mx)
variants cover 10 different topics, with a varying number of tweets per topic. The
Cuba (cu) variant, on the other hand, covers only 9 topics, and the number of news
comments per topic also varies. Each variant has 2400 training samples, so the corpus
is balanced across variants, and the test set contains 600 samples per variant.


                   Table 1. Statistics of the IroSvA 2019 dataset.

    Variant Topics Train Samples (Ironic Samples) Test Samples (Ironic Samples)
      es      10            2400 (800)                     600 (200)
     mx       10            2400 (800)                     600 (200)
      cu       9            2400 (800)                     600 (200)



    Having described the most important features of the dataset, we now focus on the
feature extraction process. The text was tokenized using NLTK's [3] TweetTokenizer.
Additionally, we experimented with substituting all occurrences of hashtags, URLs,
user mentions and numbers with a generic token for each category, but we finally
decided against it since it decreased the model's performance.
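As an illustration, the tokenization step could be sketched as follows; the example texts are toy stand-ins, not samples from the IroSvA corpus:

from nltk.tokenize import TweetTokenizer

# Toy examples standing in for the IroSvA training texts.
raw_train_texts = [
    "Qué maravilla de lunes... #ironía @usuario",
    "Me encanta madrugar para nada :)",
]

# TweetTokenizer keeps hashtags, user mentions and emoticons as single tokens.
tokenizer = TweetTokenizer()
tokenized_train = [" ".join(tokenizer.tokenize(text)) for text in raw_train_texts]
print(tokenized_train)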




    Each tweet was represented by a vector of counts of word n-grams. Using
counts directly instead of tf-idf performed better in our exploratory experiments.
The dataset contains additional information apart from the tweets themselves.
Specifically, we are given the corresponding topic for each of the tweets. We
have tried two ways of leveraging this information. In the first approach, which
we have called global-model, only one model is trained for each subtask, and a
one-hot vector encoding the topic is appended to every sample. Therefore, in
this approach, we have a single model per sub-task, trained with data from all
the topics.
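A minimal sketch of how such a global-model feature matrix could be assembled with sklearn is shown below; the (1, 2) n-gram range and the toy data are illustrative assumptions, not the configuration we selected:

import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the tokenized texts and their associated topics.
texts = ["qué maravilla de lunes", "me encanta madrugar para nada"]
topics = np.array([["topic_a"], ["topic_b"]])

# Word n-gram counts (raw counts rather than tf-idf, as described above).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_ngrams = vectorizer.fit_transform(texts)

# One-hot encoding of the topic, appended to every sample.
topic_encoder = OneHotEncoder(handle_unknown="ignore")
X_topic = topic_encoder.fit_transform(topics)

# Final feature matrix for the global model: n-gram counts + topic one-hot vector.
X_global = hstack([X_ngrams, X_topic])
print(X_global.shape)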
    In the second approach, which we have called topic-model, we trained one model
per topic. Thus, at training time, each individual model was trained using only data
from one topic, and at inference time each tweet was classified by the model trained
on the data of its topic. The results of both approaches are compared and evaluated
in Section 4.
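The topic-model variant could be sketched along the following lines, training one independent classifier per topic and routing each test message to the classifier of its topic; the count-based linear SVM pipeline used here is only illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_topic_models(train_texts, train_topics, train_labels):
    """Train one pipeline (n-gram counts + linear SVM) per topic."""
    models = {}
    for topic in set(train_topics):
        idx = [i for i, t in enumerate(train_topics) if t == topic]
        pipeline = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
        pipeline.fit([train_texts[i] for i in idx], [train_labels[i] for i in idx])
        models[topic] = pipeline
    return models

def predict_by_topic(models, test_texts, test_topics):
    """Each message is scored by the model trained on its own topic."""
    return [models[topic].predict([text])[0]
            for text, topic in zip(test_texts, test_topics)]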


3     System Description

We will now describe the different models we tried for irony detection. In order to
select appropriate values for the hyperparameters of each model, we carried out
5-fold cross-validation and selected the configurations that obtained the highest
macro-averaged F1. Unless otherwise noted, the methods are implemented using the
sklearn toolkit [14].
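For instance, the search for one of the models could be carried out along these lines; the parameter grid shown is a placeholder, not the grid we actually explored:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("svm", SVC(kernel="linear")),
])

# Placeholder grid; scoring and cv reflect the 5-fold macro-F1 selection above.
param_grid = {
    "counts__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
# search.fit(train_texts, train_labels)
# print(search.best_params_, search.best_score_)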


3.1   Classification approaches

 – Naive Bayes: The Naive Bayes approach is a well-known technique for tackling
   many classification problems. A multinomial distribution is used to model
   P(x_i | c), the probability of feature x_i given class c.
 – Support Vector Machines: Support Vector Machines [5] are maximum-margin
   classifiers that have been shown to obtain good results in a variety of tasks.
   We use a linear kernel, which has been shown to outperform non-linear kernels
   in text classification problems [20].
 – Gradient Tree Boosting: Gradient Tree Boosting is a boosting technique that
   consists of an ensemble of tree models built sequentially from a set of weak
   learners. We have used the implementation available in the XGBoost toolkit [4].
   A sketch showing how these classical classifiers might be instantiated is given
   after this list.
 – Linear Models (fastText): fastText [8] is a toolkit implementing a set of
   linear architectures for text classification. The model, based on the CBOW
   architecture [12], uses a word embedding matrix to look up a representation of
   each word in the text. The embeddings are averaged into a fixed-size vector,
   which is then fed into a softmax classifier. Additionally, we have also trained
   a version with pre-trained word embeddings, using a publicly available set of
   200-dimensional word embeddings trained on Spanish tweets [2].




 – BERT: BERT [6] is a pre-training methodology for Transformer models [19]. BERT
   models are pre-trained in an unsupervised fashion on massive amounts of text,
   and can then be used in a transfer-learning approach for downstream tasks. For
   this task, we used the pre-trained BERT-Base Multilingual Cased model and
   fine-tuned it on the IroSvA data for 10 epochs.
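As referenced in the list above, the classical classifiers could be instantiated as follows; the hyperparameter values are illustrative rather than the values selected by our search, and the fastText and BERT models are trained with their own toolkits rather than with sklearn:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from xgboost import XGBClassifier  # sklearn-compatible wrapper from the XGBoost toolkit

classifiers = {
    # Multinomial Naive Bayes over the n-gram count features.
    "naive_bayes": MultinomialNB(),
    # Maximum-margin classifier with a linear kernel.
    "svm": SVC(kernel="linear", probability=True),
    # Gradient tree boosting: an ensemble of sequentially built weak learners.
    "gradient_tree_boosting": XGBClassifier(n_estimators=200, max_depth=4),
}

# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)        # X_train: n-gram (+ topic) feature matrix
#     predictions[name] = clf.predict(X_test)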


4   Experimental Evaluation
The results obtained by the different models are shown in Table 2.


Table 2. Model performance measured with 5-fold CV performed over the training
data (macro F1).

                                                      Subtask
                Model                               es mx cu
                Naive Bayes (Topic model)           0.67 0.51 0.50
                Naive Bayes                         0.63 0.52 0.57
                fastText                            0.63 0.62 0.61
                fastText (Twitter pre-trained)      0.63 0.62 0.61
                BERT                                0.61 0.50 0.57
                SVM (Topic model)                   0.70 0.60 0.58
                SVM                                 0.70 0.60 0.66
                Gradient Tree Boosting (Topic model) 0.52 0.50 0.45
                Gradient Tree Boosting              0.69 0.60 0.66
                Ensemble (SVM + Gradient Boosting)  0.71 0.65 0.66


    We can see a number of interesting results in the table. First, except for a
single case (Naive Bayes for the es variant), topic models obtain similar or worse
results than their global counterparts. Most likely due to the limited amount of
training data available per topic, the fine-grained approach of modelling each topic
individually seems counterproductive.
    Regarding the transfer learning approaches we tried, we have not been able to
leverage the knowledge obtained during pre-training. The fastText model using
pre-trained embeddings does not improve on the results of the base fastText model,
and the BERT model obtains results similar to those of the Naive Bayes model.
    Overall, the best results are obtained by the SVM and Gradient Boosting
models. In order to further improve the results, we have constructed an Ensemble
of the SVM and Gradient Boosting models, whose predictions are the average
of the individual models’ predictions. This obtains additional improvements in
the es and mx variants, and was the model submitted to the competition. Table
3 shows the performance of our model compared to the competition baselines.
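A sketch of this averaging ensemble using sklearn's soft voting is shown below; our submission averaged the predictions of the two already-tuned models, so this is an illustration of the idea rather than the exact submitted pipeline:

from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Soft voting averages the class probabilities predicted by both members.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", probability=True)),
        ("xgb", XGBClassifier()),
    ],
    voting="soft",
)

# ensemble.fit(X_train, y_train)
# y_pred = ensemble.predict(X_test)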
    The results obtained by our model present significant variations depending on
the subtask. In the case of the es subtask, our model outperforms all baselines, and
in the mx subtask our system comes in second place behind the LDSE [15] baseline.




Table 3. System comparison between our submission and the competition baselines
(macro F1)

                                      Subtask
                    Model         es     mx     cu Average
                    LDSE [15]   0.6795 0.6608 0.6335 0.6579
                    W2V         0.6823 0.6271 0.6033 0.6376
                    Word nGrams 0.6696 0.6196 0.5684 0.6192
                    MAJORITY 0.4000 0.4000 0.4000 0.4000
                    VRAIN       0.6842 0.6476 0.5204 0.6174


However, in the case of the cu task, our model is only able to beat the major-
ity baseline. We do not know the reasons for the significant performance drop
in the cu task between our internal experiments and the competition results,
although one possible culprit is the aforementioned domain mismatch between
the {es, mx} and cu tasks.


5   Conclusions
This paper has described VRAIN’s submission to IroSvA 2019. The different
experiments have shown that, under the current conditions, classical models
have an edge over some of the recent transfer-learning techniques that we tested.
We believe that the limiting factor is the lack of sufficient training data for the
fine-tuning step.
    Our submission, based on an Ensemble of SVM and Gradient Tree Boosting
models, obtains good results across the board, although the performance could
be improved in the cu case. This has been achieved using non-task-specific bag
of n-gram features. It is expected that these results could be further improved
with specific features for irony detection, such as those from [16,10].

Acknowledgements The research leading to these results has received funding
from the European Union’s Horizon 2020 research and innovation programme
under grant agreement no. 761758 (X5gon) and from the Valencian Government
grant for excellence research groups PROMETEO/2018/002.


References
 1. IroSvA 2019: Irony detection in spanish variants. http://www.autoritas.net/
    IroSvA2019/, accessed: 2019-07-05
 2. Word embeddings trained with word2vec on 200 million spanish tweets using 200
    dimensions, http://new.spinningbytes.com/resources/wordembeddings/
 3. Bird, S.: NLTK: the natural language toolkit. In: ACL 2006, 21st International
    Conference on Computational Linguistics and 44th Annual Meeting of the As-
    sociation for Computational Linguistics, Proceedings of the Conference, Sydney,
    Australia, 17-21 July 2006 (2006), http://aclweb.org/anthology/P06-4018




 4. Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: Proceedings
    of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
    Data Mining, San Francisco, CA, USA, August 13-17, 2016. pp. 785–794 (2016),
    https://doi.org/10.1145/2939672.2939785
 5. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297
    (1995), https://doi.org/10.1007/BF00994018
 6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
    tional transformers for language understanding. arXiv preprint arXiv:1810.04805
    (2018)
 7. Ghosh, A., Li, G., Veale, T., Rosso, P., Shutova, E., Barnden, J.A., Reyes,
    A.: Semeval-2015 task 11: Sentiment analysis of figurative language in twit-
    ter. In: Proceedings of the 9th International Workshop on Semantic Evaluation,
    SemEval@NAACL-HLT 2015, Denver, Colorado, USA, June 4-5, 2015. pp. 470–478
    (2015), http://aclweb.org/anthology/S/S15/S15-2080.pdf
 8. Grave, E., Mikolov, T., Joulin, A., Bojanowski, P.: Bag of tricks for efficient text
    classification. In: Proceedings of the 15th Conference of the European Chapter of
    the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April
    3-7, 2017, Volume 2: Short Papers. pp. 427–431 (2017), https://aclanthology.info/papers/E17-2068/e17-2068
 9. Grice, H.P.: Logic and conversation. In: Cole, P., Morgan, J.L. (eds.) Syntax
    and Semantics, vol. 3: Speech Acts, pp. 41–58. Academic Press (1975)
10. Hernández-Farías, I., Patti, V., Rosso, P.: Irony detection in twitter: The role of
    affective content. ACM Trans. Internet Techn. 16(3), 19:1–19:24 (2016),
    https://doi.org/10.1145/2930663
11. Liu, B.: Sentiment analysis and opinion mining. Synthesis lectures on human lan-
    guage technologies 5(1), 1–167 (2012)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
    sentations in vector space. In: 1st International Conference on Learning Represen-
    tations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track
    Proceedings (2013), http://arxiv.org/abs/1301.3781
13. Ortega-Bueno, R., Rangel, F., Hernández Farías, D.I., Rosso, P., Montes-y-Gómez,
    M., Medina Pagola, J.E.: Overview of the Task on Irony Detection in Spanish
    Variants. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF
    2019), co-located with 34th Conference of the Spanish Society for Natural Language
    Processing (SEPLN 2019). CEUR-WS.org (2019)
14. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
    Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., VanderPlas, J., Passos, A.,
    Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine
    learning in python. Journal of Machine Learning Research 12, 2825–2830 (2011),
    http://dl.acm.org/citation.cfm?id=2078195
15. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation
    for language variety identification. In: Proceedings of the 17th International Con-
    ference on Computational Linguistics and Intelligent Text Processing (CICLing
    2016). LNCS, vol. 9624, pp. 156–169. Springer-Verlag (2018)
16. Reyes, A., Rosso, P., Veale, T.: A multidimensional approach for detecting irony
    in twitter. Language Resources and Evaluation 47(1), 239–268 (2013),
    https://doi.org/10.1007/s10579-012-9196-x
17. Rosenthal, S., Nakov, P., Kiritchenko, S., Mohammad, S., Ritter, A., Stoyanov,
    V.: Semeval-2015 task 10: Sentiment analysis in twitter. In: Proceedings of the
    9th international workshop on semantic evaluation (SemEval 2015). pp. 451–463
    (2015)




18. Van Hee, C., Lefever, E., Hoste, V.: Semeval-2018 task 3: Irony detection in en-
    glish tweets. In: Proceedings of The 12th International Workshop on Semantic
    Evaluation. pp. 39–50 (2018)
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
    L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information
    Processing Systems 30: Annual Conference on Neural Information Processing Sys-
    tems 2017, 4-9 December 2017, Long Beach, CA, USA. pp. 6000–6010 (2017),
    http://papers.nips.cc/paper/7181-attention-is-all-you-need
20. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: SIGIR
    ’99: Proceedings of the 22nd Annual International ACM SIGIR Conference on
    Research and Development in Information Retrieval, August 15-19, 1999, Berkeley,
    CA, USA. pp. 42–49 (1999), https://doi.org/10.1145/312624.312647



