<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>INGEOTEC at IberLEF 2019 Task HaHa</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jose Ortiz-Bejar</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Tellez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Graff</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Moctezuma</string-name>
          <email>dmoctezuma@centrogeo.edu.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabino Miranda-Jimenez</string-name>
          <email>sabino.mirandag@infotec.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>CONACyT Consejo Nacional de Ciencia y Tecnología</institution>
          ,
<addr-line>Dirección de Cátedras</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Centro de Investigación en Ciencias de Información Geoespacial A.C.</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
<institution>Universidad Michoacana de San Nicolás de Hidalgo</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>203</fpage>
      <lpage>211</lpage>
      <abstract>
        <p>This manuscript describes INGEOTEC's participation in the second Humor Analysis based on Human Annotation (HAHA) task at IberLEF 2019. Our approach to the task was based on an extensive comparison of several text classifiers using an 80-20 holdout validation methodology. We found that our generic text categorization and regression system (µTC) had the best performance. Finally, we conducted an analysis of the training dataset illustrating some of the task's complexity.</p>
      </abstract>
      <kwd-group>
        <kwd>Humor Analysis</kwd>
        <kwd>Text Categorization</kwd>
        <kwd>Model's Performance Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Informal writing is a complex process; it is full of ambiguity and of subjective and figurative
statements. Humans learn to interpret this kind of language and expression from
their culture and environment. While it is quite natural for almost anybody to
understand informal language, modeling it is a complicated task: informal language
is full of variants and dependencies on cultural traits that make humor identification
a challenging problem.</p>
      <p>Therefore, in the end, the identification process can be tackled with a large and
diverse knowledge base, a sufficiently robust model of the text, and a high-performing
learning algorithm. However, automatic humor detection as a supervised learning
problem is complicated, since the language traits that make something humorous are
hard to bound in a set of rules. It is even difficult to achieve agreement among humans
on what is funny and what is not; therefore, a labeled dataset must be carefully curated.</p>
      <p>The IberLEF 2019 forum ran a task devoted to Humor Analysis based on Human
Annotation (HAHA). Here, a set of human-labeled messages from Twitter is
provided to train and test algorithms for humor identification (classification) and ranking
(regression). In more detail, each text is labeled as humorous or not humorous; a
humor-intensity score is also given, which defines a ranking problem.</p>
      <p>
        This manuscript describes the participation of the INGEOTEC team in the HAHA
challenge. In addition to describing our methodology and internal comparisons, we
performed an analysis of the training set to explain our performance. In particular, we
use our µTC system [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The HAHA challenge is described in Section 2. Section 3 describes
our general methodology to solve the task. Section 4 is devoted to describing our
systems, while the experimental methodology and results are discussed in Section 5. Finally,
Section 7 concludes our contribution.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Task Description</title>
      <p>
        Humor Analysis based on Human Annotation (HAHA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] asks for systems that
classify tweets, written in Spanish, as humorous or not. It also asks for systems
that determine how funny a tweet is. The two sub-tasks are described by the HAHA
organizers as follows.
Humor detection: determining whether a tweet is a joke or not (humor intended by the
author or not). The results of this sub-task are measured using the F-measure for the
humorous category and accuracy; the F-measure is the primary measure for this sub-task.
Funniness score prediction: predicting a funniness score (average stars) for a
tweet in a 5-star ranking, assuming it is a joke. The results of this sub-task are
measured using the root-mean-squared error (RMSE).
      </p>
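      <p>For illustration only, the following Python sketch (toy labels and scores, not task data) computes the three measures with scikit-learn; it is a minimal sketch of the evaluation, not part of the official task kit.</p>
      <preformat>
# Minimal sketch of the official measures on hypothetical toy labels.
from sklearn.metrics import f1_score, accuracy_score, mean_squared_error

# Classification sub-task (1 = humorous, 0 = not humorous).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("F1 (humorous class):", f1_score(y_true, y_pred, pos_label=1))
print("Accuracy:", accuracy_score(y_true, y_pred))

# Regression sub-task (funniness scores, i.e., average stars).
s_true = [2.5, 0.0, 3.2]
s_pred = [2.0, 0.5, 3.0]
print("RMSE:", mean_squared_error(s_true, s_pred) ** 0.5)
      </preformat>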
      <p>
        The first sub-task can be solved as a classification problem, while the second one
can be tackled as a regression problem. The training set is a corpus of 24,000
crowd-annotated tweets, as described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Multiple annotators evaluated each tweet, and
each annotation consists of the class (humorous or not) and the intensity (number
of stars, 0-5). The final label is determined using a voting scheme. Table 2 shows an
example of the content of the provided dataset.
      </p>
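      <p>To make the annotation scheme concrete, the next sketch (our own illustration; the field names are hypothetical, not the corpus format) aggregates the annotations of a single tweet into a binary label by majority vote and an average-star intensity.</p>
      <preformat>
# Hypothetical annotations for one tweet: a joke/not-joke vote and, when the
# annotator marked it as a joke, a star rating.
annotations = [
    {"is_joke": True, "stars": 3},
    {"is_joke": True, "stars": 4},
    {"is_joke": False, "stars": None},
]

votes = sum(1 for a in annotations if a["is_joke"])
is_humorous = votes * 2 > len(annotations)      # simple majority vote
stars = [a["stars"] for a in annotations if a["stars"] is not None]
avg_stars = sum(stars) / len(stars) if stars else 0.0
print(is_humorous, avg_stars)
      </preformat>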
    </sec>
    <sec id="sec-3">
      <title>Our humor detection approach</title>
      <p>
        We chose to use the modeling procedure for humor analysis described in our previous
work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Briefly, the idea is to create a single model for both the classification and
regression tasks; it is based on computing a sparse or a dense vector space model
and then trying different learning methods that support the data model. The sparse
vector space is created through µTC using an optimized text model, based on the
hyperparameter selection of the tool. For the dense modeling, we use diverse word
embedding models and then summarize the word vectors into a single dense vector.
      </p>
      <p>
        Figure 1 illustrates our generic supervised model for humor classification and
regression. For our sparse vector models, we start the process with the training set
T, a set of short text messages; the text is preprocessed and tokenized using multiple
schemes such as word n-grams, character q-grams, and skip-grams. This bag of tokens
is vectorized through a weighting scheme. This procedure generates the vector space
X, which can be used by a classifier or a regressor; Y depicts the output associated
with the training set. The model's quality depends on the entire pipeline. The entire
process is documented in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>[Figure 1: The training set T and its outputs Y feed the vector space model (VSM), which produces X for a classifier or regressor.]</p>
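      <p>µTC selects this pipeline automatically, as described above; purely as a simplified stand-in for one fixed configuration (TF-IDF weighting instead of the entropy-based one, toy texts), the following scikit-learn sketch makes the flow from T through the VSM to X and the classifier concrete.</p>
      <preformat>
# Simplified stand-in for the sparse pipeline: tokenize, weight, classify.
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

T = ["que risa me dio este chiste", "reporte del clima para manana"]  # toy texts
Y = [1, 0]                                                            # toy labels

# Word n-grams plus character q-grams; the real system also searches
# skip-grams and uses an entropy-based weighting.
vsm = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
)
model = make_pipeline(vsm, LinearSVC())
model.fit(T, Y)
print(model.predict(["otro chiste muy malo"]))
      </preformat>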
      <p>A similar procedure is followed for the dense models; that is, the text is preprocessed,
and we use both pretrained and self-computed word embeddings to model
our text. X is computed as the normalized sum of the word vectors that
compose each text.</p>
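      <p>As an illustration of this dense representation (sum the word vectors of a text and normalize), here is a small sketch assuming a pre-trained embedding table is available as a Python dictionary; the table, dimensionality, and tokenizer are placeholders.</p>
      <preformat>
import numpy as np

# Placeholder embedding table; in practice this is a pre-trained model.
emb = {"hola": np.array([0.1, 0.3]), "mundo": np.array([0.2, -0.1])}
dim = 2

def dense_vector(text):
    """Sum the vectors of the known words and L2-normalize the result."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    v = np.sum(vecs, axis=0) if vecs else np.zeros(dim)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

X = np.vstack([dense_vector(t) for t in ["hola mundo", "texto desconocido"]])
print(X)
      </preformat>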
    </sec>
    <sec id="sec-4">
      <title>Systems description</title>
      <p>Our best solutions for both tasks were obtained using the µTC system. However, we
explored the use of fastText and Flair along with multiple combinations of word
embeddings, ranging from simple character-based embeddings to the state-of-the-art BERT. In the
following, we describe these approaches in more detail, as well as some
findings that help to hypothesize why our other attempts did not show any improvement over
our µTC baseline.</p>
      <p>
        µTC [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a minimalist tool that generates text classifiers that maximize a
performance measure. It manages the entire pipeline of a text classifier, as specified in
the previous section. Under the hood, µTC uses a linear Support Vector Machine as the
classifier. The core idea behind µTC is to define a parameter space describing a massive
number of text classifiers; this space is then searched for a competitive text classifier using
a set of heuristics based on random search and hill climbing.
      </p>
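      <p>The following sketch conveys the search idea only (it is not µTC's implementation): sample random configurations from a tiny illustrative parameter space, score them with a small cross-validation, and hill-climb by mutating the best one.</p>
      <preformat>
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy data; the real search runs over the HAHA training set.
texts = ["jaja que chiste", "me rei mucho", "humor del bueno",
         "reporte del clima", "noticias de hoy", "junta a las tres"]
labels = [1, 1, 1, 0, 0, 0]

# Tiny illustrative configuration space; the real space is much larger.
SPACE = {"analyzer": ["word", "char_wb"],
         "ngram_range": [(1, 1), (1, 2), (2, 4)]}

def evaluate(cfg):
    X = TfidfVectorizer(**cfg).fit_transform(texts)
    return cross_val_score(LinearSVC(), X, labels, cv=3, scoring="f1").mean()

def mutate(cfg):
    new = dict(cfg)
    key = random.choice(list(SPACE))
    new[key] = random.choice(SPACE[key])
    return new

# Random search for a starting point, then a few hill-climbing steps.
best = max(({k: random.choice(v) for k, v in SPACE.items()} for _ in range(4)),
           key=evaluate)
for _ in range(4):
    cand = mutate(best)
    if evaluate(cand) > evaluate(best):
        best = cand
print(best)
      </preformat>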
      <p>
        We also probed EvoMSA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], our multilingual sentiment analysis system based on
genetic programming. The core idea is to use several diverse models to solve
the task through a stacking scheme guided by genetic programming. This approach is
particularly robust with unbalanced classes. Besides, we tested our baseline algorithm
for multilingual sentiment analysis (B4MSA) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]; this system is a sentiment classifier
for informal text such as Twitter messages. Its design is similar to µTC, but the
internal optimization problem is solved differently, along with the use of specific features for
sentiment analysis and some language-dependent capabilities.
      </p>
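      <p>EvoMSA's actual combiner is evolved with genetic programming; purely as a rough analogue of the stacking idea, the sketch below combines two simple text models through scikit-learn's StackingClassifier with a logistic meta-learner (toy data and illustrative choices, not EvoMSA's components).</p>
      <preformat>
# Rough analogue of a stacking scheme; EvoMSA evolves the combiner instead.
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["jaja muy bueno", "que aburrido", "chiste buenisimo", "informe anual"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(texts).toarray()   # shared representation
base = [("svm", LinearSVC()), ("nb", MultinomialNB())]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(), cv=2)
stack.fit(X, labels)
print(stack.predict(X))
      </preformat>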
      <p>
        Additionally, we also tested third-party tools such as FastText [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is a library for
text classification and word vector representation. It transforms text into
continuous vectors that can later be used on any language-related task. FastText represents
sentences with a weighted bag of words, and each word is represented as a bag of
character n-grams to create text vectors. This representation is based on the skip-gram
model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which takes into account subword information and shares information
across classes through a hidden representation. Also, it employs a hierarchical softmax
classifier that takes advantage of the unbalanced distribution of the classes to speed
up computation. In addition to the default configuration, we optimized many of the
parameters of FastText along with different preprocessing functions; we used random
search over a configuration space for this purpose. Finally, we also tested the multilingual
library Flair [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which implements state-of-the-art NLP models for tasks such as named entity
recognition, part-of-speech tagging, sense disambiguation, and classification. Flair
allows one to use and combine different word and document embeddings, among which
stand out Flair embeddings, BERT embeddings [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and ELMo embeddings (see [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
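      <p>A minimal sketch of the fastText supervised mode follows; the file name and hyperparameter values are illustrative placeholders, not the configurations found by our random search.</p>
      <preformat>
# Minimal fastText supervised sketch (pip package "fasttext").
import fasttext

# The training file uses one "__label__CLASS text" line per tweet, e.g.
# "__label__humor jaja que buen chiste".
model = fasttext.train_supervised(
    input="haha_train.txt",   # hypothetical training file
    lr=0.1,
    epoch=25,
    wordNgrams=2,             # word n-grams
    minn=2, maxn=4,           # character n-grams (subword information)
    loss="hs",                # hierarchical softmax
)
labels, probs = model.predict("otro chiste muy malo")
print(labels, probs)
      </preformat>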
    </sec>
    <sec id="sec-5">
      <title>Experiments and results</title>
      <p>
        We tested multiple approaches using the tools described above. The experimental
setup consisted of using the set T of 24,000 human-annotated tweets provided by the task
organizers. Firstly, T was split into training (Tt) and validation (Tv) sets following an
80-20 proportion. Table 2 describes the validation and training sets. We evaluated the
systems described in the previous section, as well as EvoMSA enriched with different lexicons and decision functions learned from other
sentiment analysis tasks. Table 3 summarizes the results of our experiments.
To understand why there is no improvement over µTC, we performed an analysis of
the training and validation set vocabularies to determine the similarities and differences
between them. For this purpose, we need to detail the weighting scheme that
our µTC uses to tackle the problem, the entropy+b term weighting defined in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ];
it is based on representing each term by its entropy computed from the empirical
distribution of the available classes, using a smoothing parameter b, in this case, b = 3.
More precisely, it is defined as follows:
      </p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>\mathrm{entropy}_b(w) = \log |C| - \sum_{c \in C} p_c(w; b)\, \log \frac{1}{p_c(w; b)}</tex-math>
      </disp-formula>
      <p>
        where C is the set of classes and p_c(w; b) is the probability that the term w occurs in
class c, parametrized with the smoothing constant b. In more detail,
      </p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math>p_c(w; b) = \frac{\mathrm{freq}_c(w) + b}{b\,|C| + \sum_{c \in C} \mathrm{freq}_c(w)}</tex-math>
      </disp-formula>
      <p>Here freq_c(w) denotes the frequency of the term w in class c. Using this approach,
it is possible to find the set of most discriminant terms among classes; we can define
thresholds for any vocabulary size if we normalize entropy_b by the logarithm of the
vocabulary's size such that terms are weighted with values between 0 and 1.</p>
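      <p>The sketch below (our own, with toy class frequencies) computes Equations (1) and (2); for the normalization, it divides by log|C|, an assumption on our part that matches the stated 0-1 range of the weights.</p>
      <preformat>
import math

def p_c(freqs, c, b):
    """Equation (2): smoothed probability of the term in class c."""
    return (freqs[c] + b) / (b * len(freqs) + sum(freqs.values()))

def entropy_b(freqs, b=3):
    """Equation (1): log|C| minus the smoothed entropy of the term."""
    h = sum(p_c(freqs, c, b) * math.log(1.0 / p_c(freqs, c, b)) for c in freqs)
    return math.log(len(freqs)) - h

def normalized_entropy_b(freqs, b=3):
    """Assumed normalization to [0, 1] so a fixed threshold (e.g., 0.15) applies."""
    return entropy_b(freqs, b) / math.log(len(freqs))

# Toy per-class frequencies of one term (humorous vs. not humorous).
freqs = {"humor": 40, "not_humor": 2}
print(normalized_entropy_b(freqs))
      </preformat>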
      <p>Table 4 shows the sizes of the vocabularies after removing those terms with less
than 0.15 of normalized entropy_b. The size of the training vocabulary is close to 80
thousand items, while the vocabulary of the validation set has close to 40 thousand entries.
Please recall that we used an 80-20 partition, and that the vocabulary results from the combination of
tokenizers determined by µTC. To explain the findings, we refer to the term set
of the training set as Mt and the term set of the validation set as Mv. The union and
intersection sizes of the training and validation sets are also listed in the table.</p>
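      <p>The vocabulary comparison itself reduces to plain set operations; the following sketch uses placeholder term sets in place of the roughly 80 and 40 thousand entries of Mt and Mv.</p>
      <preformat>
# Hypothetical term sets after the 0.15 normalized-entropy filter.
Mt = {"jaja", "chiste", "risa", "clima"}     # training-set terms
Mv = {"jaja", "chiste", "broma"}             # validation-set terms

print("train:", len(Mt), "validation:", len(Mv))
print("union:", len(Mt.union(Mv)), "intersection:", len(Mt.intersection(Mv)))
# Terms present in only one of the partitions:
print("only in training:", Mt - Mv)
print("only in validation:", Mv - Mt)
      </preformat>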
      <p>The intersection and union sizes in Table 4 indicate the number of non-shared
terms between the training and validation sets; given this, it might be supposed that
semantic models would achieve good performance. However, in our experiments,
multiple state-of-the-art word embeddings did not outperform the µTC model. To
gain more insight, we produced Figure 2 by sorting all terms according to their entropy
in the training set, assigning a value of 0 to every term missing from Mt or Mv.
The positive zone gathers those terms with a higher discriminant power in
the training set than in the validation set. The negative zone collects those terms with
a higher discriminant power in the validation set. The zone around zero gathers terms
that have a similar entropy score in both the training and validation sets. Therefore, those
terms with entropy_b = 1 are highly discriminant in the training set but are not part of
the vocabulary of the validation set; conversely, terms with a -1 weight are those that are
not part of the training set but exhibit unitary entropy values in the validation set.</p>
      <p>
        The previous discussion suggests that semantic models may work better; however,
pure word-embedding models achieved lower performance, as described in our
experimental results. Trying to improve, we also tried kernel methods, which specialize in
non-linear problems; in particular, we used the technique described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This
approach separates the training set but does not show any significant improvement
in the validation scores; this behavior is an obvious symptom of overfitting.
      </p>
      <p>[Figure panels: (a) raw vectors of the training set; (b) kernelized projection of the training set; (c) raw vectors of the validation set; (d) kernelized projection of the validation set.]</p>
      <p>
The regression sub-task was tackled using the classification model of µTC but
replacing the linear SVM classifier with the linear SVM regressor (SVR) available in
scikit-learn [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This regressor was used to produce the predictions submitted for the test dataset.
      </p>
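      <p>A minimal sketch of this substitution follows, with the text representation simplified to TF-IDF for illustration and scikit-learn's LinearSVR used as a stand-in for the linear SVR named above; the data and hyperparameters are toy values.</p>
      <preformat>
# Sketch of the regression variant: sparse text representation, linear SVR head.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVR

texts = ["chiste muy bueno jaja", "texto serio sin gracia", "algo de humor"]
stars = [4.2, 0.0, 2.5]        # funniness scores (average stars), toy values

reg = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LinearSVR())
reg.fit(texts, stars)
print(reg.predict(["otro chiste"]))
      </preformat>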
    </sec>
    <sec id="sec-6">
      <title>Task Results</title>
      <p>Our best result, obtained with µTC, ranked fifth out of 19 contestants in both the classification
and regression tasks. Table 5 shows the scores for µTC, the organizers' baseline, and those
of the winning approach. For the classification task, the F1 measure was used to decide
the winner, while the Root Mean Squared Error (RMSE) was used for the regression task.</p>
      <p>This paper describes the participation of the INGEOTEC team in the HAHA'19
challenge. Our final approach uses our µTC system to perform both the classification
and regression tasks. For this edition, we tested several approaches based on different
semantic models, as well as our stacking solution EvoMSA; however, none of them was able
to beat µTC. We performed a qualitative analysis to find possible reasons for this
situation: our training and validation partitions contained
very different vocabularies and semantics and exhibited a high bias, as was experimentally shown.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The authors would like to thank the Consorcio en Inteligencia Artificial for partially
funding this work through Project FORDECyT 296737.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akbik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blythe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          .
          <source>In: COLING</source>
          <year>2018</year>
          , 27th International Conference on Computational Linguistics. pp.
          <volume>1638</volume>
          -
          <issue>1649</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moncecchi</surname>
          </string-name>
          , G.:
          <article-title>A crowd-annotated spanish corpus for humor analysis</article-title>
          .
          <source>In: Proceedings of SocialNLP</source>
          <year>2018</year>
          ,
          <source>The 6th International Workshop on Natural Language Processing for Social Media</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etcheverry</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prada</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : Overview of HAHA at IberLEF 2019:
          <article-title>Humor Analysis based on Human Annotation</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ).
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS, Bilbao, Spain (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Evomsa: A multilingual evolutionary approach for sentiment analysis</article-title>
          .
          <source>arXiv preprint arXiv:1812.02307</source>
          (
          <year>2018</year>
          ), https://github.com/INGEOTEC/EvoMSA
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          . pp.
          <volume>427</volume>
          -
          <fpage>431</fpage>
          . Association for Computational Linguistics (
          <year>April 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>3111</volume>
          -
          <issue>3119</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ortiz-Bejar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salgado</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.:</given-names>
          </string-name>
          <article-title>INGEOTEC at IberEval 2018 task HaHa: µTC and EvoMSA to detect and score humor in texts</article-title>
          .
          <source>In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ortiz-Bejar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortiz-Bejar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez</surname>
            ,
            <given-names>C.N.:</given-names>
          </string-name>
          <article-title>I3go+ at ricatim 2017: A semi-supervised approach to determine the relevance between images and text-annotations</article-title>
          .
          <source>In: 2017 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC)</source>
          . pp.
          <volume>1</volume>
          -
          <issue>6</issue>
          . IEEE
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <volume>2825</volume>
          -
          <fpage>2830</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suarez</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siordia</surname>
            ,
            <given-names>O.S.:</given-names>
          </string-name>
          <article-title>A simple approach to multilingual polarity classification in Twitter</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>94</volume>
          ,
          <issue>68</issue>
          -
          <fpage>74</fpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1016/j.patrec.2017.05.024
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>An automated text categorization framework based on hyperparameter optimization</article-title>
          .
          <source>Knowledge-Based Systems 149</source>
          ,
          <fpage>110</fpage>
          -
          <fpage>123</fpage>
          (
          <year>2018</year>
          ), https://github.com/INGEOTEC/microtc
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Gender and language-variety identification with MicroTC: Notebook for PAN at CLEF 2017</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          . vol.
          <volume>1866</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>An automated text categorization framework based on hyperparameter optimization</article-title>
          .
          <source>Knowledge-Based Systems</source>
          (
          <year>2018</year>
          ). https://doi.org/10.1016/j.knosys.2018.03.003
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>