=Paper=
{{Paper
|id=Vol-2421/HAHA_paper_11
|storemode=property
|title=UTMN at HAHA@IberLEF2019: Recognizing Humor in Spanish Tweets using Hard Parameter Sharing for Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2421/HAHA_paper_11.pdf
|volume=Vol-2421
|authors=Anna Glazkova,Nadezhda Ganzherli,Elena Mikhalkova
|dblpUrl=https://dblp.org/rec/conf/sepln/GlazkovaGM19
}}
==UTMN at HAHA@IberLEF2019: Recognizing Humor in Spanish Tweets using Hard Parameter Sharing for Neural Networks==
UTMN at HAHA@IberLEF2019: Recognizing Humor in Spanish Tweets using Hard Parameter Sharing for Neural Networks

Anna Glazkova1[0000−0001−8409−6457], Nadezhda Ganzherli1, and Elena Mikhalkova1[0000−0003−0781−8633]

University of Tyumen, Tyumen, Russia
{a.v.glazkova,n.v.ganzherli,e.v.mikhalkova}@utmn.ru

Abstract. Automatic humor detection is a hard and challenging task. For the HAHA competition at IberLEF 2019 we built a neural network classifier that applies different types of neural networks to specific sets of features. After being trained separately, the layers are concatenated to give the general output. On the binary detection of humorous tweets, our system reaches an F-score of 0.76, which is considerably higher than the results of baseline machine learning classifiers and earns us ninth place in the ranking table. As for Task 2, where the system has to predict how funny a tweet is based on the number of stars it received, our result is similarly good: RMSE = 0.945. However, much remains to be done to evaluate the contribution of each feature set and our choice of neural network type.

Keywords: Humor detection · Neural networks · Hard parameter sharing · Feature engineering.

1 Introduction

Humor detection is a non-trivial task that was considered by (16) to be of the AI-complete kind, as humor is “one of the most sophisticated forms of human intelligence”. However, with the rise of neural networks and semantic vector algorithms, e.g. the one suggested by (7), it has recently gained much attention from researchers and organizers of competitions: SemEval by (11; 8) and HAHA at IberLEF by (2; 4). Many systems at these competitions, including the winners, apply machine learning and semantic vectors:

1. INGEOTEC by (10) uses a combination of machine learning methods and word embeddings.
2. UO UPV by (9) is based on a Bidirectional LSTM neural network and word embeddings.

Copyright © 2019 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019).

3. JU-CSE-NLP by (12) “is a rule-based implementation of a dependency network and a hidden Markov model”.
4. Idiom Savant by (5) “consists of two probabilistic models... using Google n-grams and Word2Vec”.
5. PunFields by (6) applies a linear SVM classifier to a manually built thesaurus of English words.

Our approach at HAHA@IberLEF2019 is no exception to this trend.

2 Dataset and Preprocessing

“HUMOR: A Crowd-Annotated Spanish Corpus for Humor Analysis” by (3) was created in 2017. At HAHA@IberLEF2019 the training set consisted of 16,000 tweets, manually annotated as humorous or not humorous, with a funniness score calculated as the average number of “stars” (5 maximum) given to a tweet by several independent readers. 1,200 tweets from the training dataset were used for validation. The test set included 4,000 tweets. In addition, we used several resources such as pre-trained word embeddings and a sentiment dictionary; they are described in the next section.

We first preprocess tweets with the help of our own software that performs the following steps:

1. Convert some markers of emotions into words: :( to tristeza and :) :D xD XD to reir.
2. Convert repetitive sequences of jaja... and JAJA... to a simple ja.
3. Pre-tokenize: add a space before and after punctuation symbols except # and @.
4. Convert repetitions of the same letter of length ≥ 3 into a single letter and add the lemma EMPHASIS to the tweet, so that it lexically denotes the sentiment implied by letter repetition: sooooomos to somos EMPHASIS.
5. Tokenize hashtags (#) and mentions (@):
(a) De-capitalize capitalized sequences of more than one letter, leaving the first letter capitalized: NUET to Nuet.
(b) Add a space before and after every non-letter character in the sequence: PP#CiU to PP # CiU.
(c) Add a space before every capitalized character in the sequence: DaviniaBono to Davinia Bono.
6. Convert emoticons into their word representations using Unicode tables in Spanish (we used tables from https://unicode-table.com/es).

Our Python script for tweet preprocessing (except emoticons) and the system we used at the competition will soon be available at https://github.com/evrog/Spanish_Humor. As concerns the choice of steps, it is basically a tradeoff between not ruining words with traditional orthography and extracting as many lemmas as possible from Internet-specific speech. The final stage of preprocessing is lemmatization with SpaCy (https://spacy.io/api/lemmatizer): the output is a list of lemmas and special characters in a tweet.

3 System Architecture

As mentioned above, our model is based on a neural network. It processes several types of features in parallel, then concatenates the layers’ outputs and passes them to a Dense layer. (14) calls this approach “hard parameter sharing” and links it to the work on Multitask Learning by (1). We used a subset of 1,200 tweets from the training dataset for validation, with accuracy as the validation measure. To avoid overfitting, we used the early stopping strategy (the value of patience is 5) and dropout regularization for the output layer (the fraction of input units to drop is 0.8). The model learns from four sets of features in parallel:

1. Tweets represented as sequences. The length of word embeddings is 300. The weight matrix is built from pre-trained word embeddings for Spanish (https://www.kaggle.com/rtatman/pretrained-word-vectors-for-spanish/). In our experiment, the Convolutional neural network learned better from these features than the Recurrent one, so these features are fed to a CNN followed by a combination of MaxPooling and a flattening layer. The Convolutional layer contains 64 filters with a kernel size of 5. The size of the max pooling window is 20.

2. Tweets represented as a Bag-of-Words and smoothed with TF-IDF. These features are restricted to the 5,000 most frequent words, due to the computational complexity of a larger vocabulary, and passed to a Dense layer.

3. Features of sentiment and topic modelling, extracted from tweets. To represent sentiment in every tweet, we used “affective norms” calculated by (15) (the dictionary of norms can be downloaded at http://crr.ugent.be/archives/1844). For each word in a tweet we collected its six norms from the dictionary, getting a word vector. We summed these vectors to create a sentence vector and applied MinMax normalization to scale its values between 0 and 1:

y_i = (x_i − x_min) / (x_max − x_min)    (1)

As concerns topics, we used LDA from Gensim (13) to extract the 20 most general topics from the collection (topic distribution) and calculate the distance from each tweet to each topic. The features are passed to a Dense layer.

4. Additional features. These features include: (a) presence (0) or absence (1) of emoji, lists, word repetitions, and special characters (e.g. !, ?); (b) quantitative features normalized with MinMax: the number of words and lines, the minimum and maximum distance between embeddings, and the minimum Levenshtein distance between a pair of words (to detect puns), applied to all possible pairs of words in a tweet. These features are also passed to a Dense layer.

Fig. 1 demonstrates the general outline of our best-performing neural network used for the task of binary classification. The scheme includes the main parameters, e.g. the window size of 20 in 20: MaxPooling.
The optimizer is Adam (adaptive moment estimation); the loss function is binary crossentropy; the activation function at the hidden layers is ReLU, and at the output layer it is softmax. The last layer includes Dropout regularization with probability 0.8. For the second task the architecture is similar to that of the first task, except for the input values, which are separated into classes according to the average number of stars the tweets earned. Also, in the second task we use the mean squared error for validation.

Fig. 1. Architecture of the best-performing neural network.

4 Test Results

Table 1 presents the results of our system compared to the winners of the two tasks. The first four measures are for Task 1, and the last column presents Task 2. As concerns Task 1, the performance of our system is average compared to other teams’ results. However, it is well above chance and considerably higher than the results of baseline machine learning classifiers.

Table 1. Test results at the competition. The baseline result was provided by the competition organizers and marked in the scoring table as “hahaPLN”. Ranks are given in parentheses.

System    F-score     Precision   Recall      Accuracy    Task 2: RMSE
Our       0.760 (9)   0.756 (9)   0.765 (8)   0.812 (9)   0.945 (8)
Winner    0.821 (1)   0.791 (4)   0.852 (1)   0.855 (1)   0.736 (1)
Baseline  0.440 (19)  0.394 (19)  0.497 (18)  0.505 (18)  2.455 (14)

5 Conclusion

The application of computational methods, in particular word embeddings and neural networks, to the analysis of figurative speech and its varieties, such as humor, has recently proved very effective for annotating large corpora. It also gives a new perspective on the analysis of language features that are important in humor production and appreciation.
Our approach consisted of testing sets of different features in a growing combination: at each step we added a feature set and a subnetwork to the architecture and checked whether they improved our result. For example, including the sentiment dictionary (see above: “affective norms”) improved our F-score by 0.015. The features we chose are common in the systems mentioned in Section 1: vocabulary represented as word embeddings, a TF-IDF-weighted Bag-of-Words, a sentiment dictionary, and special characters that represent emotions on Twitter (e.g. :)).

As for the system architecture, we tested the so-called hard parameter sharing. Our system uses different neural networks that we empirically found to be more capable of dealing with each specific set of features. In general, we use a CNN for embeddings and Dense layers for the other feature sets.

The result of our system is much higher than that of the baseline and average compared to other participants. However, the value of each of the feature sets and of the neural network model has yet to be evaluated more closely. We plan to combine our features with other types of neural networks. Also, the model might have given a better result in Task 2 if we had used a regression model instead of multi-class classification; however, this is yet to be tested.

Acknowledgements

The reported study was funded by RFBR according to research project No. 18-37-00272.

Bibliography

[1] Caruana, R.A.: Multitask learning: A knowledge-based source of inductive bias. In: Machine Learning Proceedings 1993, pp. 41–48. Morgan Kaufmann, San Francisco (CA) (1993). https://doi.org/10.1016/B978-1-55860-307-3.50012-5
[2] Castro, S., Chiruzzo, L., Rosá, A.: Overview of the HAHA task: Humor analysis based on human annotation at IberEval 2018. In: CEUR Workshop Proceedings. vol. 2150, pp.
187–194 (2018)
[3] Castro, S., Chiruzzo, L., Rosá, A., Garat, D., Moncecchi, G.: A crowd-annotated Spanish corpus for humor analysis. In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media. pp. 7–11 (2018)
[4] Chiruzzo, L., Castro, S., Etcheverry, M., Garat, D., Prada, J.J., Rosá, A.: Overview of HAHA at IberLEF 2019: Humor analysis based on human annotation. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings, CEUR-WS, Bilbao, Spain (9 2019)
[5] Doogan, S., Ghosh, A., Chen, H., Veale, T.: Idiom Savant at SemEval-2017 Task 7: Detection and interpretation of English puns. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 103–108 (2017)
[6] Mikhalkova, E., Karyakin, Y.: PunFields at SemEval-2017 Task 7: Employing Roget’s thesaurus in automatic pun recognition and interpretation. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 426–431 (2017)
[7] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
[8] Miller, T., Hempelmann, C., Gurevych, I.: SemEval-2017 Task 7: Detection and interpretation of English puns. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 58–68 (2017)
[9] Ortega-Bueno, R., Muniz-Cuza, C.E., Pagola, J.E.M., Rosso, P.: UO UPV: Deep linguistic humor detection in Spanish social media.
In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018) (2018)
[10] Ortiz-Bejar, J., Salgado, V., Graff, M., Moctezuma, D., Miranda-Jiménez, S., Tellez, E.S.: INGEOTEC at IberEval 2018 Task HAHA: µTC and EvoMSA to detect and score humor in texts. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018) (2018)
[11] Potash, P., Romanov, A., Rumshisky, A.: SemEval-2017 Task 6: #HashtagWars: Learning a sense of humor. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 49–57 (2017)
[12] Pramanick, A., Das, D.: JU CSE NLP @ SemEval 2017 Task 7: Employing rules to detect and interpret English puns. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 432–435 (2017)
[13] Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en
[14] Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017)
[15] Stadthagen-Gonzalez, H., Imbault, C., Sánchez, M.A.P., Brysbaert, M.: Norms of valence and arousal for 14,031 Spanish words. Behavior Research Methods 49(1), 111–123 (2017)
[16] Stock, O., Strapparava, C.: HAHAcronym: Humorous agents for humorous acronyms. In: Stock, O., Strapparava, C., Nijholt, A. (eds.), pp. 125–135 (2002)