=Paper=
{{Paper
|id=Vol-2111/paper5
|storemode=property
|title=A Dataset for Detecting Irony in Hindi-English Code-Mixed Social Media Text
|pdfUrl=https://ceur-ws.org/Vol-2111/paper5.pdf
|volume=Vol-2111
|authors=Deepanshu Vijay,Aditya Bohra,Vinay Singh,Syed Sarfaraz Akhtar,Manish Shrivastava
|dblpUrl=https://dblp.org/rec/conf/esws/VijayBSAS18
}}
==A Dataset for Detecting Irony in Hindi-English Code-Mixed Social Media Text==
Deepanshu Vijay*, Aditya Bohra*, Vinay Singh, Syed S. Akhtar, and Manish Shrivastava

Language Technology Research Centre, International Institute of Information Technology, Hyderabad

{deepanshu.vijay,aditya.bohra,vinay.singh,syed.akhtar}@research.iiit.ac.in, m.shrivastava@iiit.ac.in

Abstract. Irony is one of many forms of figurative language. Irony detection is crucial for Natural Language Processing (NLP) tasks such as sentiment analysis and opinion mining. From a cognitive point of view, it is a challenge to study how humans use irony as a communication tool. While relevant research has been done independently on code-mixed social media texts and on irony detection, our work is the first attempt at detecting irony in Hindi-English code-mixed social media text. In this paper, we treat automatic irony detection as a classification problem and present a Hindi-English code-mixed dataset consisting of tweets posted on Twitter. The tweets are annotated with the language at word level and with the class they belong to (Ironic or Non-Ironic). We also propose a supervised classification system for detecting irony in the text using various character-level, word-level, and structural features.

Keywords: code-mixing, language detection, linguistics, svm, random forest, hate-speech.

===1 Introduction===

Irony is a subtle form of humor in which there is a gap between the intended meaning and the literal meaning. Even though it is a widely studied linguistic phenomenon, no clear definition seems to exist [5]. Irony detection is a difficult task because irony often has ambiguous interpretations. Apart from its importance in sentiment analysis and opinion mining, irony detection is also vital in the areas of medical care and security [6]. Previous research on this task has mainly focused on monolingual texts [18, 2, 8, 5] owing to the availability of large-scale monolingual resources.
The popularity of opinion-rich online resources such as review forums and microblogging sites has encouraged users all across the world to express and convey their thoughts in real time. In multilingual societies like India, users often switch between two or more languages while communicating online.

(* These authors contributed equally to this work.)

Code-Mixing (CM) is the natural phenomenon of embedding linguistic units such as phrases, words or morphemes of one language into an utterance of another [13–15, 4]. English and Hindi are two of the most widely used languages in the world and, to the best of our knowledge, there are currently no online Hindi-English code-mixed resources available for detecting irony.

The following are some instances of Hindi-English code-mixed tweets. It can be observed that T1 and T2 contain irony, while T3 is a non-ironic tweet.

T1: “Wo ek teacher hai tab bhi life ke test mein fail ho gaya! Hahaha such irony :D”

Translation: “He is a teacher yet he failed in the test of life! Hahaha such irony :D.”

T2: “The kahawat ‘old is gold’ purani hogaee. Aaj kal ki nasal kehti hai ‘gold is old’, but the old kahawat only makes sense. #MindF #Irony.”

Translation: “The saying ‘old is gold’ is old. Today’s generation thinks ‘gold is old’ but only the old one makes sense. #MindF #Irony.”

T3: “mere single hone ke bawzood mujhe ye nahi pata tha aaj rose day he #irony.”

Translation: “In spite of me being single, I didn’t know today is rose day #irony.”

The structure of the paper is as follows. In Section 2, we review related research in the areas of code-mixing and irony detection. In Section 3, we describe the corpus creation and the annotation scheme. In Section 4, we present our system architecture, which includes the pre-processing steps and classification features. In Section 5, we present the results of experiments conducted using various character-level, word-level and structural features.
In the last section, we conclude the paper and outline future work, followed by the references.

===2 Background and Related Work===

[11] analysed data from Facebook posts generated by English-Hindi bilingual users. The analysis showed that a significant amount of code-mixing was present in the posts. [21] formalized the problem, created a POS-tag-annotated Hindi-English code-mixed corpus, and reported the challenges and problems posed by Hindi-English code-mixed text. They also performed experiments on language identification, transliteration, normalization and POS tagging of the dataset. [3] addressed the problem of shallow parsing of Hindi-English code-mixed social media text and developed a system that can identify the language of the words, normalize them to their standard forms, assign them their POS tags and segment them into chunks. [19] addressed the problem of language identification on Bengali-Hindi-English Facebook comments. They annotated a corpus and achieved an accuracy of 95.76% using statistical models with monolingual dictionaries. [12] developed a question classification system for Hindi-English code-mixed language using word-level resources such as language identification, transliteration, and lexical translation. [1, 16] performed sentiment identification in code-mixed social media text.

[18] proposed an algorithm for separating ironic from non-ironic similes in English, detecting common terms used in such ironic comparisons. [8] presented a corpus of 25,450 Italian tweets, of which 12.5% were ironic and 87.5% were non-ironic. They evaluated their dataset using two systems. The first system relies on lexical and semantic features characterising each word of a tweet. The second system uses word occurrences (a bag-of-words approach) as features to train a Decision Tree.
[2] proposed a model to detect irony in English tweets, pointing out that skip-grams, which capture word sequences that contain (or skip over) arbitrary gaps, are the most informative features. [5] presented a corpus generated from review pairs on Amazon that can be used to identify sarcasm and irony in a tweet. [9] collected and annotated a set of ironic examples from a common collective Italian blog.

===3 Corpus Creation and Annotation===

In this section we explain the scheme used for corpus creation and annotation.

====3.1 Corpus Creation====

We constructed the Hindi-English code-mixed corpus using tweets posted online since 2010. Tweets were scraped from Twitter using the Twitter Python API, which uses the advanced search option of Twitter. We mined the tweets using #irony, the keywords ‘irony’ and ‘ironic’, and various hashtags from politics, sports and entertainment. Tweets on the last three topics are mostly, though not exclusively, non-ironic. As is evident from example T3 in Section 1, a tweet containing irony keywords or hashtags is not necessarily ironic. We retrieved 119,885 tweets from Twitter in JSON format, which includes information such as timestamp, URL, text, user, re-tweets, replies, full name, id and likes. Extensive semi-automated processing was carried out to remove all noisy tweets. Noisy tweets are those that consist only of hashtags or URLs. Tweets in which a language other than Hindi or English is used were also considered noisy and removed from the corpus. Furthermore, all tweets written in pure English or pure Hindi were removed, keeping only the code-mixed tweets. As a result, a dataset of 3055 code-mixed tweets was created.
The newly created corpus and code are available online on GitHub (footnote 1).

Footnote 1: https://github.com/deepanshu1995/Irony-Detection-Hindi-English-Code-Mixed-

====3.2 Annotation====

Annotation of the corpus was carried out as follows:

Language at Word Level: For each word, a tag was assigned indicating its source language. Three kinds of tags, namely ‘eng’, ‘hin’ and ‘other’, were assigned to the words by bilingual speakers. The ‘eng’ tag was assigned to words that are present in the English vocabulary, such as “Amazing” and “Death”. The ‘hin’ tag was assigned to Hindi words such as “sapna” (Dream) and “hakikat” (Reality). The tag ‘other’ was given to symbols, emoticons, punctuation, named entities, acronyms, and URLs.

Ironic or Non-Ironic: An instance of annotation is illustrated in Figure 1. Each tweet is enclosed within tags. The first line of every annotation consists of the tweet id. Language tags are added before every token of the tweet, enclosed within tags. Each tweet is annotated with one of the two tags (Ironic or Non-Ironic). Irony is detected in 782 tweets. The remaining 2273 code-mixed tweets do not contain irony. The annotated dataset (consisting of tweet ids and annotated tags), together with the classification system, will be made available online later.

Fig. 1. Annotated instance.

====3.3 Inter Annotator Agreement====

Annotation of the dataset for the presence of irony was carried out by two human annotators with a linguistic background and proficiency in both Hindi and English. A sample annotation set of 50 tweets (25 ironic and 25 non-ironic), selected randomly from across the corpus, was provided to both annotators as a reference baseline for differentiating between ironic and non-ironic text.
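The word-level tags of Section 3.2 also give a compact way to state the code-mixed filter of Section 3.1. The Python sketch below is illustrative only. It assumes that a tweet counts as code-mixed when it contains at least one ‘eng’ token and at least one ‘hin’ token, which is one plausible reading of the semi-automated filtering and not a documented procedure. The function names and the sample tagging of a fragment of T3 are hypothetical.

```python
from collections import Counter

# The three word-level tags described in Section 3.2.
VALID_TAGS = {"eng", "hin", "other"}


def language_profile(tagged_tokens):
    """Count how many tokens carry each language tag."""
    counts = Counter()
    for token, tag in tagged_tokens:
        if tag not in VALID_TAGS:
            raise ValueError(f"unknown language tag {tag!r} for token {token!r}")
        counts[tag] += 1
    return counts


def is_code_mixed(tagged_tokens):
    """Assumed rule: at least one English and one Hindi token."""
    counts = language_profile(tagged_tokens)
    return counts["eng"] > 0 and counts["hin"] > 0


# Hypothetical tagging of the first words of T3, for illustration only.
t3_fragment = [
    ("mere", "hin"), ("single", "eng"), ("hone", "hin"),
    ("ke", "hin"), ("bawzood", "hin"),
]
pure_english = [("such", "eng"), ("irony", "eng"), ("!", "other")]

print(is_code_mixed(t3_fragment))   # True
print(is_code_mixed(pure_english))  # False
```

Tokens tagged ‘other’ are ignored by the rule, so a tweet made up of English words plus punctuation or URLs is still treated as monolingual.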
To validate the quality of the annotation, we calculated the inter-annotator agreement (IAA) for irony annotation between the two annotation sets of 3055 code-mixed tweets using Cohen’s Kappa coefficient. The Kappa score is 0.832, which indicates that the annotation quality is high and that the presented schema is effective.

===4 System Architecture===

In this section, we present our machine learning model for detecting irony in the code-mixed dataset described in the previous sections.

====4.1 Pre-processing====

Pre-processing of the code-mixed tweets is carried out as follows. All links and URLs are replaced with “URL”. Tweets often contain mentions directed at certain users; we replaced all such mentions with “USER”. All hashtags in the dataset are removed. All emoticons used in the tweets are first stored, to be used as a feature, and then replaced with “Emoticon”. All punctuation marks in a tweet are removed. Before removing them, however, we store the count of each punctuation mark, since we use these counts as one of the features in classification.

====4.2 Classification Features====

We used the following feature vectors to train our supervised machine learning model.

1. Character N-Grams: Character n-grams are language independent and have proven to be very efficient for classifying text. They are also useful when the text suffers from misspellings [10, 17, 20]. Groups of characters can help capture semantic meaning, especially in code-mixed language, where words are used informally and vary significantly from standard Hindi and English words. We use character n-grams as one of the features, where n varies from 1 to 3.

2. Word N-Grams: The bag-of-words feature is vital for capturing the content of the text. We therefore use word n-grams, where n varies from 1 to 3, as a feature to train our classification models.

3. Laugh Words and Emoticons: Instead of using many exclamation marks, internet users may use the sequence ‘lmao’ (i.e. laughing my ass off) or ‘lol’ (i.e. laughing out loud), or type ‘hahaha’. We therefore use a feature called laugh words, which is the sum of all the internet laughs, such as ‘haha’, ‘lol’, ‘lmao’, ‘rofl’, ‘lel’ and ‘hehehe’. We also use emoticons as a feature for irony detection, since they often represent textual portrayals of a writer’s emotion in the form of symbols. We took a list of Western emoticons from Wikipedia (footnote 2).

4. Punctuation: Users often use exclamation marks when they want to express strong feelings. We count the occurrences of each punctuation mark in a sentence and use the counts as a feature.

5. Intensifiers: Users often use intensifiers to lay emphasis on their feelings. A list of intensifiers was taken from Wikipedia. We count the number of intensifiers in a tweet and use the count as a feature.

6. Negation words: A list of negation words was taken from Christopher Potts’s sentiment tutorial (footnote 3). We count the number of negations in a tweet and use the count as a feature.

7. Structure: Ironic tweets in our dataset are often longer than other tweets. To capture this structure we use a group of features: (i) the number of characters in the tweet, (ii) the number of words in the tweet, and (iii) the average word length in the tweet.

{| class="wikitable"
|+ Table 1. F1 score for each feature using the SVM classifier.
! Features !! F1 Score
|-
| All Features || 0.77
|-
| Structural Features || 0.64
|-
| Char N-Grams || 0.77
|-
| Word N-Grams || 0.70
|-
| Laugh Words + Emoticons || 0.63
|-
| Punctuation Marks || 0.63
|-
| Intensifiers || 0.63
|-
| Negation Words || 0.63
|}

===5 Experiments and Results===

We performed experiments with two different classifiers, namely Support Vector Machines with a radial basis function kernel and a Random Forest classifier.
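Before either classifier can be trained, each tweet has to be turned into the features of Section 4.2. The following Python sketch shows one way the pre-processing steps of Section 4.1 and the n-gram and count-based features could be computed with the standard library alone. It is an assumption-laden illustration, not the released implementation: the regular expressions, the abbreviated emoticon, intensifier and negation lists, the choice to drop whole hashtags, and all function names are hypothetical stand-ins for the resources cited in the paper.

```python
import re
import string
from collections import Counter

# Hypothetical, abbreviated lexicons; the paper's lists come from Wikipedia
# and Christopher Potts's tutorial and are not reproduced here.
EMOTICONS = (":-)", ":-(", ":)", ":(", ":D", ";)", ":P")
INTENSIFIERS = {"very", "really", "so", "too", "bahut"}
NEGATIONS = {"not", "no", "never", "nahi", "nahin"}
LAUGH_RE = re.compile(r"(?:ha){2,}h?|(?:he){2,}h?|lol|lmao|rofl|lel")

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def preprocess(tweet):
    """Apply the Section 4.1 steps.

    Returns (clean_text, emoticon_count, punctuation_counts).
    """
    text = URL_RE.sub("URL", tweet)
    text = MENTION_RE.sub("USER", text)
    text = HASHTAG_RE.sub("", text)  # assumption: the whole hashtag is dropped
    emoticon_count = 0
    for emo in EMOTICONS:
        emoticon_count += text.count(emo)
        text = text.replace(emo, " Emoticon ")
    # Count punctuation marks before stripping them from the text.
    punct_counts = Counter(ch for ch in text if ch in string.punctuation)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split()), emoticon_count, punct_counts


def char_ngrams(text, n_min=1, n_max=3):
    """Counts of character n-grams for n in [n_min, n_max]."""
    return Counter(text[i:i + n]
                   for n in range(n_min, n_max + 1)
                   for i in range(len(text) - n + 1))


def word_ngrams(text, n_min=1, n_max=3):
    """Counts of lower-cased word n-grams for n in [n_min, n_max]."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n])
                   for n in range(n_min, n_max + 1)
                   for i in range(len(tokens) - n + 1))


def count_features(tweet):
    """Laugh-word, emoticon, punctuation, lexicon and structural counts."""
    text, emoticon_count, punct_counts = preprocess(tweet)
    tokens = text.lower().split()
    return {
        "laugh_words": sum(1 for t in tokens if LAUGH_RE.fullmatch(t)),
        "emoticons": emoticon_count,
        "exclamations": punct_counts["!"],
        "intensifiers": sum(1 for t in tokens if t in INTENSIFIERS),
        "negations": sum(1 for t in tokens if t in NEGATIONS),
        "n_chars": len(text),
        "n_words": len(tokens),
        "avg_word_len": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
    }


t1 = ("Wo ek teacher hai tab bhi life ke test mein fail ho gaya! "
      "Hahaha such irony :D")
print(count_features(t1))
```

In this sketch the “Emoticon” placeholder is counted as a word by the structural features, and only the exclamation-mark count is exposed; a fuller version would emit one feature per punctuation mark, as the text describes.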
Since the feature vectors formed are very large, we applied the chi-square feature selection algorithm, which reduces the size of our feature vector to 1400 (footnote 4). For training our classifiers, we used Scikit-learn [7]. In all the experiments, we carried out 10-fold cross-validation. Table 1 and Table 2 report the F1 score of each feature, along with the F1 score when all features are used, for the Support Vector Machine and the Random Forest classifier respectively. The Support Vector Machine performs better than the Random Forest classifier and gives the highest F1 score of 0.77 when all features are used. Character n-grams proved to be the most effective feature for the SVM, while word n-grams and character n-grams both resulted in the best F1 score for the Random Forest classifier.

{| class="wikitable"
|+ Table 2. F1 score for each feature using the Random Forest classifier.
! Features !! F1 Score
|-
| All Features || 0.72
|-
| Structural Features || 0.65
|-
| Char N-Grams || 0.72
|-
| Word N-Grams || 0.72
|-
| Laugh Words + Emoticons || 0.63
|-
| Punctuation Marks || 0.67
|-
| Intensifiers || 0.63
|-
| Negation Words || 0.63
|}

Footnote 2: https://en.wikipedia.org/wiki/List_of_emoticons

Footnote 3: http://sentiment.christopherpotts.net/lingstruc.html

===6 Conclusion and Future Work===

In this paper, we present an annotated corpus of Hindi-English code-mixed text, consisting of tweet ids and the corresponding annotations, which will be made freely available online later. We also present a supervised system for detecting irony in code-mixed text. The corpus consists of 3055 code-mixed tweets annotated as ironic or non-ironic. The features used in our classification system are character n-grams, word n-grams, emoticons, laugh words, punctuation, intensifiers and structural features. The best F1 score of 0.77 is achieved when all the features are incorporated in the feature vector and an SVM is used as the classifier.
As part of future work, the corpus can be annotated with part-of-speech tags at word level, which could yield better results. Moreover, the annotations and experiments described in this paper can also be carried out for code-mixed texts containing more than two languages from multilingual societies.

Footnote 4: The size of the feature vector was decided after empirical fine-tuning.

===References===

1. Aditya Joshi, Ameya Prabhu, Manish Shrivastava, and Vasudeva Varma: Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2482-2491. 2016.

2. Antonio Reyes, Paolo Rosso, and Tony Veale: A multidimensional approach for detecting irony in Twitter. Language Resources and Evaluation 47, no. 1 (2013): 239-268.

3. Arnav Sharma, Sakshi Gupta, Raveesh Motlani, Piyush Bansal, Manish Srivastava, Radhika Mamidi, and Dipti M. Sharma: Shallow parsing pipeline for Hindi-English code-mixed social media text. arXiv preprint arXiv:1604.03136 (2016).

4. Carol Myers-Scotton: Dueling Languages: Grammatical Structure in Code-Switching. Clarendon. (1993).

5. Elena Filatova: Irony and Sarcasm: Corpus Generation and Analysis Using Crowdsourcing. In LREC, pp. 392-398. 2012.

6. Erik Forslid and Niklas Wikén: Automatic irony- and sarcasm detection in Social media. (2015).

7. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, no. Oct (2011): 2825-2830.

8. Francesco Barbieri, Francesco Ronzano, and Horacio Saggion: Italian irony detection in Twitter: a first approach. In The First Italian Conference on Computational Linguistics CLiC-it, p. 28. 2014.

9. Gianti Andrea, Bosco Cristina, Bolioli Andrea, and Luigi Di Caro: Annotating irony in a novel Italian corpus for sentiment analysis. In 4th International Workshop on Corpora for Research on Emotion Sentiment & Social Signals ES 2012, pp. 1-7. ELRA, 2012.

10. Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins: Text classification using string kernels. Journal of Machine Learning Research 2, no. Feb (2002): 419-444.

11. Kalika Bali, Jatin Sharma, Monojit Choudhury, and Yogarshi Vyas: “I am borrowing ya mixing?” An Analysis of English-Hindi Code Mixing in Facebook. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 116-126. 2014.

12. Khyathi Chandu Raghavi, Manoj Kumar Chinnakotla, and Manish Shrivastava: Answer ka type kya he?: Learning to classify questions in code-mixed language. In Proceedings of the 24th International Conference on World Wide Web, pp. 853-858. ACM, 2015.

13. Luisa Duran: Toward a better understanding of code switching and interlanguage in bilinguality: Implications for bilingual instruction. The Journal of Educational Issues of Language Minority Students 14, no. 2 (1994): 69-88.

14. Marjolein Gysels: French in urban Lubumbashi Swahili: Codeswitching, borrowing, or both? Journal of Multilingual & Multicultural Development 13, no. 1-2 (1992): 41-55.

15. Pieter Muysken: Bilingual speech: A typology of code-mixing. Vol. 11. Cambridge University Press, 2000.

16. Souvick Ghosh, Satanu Ghosh, and Dipankar Das: Sentiment Identification in Code-Mixed Social Media Text. arXiv preprint arXiv:1707.01184 (2017).

17. Stephen Huffman: Acquaintance: Language-independent document categorization by n-grams. Department of Defense, Fort George G. Meade, MD, 1995.

18. Tony Veale and Yanfen Hao: Detecting Ironic Intent in Creative Comparisons. In ECAI, vol. 215, pp. 765-770. 2010.

19. Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster: Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 13-23. 2014.

20. William B. Cavnar and John M. Trenkle: N-gram-based text categorization. Ann Arbor, MI 48113, no. 2 (1994): 161-175.

21. Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury: POS tagging of English-Hindi code-mixed social media content. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 974-979. 2014.