Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English

BalaSundaraRaman Lakshmanan (a), Sanjeeth Kumar Ravindranath (b)
(a) DataWeave, Bengaluru, India
(b) Exotel, Bengaluru, India

Abstract
Theedhum Nandrum is a sentiment polarity detection system using two approaches: a Stochastic Gradient Descent (SGD) based classifier and a Long Short-Term Memory (LSTM) based classifier. Our approach utilises language features such as the use of emoji, the choice of script and code mixing, which appeared quite marked in the datasets specified for the Dravidian Codemix - FIRE 2020 task. The hyperparameters for the SGD classifier were tuned using GridSearchCV. Our system was ranked 4th in Tamil-English with a weighted average F1 score of 0.62 and 9th in Malayalam-English with a score of 0.65. After the task deadline, we achieved a weighted average F1 score of 0.77 for Tamil-English using a Logistic Regression based model; this performance betters the top-ranked classifier on this dataset by a wide margin. Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a novel application of Soundex. Our complete code is published on GitHub at https://github.com/oligoglot/theedhum-nandrum.

Keywords
Sentiment polarity, Code mixing, Tamil, Malayalam, English, SGD, LSTM, Logistic Regression

FIRE 2020: Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India
email: quasilinguist@gmail.com (B. Lakshmanan); sanjeeth@gmail.com (S.K. Ravindranath)
orcid: 0000-0001-5818-4330 (B. Lakshmanan); 0000-0002-3799-4411 (S.K. Ravindranath)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Dravidian languages are spoken by 227 million people in south India and elsewhere. To improve the production of, and access to, information in user-generated content in Dravidian languages, a shared task was organised [1, 2]. Theedhum Nandrum 1 was developed in response to the Dravidian-CodeMix sentiment classification task collocated with FIRE 2020. We were supplied with manually labelled training data from the datasets described in TamilMixSentiment [3] and MalayalamMixSentiment [4]. The datasets consisted of 11,335 training, 1,260 validation and 3,149 test records for Tamil-English code-mixed data, and 4,851 training, 541 validation and 1,348 test records for Malayalam-English code-mixed data.

The comments in the datasets exhibited inter-sentential switching, intra-sentential switching and tag switching [5, 6]. Even though Tamil and Malayalam have their own native scripts [7], most comments were written in the Roman script owing to the ease of access to English keyboards [8]. The comments often mixed Tamil or Malayalam lexicon with an English-like syntax, or vice versa. Some comments were written in native scripts but with intervening English expressions. Even though these languages are spoken by millions of people, they are still under-resourced, and there are not many datasets available for code-mixed Dravidian languages [9, 10, 11].

1 Theedhum Nandrum is a phrase from the 1st century BCE Tamil literary work Puṟanānūṟu. Meaning "the good and the bad", it is part of the oft-quoted lines "All towns are ours. Everyone is our kin. Evil and goodness do not come to us from others." written by Kaṉiyan Pūngunṟanār.
2. Method

Given the particular register of language used in YouTube comments, and the fact that most of the comments used the Roman alphabet to write Tamil and Malayalam text without following any canonical spelling, we understood that pre-processing and the choice of features mattered more than the specifics of the machine learning model to be used. This was evident from the benchmarking results on the gold dataset in TamilMixSentiment [3]. We used core libraries such as Keras 2 and scikit-learn 3 for the classifiers.

2 https://github.com/fchollet/keras
3 https://github.com/scikit-learn/scikit-learn

2.1. Pre-processing

We normalised the text using the Indic NLP Library 4 to canonicalise the multiple ways of writing the same phoneme in Unicode. We also attempted spelling normalisation by doing a brute-force transliteration from Roman to Tamil or Malayalam, followed by a dictionary lookup using a SymSpell-based spell checker built on a large corpus 5. However, we did not have much success in finding dictionary matches up to edit distance 2, the highest supported value. We then chose to use an Indian-language-specific Soundex as a feature to harmonise the various spellings, with some success, as described in 2.2.2. Words from multiple corpora indexed by their Soundex values could be used to obtain canonical spellings where there is long-range variation, and an edit distance allowance could be combined with Soundex equivalence during dictionary lookup. The potential utility of such a method is supported by the characterisation of the text of these datasets in [12].

4 https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf
5 https://github.com/indicnlp/solthiruthi-sothanaikal

2.2. Feature Generation

2.2.1. Emoji Sentiment

We noticed that a key predictor of the overall sentiment of a comment was the set of emoji used. Based on this observation, we extracted the emoji from the text and used Sentimoji [13] to assign a sentiment (positive, negative or neutral) to each emoji. However, the list of emoji available in Sentimoji did not cover a majority of the emoji found in our datasets. For the missing emoji, we used the sentiment labels in the training data to compute a sentiment polarity based on the frequency of use in each class. We used both the raw emoji and their inferred sentiment as features.

2.2.2. Soundex

As mentioned in 2.1, we used Soundex to harmonise the numerous spelling variants of the same word when expressed in the Roman alphabet. For example, the Tamil word நன்றி is written as both nanri and nandri in the corpus. The standard Soundex algorithm for English did not approximate Tamil and Malayalam words well; we found libindic-soundex 6 to perform very well. Soundex has been employed in spoken document classification [14, 15], where it helps in learning over transcription errors. Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a first of its kind.

The specificity improves when the input text is in Tamil or Malayalam script rather than in the Roman alphabet. For example, variant Tamil-script spellings of அருமை share the Soundex value அPCND000, while arumai in the Roman alphabet gets the English Soundex a65. Hence, we used indictrans [16] to transliterate comments into the native scripts before feeding the text to the Soundex generator function, which gave improved matches.

6 https://github.com/libindic/soundex
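To make the feature concrete, here is a minimal sketch of the Soundex feature extraction, assuming the indictrans and libindic-soundex packages linked above; the helper name soundex_feature and the overall wiring are ours, not the exact code of the submitted system.

```python
# Minimal sketch of the Soundex feature described in 2.2.2, assuming the
# indictrans and libindic-soundex packages linked in the footnotes.
# Import paths can vary between package versions.
from indictrans import Transliterator
from libindic.soundex import Soundex

# Romanised comments are transliterated to the native script first,
# since Soundex codes are more specific for native-script input.
roman_to_tamil = Transliterator(source='eng', target='tam', build_lookup=True)
soundexer = Soundex()

def soundex_feature(comment):
    """Return one Soundex code per token, harmonising spelling variants."""
    native = roman_to_tamil.transform(comment)   # e.g. 'nandri' -> Tamil script
    return [soundexer.soundex(token) for token in native.split()]

# Variant Romanised spellings should ideally collapse to the same codes.
print(soundex_feature('nanri'), soundex_feature('nandri'))
```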
2.2.3. Language Tag

Comments were not all in the expected language of the dataset. Some were in other languages, written either in their native scripts or in the Roman alphabet. The classifier was expected to label such comments not-Tamil or not-Malayalam, as the case may be. To support that, as well as to ensure that the features specific to a language are aligned well, the language predicted by the Google Translation API 7 was added as a feature. Tagging the parts of a code-mixed comment with their respective languages should improve the classification accuracy further.

7 https://cloud.google.com/translate/docs/reference/rest/v3/projects/detectLanguage

2.2.4. Word Vector

We tokenised the text based on separators but retained most other characters so as not to drop any non-word signals. We also added word n-grams up to length 4 as features.

2.2.5. Document Length Range

We bucketed the document length into 21 ranges (1-10, 11-20, ..., >200) and used the range as a feature. This improved the performance.

2.3. Classifiers

The task required us to classify the comments into 5 classes: mixed_feelings, negative, positive, not-tamil/not-malayalam and unknown_state. After evaluating various linear models, we picked SGD as the best-performing algorithm for the data at hand with the features we had at the time of benchmarking. Additionally, we trained an LSTM-based classifier [17], which did not perform as well as the linear classifier. A combined approach may perform better in the face of text mixed with multi-modal noise [18].

2.3.1. Stochastic Gradient Descent (SGD) Classifier

Based on parameter tuning, we arrived at the following configuration, which gave the best performance in trials on the training dataset: an SGD classifier [19] with modified Huber loss and a learning rate of 0.0001. Different weights were applied to the features for Tamil and Malayalam.

Table 1
Theedhum Nandrum performance. W = weighted average

Language    Dataset             W-Precision   W-Recall   W-F1 Score
Tamil       Validation (SGD)    0.74          0.65       0.68
            Validation (LSTM)   0.46          0.68       0.55
            Test                0.64          0.67       0.62
Malayalam   Validation (SGD)    0.73          0.64       0.67
            Validation (LSTM)   0.17          0.41       0.24
            Test                0.67          0.66       0.65

2.3.2. Long Short-Term Memory (LSTM)

A 4-layer sequential model was trained, with Embedding, SpatialDropout, LSTM and a densely connected neural network as the layers. Softmax was used in the last layer to generate a probability distribution over all classes. We used categorical cross-entropy loss and the Adam optimiser with a learning rate of 0.0001. Learning appeared to peak at 15 epochs for Tamil and 10 for Malayalam. As the results in Table 1 show, this model performed worse than the SGD classifier. We identified considerable overfitting caused by the class imbalance in the relatively small training dataset. A pre-trained embedding combined with transfer learning could improve the performance [20].

2.4. Parameter Tuning

Tuning and optimisation of the SGD model was performed using grid-based hyperparameter tuning. Since a FeatureUnion of transformers was used with a Stochastic Gradient Descent classifier, two types of parameters were optimised:

1. Parameters of the classifier
2. Weights of the transformers in the FeatureUnion

For the classifier, the main parameters optimised were the loss function and the regularisation term (penalty). Tuning was also performed on the weights of the transformers in the FeatureUnion. The features used by the model are described in 2.2. We observed that though the features used for classification were common to both Tamil and Malayalam documents, the classification accuracy improved with different feature weights for Tamil and Malayalam. For example, a higher weight for the document length range (2.2.5) improved results for Malayalam. A sketch of this tuning setup follows.
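The following is a minimal sketch of such a setup in scikit-learn, combining the SGD configuration from 2.3.1 with a grid over the two parameter types listed above. The transformer names, the grid values and the reuse of the soundex_feature helper from the earlier sketch are illustrative assumptions rather than the submitted configuration.

```python
# Minimal sketch of grid-based tuning over a FeatureUnion + SGD pipeline,
# assuming scikit-learn; names and grid values are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

features = FeatureUnion([
    # Word n-grams up to length 4, as in 2.2.4.
    ('word_ngrams', TfidfVectorizer(ngram_range=(1, 4))),
    # Soundex codes per token, reusing the soundex_feature sketch from 2.2.2.
    ('soundex', TfidfVectorizer(
        preprocessor=lambda text: ' '.join(soundex_feature(text)))),
])

pipeline = Pipeline([
    ('features', features),
    # Modified Huber loss with a constant learning rate of 0.0001, per 2.3.1.
    ('clf', SGDClassifier(loss='modified_huber',
                          learning_rate='constant', eta0=0.0001)),
])

param_grid = {
    # 1. Parameters of the classifier: loss and regularisation term.
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1', 'elasticnet'],
    # 2. Weights of the transformers in the FeatureUnion.
    'features__transformer_weights': [
        {'word_ngrams': 1.0, 'soundex': 0.5},
        {'word_ngrams': 0.5, 'soundex': 1.0},
    ],
}

search = GridSearchCV(pipeline, param_grid, scoring='f1_weighted', cv=3)
# search.fit(train_texts, train_labels)
```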
3. Results

We tuned our SGD and LSTM classifiers using the available training data against the validation sets, and then classified the unlabelled test data using the optimised classifiers. We submitted the output of three of our best-performing configurations across the LSTM and SGD classifiers. The test-set results were computed by the task organisers, who picked the best of the three submitted classifications. The combined results are tabulated in Table 1.

These results are better than the benchmarks reported in TamilMixSentiment [3]. Theedhum Nandrum was ranked 4th in the Dravidian-CodeMix task competition for Tamil-English; its weighted average F1 score was only 0.03 below that of the top-ranked team, SRJ 8. With a weighted average F1 score of 0.65, Theedhum Nandrum was ranked 9th in the Dravidian-CodeMix task competition for Malayalam-English 9.

8 https://dravidian-codemix.github.io/2020/Dravidian-Codemix-Tamil.pdf
9 https://dravidian-codemix.github.io/2020/Dravidian-Codemix-Malayalam.pdf

After the task deadline, we benchmarked other linear models with the full set of features above. Logistic Regression performed much better, giving weighted average F1 scores of 0.77 for Tamil and 0.69 for Malayalam with the parameters C=0.01, penalty='l2', solver='newton-cg', as shown in Table 2. Since we had picked SGD based on benchmarking performed before we added the Soundex feature, we had overlooked this better-performing configuration. The performance of this classifier even exceeds that of the top-ranked classifier for the Tamil-English dataset by a wide margin.

Table 2
Theedhum Nandrum Logistic Regression model performance. W = weighted average

Language    Dataset      W-Precision   W-Recall   W-F1 Score
Tamil       Validation   0.91          0.70       0.78
            Test         0.91          0.68       0.77
Malayalam   Validation   0.80          0.54       0.61
            Test         0.73          0.67       0.69
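As a reference point, a minimal sketch of this post-deadline configuration in scikit-learn follows; it assumes the same illustrative feature union as the earlier tuning sketch, and the surrounding names are ours.

```python
# Minimal sketch of the post-deadline Logistic Regression configuration
# (C=0.01, penalty='l2', solver='newton-cg'), assuming scikit-learn and
# the illustrative 'features' FeatureUnion from the tuning sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

lr_pipeline = Pipeline([
    ('features', features),
    ('clf', LogisticRegression(C=0.01, penalty='l2', solver='newton-cg')),
])

# lr_pipeline.fit(train_texts, train_labels)
# print(classification_report(val_labels, lr_pipeline.predict(val_texts)))
```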
4. Conclusion

Theedhum Nandrum demonstrates that SGD- and Logistic Regression-based models, leveraging spelling harmonisation through language-specific Soundex features for code-mixed text, perform well on the code-mixed datasets specified for the Dravidian Codemix - FIRE 2020 task. Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a first of its kind. In addition, emoji are a useful feature in sentiment prediction over YouTube comments. Future work is required to validate the usefulness of spelling correction using a combination of edit distance and Soundex.

Acknowledgments

The authors are grateful for the contributions of Ishwar Sridharan to the code base. In particular, the idea of using emoji as a feature is his. We also thank Shwet Kamal Mishra for his inputs relating to LSTM. The logo for the Theedhum Nandrum software was designed by Tharique Azeez, signifying the duality of good and bad through the yin-yang metaphor. Theedhum Nandrum also benefited from the open source contributions of several people.

References

[1] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.
[2] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE '20, 2020.
[3] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 202-210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[4] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 177-184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
[5] P. Ranjan, B. Raja, R. Priyadharshini, R. C. Balabantaray, A comparative study on code-mixed data of Indian social media vs formal text, in: 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), 2016, pp. 608-611.
[6] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Improving wordnets for under-resourced languages using machine translation, in: Proceedings of the 9th Global WordNet Conference (GWC 2018), 2018, p. 78.
[7] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Wordnet gloss translation for under-resourced languages using multilingual neural machine translation, in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, 2019, pp. 1-7.
[8] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Comparison of different orthographies for machine translation of under-resourced Dravidian languages, in: 2nd Conference on Language, Data and Knowledge (LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
[9] N. Jose, B. R. Chakravarthi, S. Suryawanshi, E. Sherly, J. P. McCrae, A survey of current datasets for code-switching research, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 136-141.
[10] R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, J. P. McCrae, Named entity recognition for code-mixed Indian corpus using meta embedding, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 68-72.
[11] B. R. Chakravarthi, P. Rani, M. Arcan, J. P. McCrae, A survey of orthographic information in machine translation, arXiv e-prints (2020) arXiv:2008.
[12] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020. URL: http://hdl.handle.net/10379/16100.
[13] P. K. Novak, J. Smailović, B. Sluban, I. Mozetič, Sentiment of emojis, PLOS ONE 10 (2015) e0144296. URL: https://doi.org/10.1371/journal.pone.0144296. doi:10.1371/journal.pone.0144296.
[14] P. Dai, U. Iurgel, G. Rigoll, A novel feature combination approach for spoken document classification with support vector machines, in: Proc. Multimedia Information Retrieval Workshop, Citeseer, 2003, pp. 1-5.
[15] M. A. Reyes-Barragán, L. Villaseñor-Pineda, M. Montes-y-Gómez, A Soundex-based approach for spoken document retrieval, in: Mexican International Conference on Artificial Intelligence, Springer, 2008, pp. 204-211.
[16] I. A. Bhat, V. Mujadia, A. Tammewar, R. A. Bhat, M. Shrivastava, IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search, in: Proceedings of the Forum for Information Retrieval Evaluation, FIRE '14, ACM, New York, NY, USA, 2015, pp. 48-53. URL: http://doi.acm.org/10.1145/2824864.2824872. doi:10.1145/2824864.2824872.
[17] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735-1780. URL: https://doi.org/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
[18] P. Agrawal, A. Suri, NELEC at SemEval-2019 task 3: Think twice before going deep, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 266-271. URL: https://www.aclweb.org/anthology/S19-2045. doi:10.18653/v1/S19-2045.
[19] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Twenty-first International Conference on Machine Learning - ICML '04, ACM Press, 2004. URL: https://doi.org/10.1145/1015330.1015332. doi:10.1145/1015330.1015332.
[20] P. Liu, W. Li, L. Zou, NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 87-91. URL: https://www.aclweb.org/anthology/S19-2011. doi:10.18653/v1/S19-2011.