POS Tagging Model for Malay Tweets Using New POS Tagset and BiLSTM-CRF Approach

Sabrina Tiun 1, Siti Noor Allia Noor Ariffin 1 and Yee Dhong Chew 2

1 ASLAN Lab, Center of Artificial Intelligence, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 45600 Bangi, Selangor
2 NETS Solutions PTE LTD, 19th Floor, Wisma Lee Rubber, 50100 Kuala Lumpur

Abstract
This paper proposes a Malay Part-of-Speech (POS) tagger using a new set of POS tags and a deep learning classifier. A new set of POS tags was proposed, and the POS classifier was built using a deep learning model, a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF). We expanded the POS tagset by considering the informal Malay terms frequently used in Malay tweet text. Additionally, for the BiLSTM-CRF model, we used a combined embedding: a Word2Vec model trained on the Malay Wikipedia and a Word2Vec model trained on a POS-annotated Malay Twitter corpus. The BiLSTM-CRF Malay POS tagger was then compared to traditional classifier models. Based on the evaluation results, the BiLSTM-CRF model performed the best. We therefore conclude that deep learning techniques combined with appropriate embeddings are capable of performing fine-grained POS tagging on Malay tweet text, and that using a POS tagset customized for a specific type of text increases POS label coverage.

Keywords
Part-of-speech tagging, Malay POS tagger, Malay tweets, BiLSTM-CRF

The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP), June 7-8, 2022, Koper, Slovenia
EMAIL: sabrinatiun@ukm.edu.my (Sabrina Tiun); p102054@siswa.ukm.edu.my (Siti Noor Allia Noor Ariffin); A170527@siswa.ukm.edu.my (Yee Dhong Chew)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The Part-of-Speech (POS) classification method organizes words into groups based on how they are used and how they function in sentences. A POS tagger, in turn, is a software component that reads text in a given language and assigns an appropriate POS tag to each word (or token) in the text. Malay has four primary POS classes: nouns, verbs, adjectives, and function words [1]. However, these primary POS tags are better suited to tagging formal Malay than informal Malay material such as Malay Twitter data. Informal Malay contains a wide range of informal terminology, including accent (or dialect) words, slang, titles (e.g., hang, mek), noise words (words intended to describe sounds such as laughter, cat purring, and knocking), and mixed languages (commonly a mix of Malay and English). Additionally, Malaysians, particularly adolescents, are highly inventive when it comes to writing and coining new words [5]. As a result, Malay social media text, such as Malay tweets, is dense with colloquial Malay and peppered with mixed-language phrases and derogatory terms. Consequently, POS tagging a Malay tweet is extremely complicated, time-consuming, and labour-intensive. Obtaining a POS tag to label these informal terms is one of the challenges. Some researchers either ignore such terms (by failing to label them) or repurpose the existing POS tagset. Both approaches reduce the accuracy of POS tagging on tweet text and cause POS tags to be mislabeled. To address this issue, we propose a new set of POS tags customized for Malay social media text, specifically tweets.
Furthermore, the POS tagging problem depends heavily on the preceding and following word sequences. Applying a neural network algorithm such as a bidirectional long short-term memory (BiLSTM) model to POS tagging is therefore highly beneficial: the architecture of the BiLSTM model, which considers both the preceding and the following context, increases the likelihood of predicting the POS tag of a word in a Malay tweet with high accuracy. Thus, to develop a high-performance POS tagger for Malay tweets, we propose a new POS tagset and a BiLSTM with a Conditional Random Field (CRF) layer, i.e., a BiLSTM-CRF model.

2. Research Method

Our work in this paper consists of two parts: (1) proposing a finer-grained POS tagset that covers nearly all types of words in tweet text, and (2) building a POS tag classifier model based on the BiLSTM-CRF model. The following subsections describe our work in detail.

2.1. Malay POS Tagset for Tweet Text

Malay has four primary POS classes, but these are insufficient for categorizing the words in the Malay Twitter corpus. We therefore created new Malay POS tags based on Malay grammatical rules and the grammar of Safiah et al. [1]. Additionally, we adapted the newly created POS tags to the characteristics of the Malay Twitter data by comparing them to the word classes identified by Safiah et al. [1] and Othman and Karim [2], as well as to prior findings on social media texts [3][4]. Furthermore, we found that several of the POS tags newly developed by Le et al. [3] are suitable for categorizing words in Malay tweets, for instance the FOR and NEG tags. The FOR tag is used to categorize foreign-language words found in the study corpus, while the NEG tag identifies words with negative connotations, such as swear words. We therefore adopted these two POS tags from Le et al. [3]. The FOR and NEG tags are applicable to Malay tweets because Malaysians like to combine words from multiple languages in tweets and to use negative words to express emotions or disapproval of situations. In addition, removing foreign-language words from the corpus changes the author's intended meaning as well as the writing structure of the tweets, and such changes would introduce errors when automatic annotation tools or human annotators attempt to assign POS tags. Following this principle, we do not exclude foreign-language terms, slang terms, or informal Malay expressions; instead, dedicated POS tags designed for these word types are used. Thus, we combined the POS tagsets of Safiah et al. [1], Othman and Karim [2], Le et al. [3], and Ariffin and Tiun [4] with our 18 newly proposed POS tags. The resulting set contains 45 POS tags; the full tagset is given in Appendix A.

2.2. BiLSTM-CRF Model for Malay POS Tagging

For the POS tag classifier, we constructed our Malay POS tagging model based on BiLSTM-CRF. The model was developed in three phases: (1) data preparation, (2) building/training the BiLSTM-CRF Malay POS model, and (3) model evaluation. In the data preparation phase, 30% of the total data (538 tweets) was set aside as test data. To train the BiLSTM-CRF Malay POS model, three kinds of layers were built: the embedding layer, the BiLSTM layer, and the CRF layer (see Figure 1).

Figure 1: The BiLSTM-CRF Malay POS model (adapted from Zhang et al. [12])
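To make the three-layer architecture in Figure 1 concrete, the following is a minimal sketch of an embedding, BiLSTM, and CRF stack. The paper does not state an implementation framework, so PyTorch and the third-party pytorch-crf package are assumptions, the layer sizes are illustrative, and the 600-dimensional input assumes that the two 300-dimensional Word2Vec embeddings described below are concatenated, which is only one plausible reading of the "combined embedding".

    # Illustrative sketch only; not the authors' implementation.
    # Assumes PyTorch and the third-party "pytorch-crf" package.
    import torch.nn as nn
    from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

    class BiLSTMCRFTagger(nn.Module):
        def __init__(self, vocab_size, num_tags, emb_dim=600, hidden_dim=256):
            super().__init__()
            # Embedding layer: in the paper, pretrained Word2Vec vectors are used
            # (two 300-dimensional embeddings combined); randomly initialised here.
            self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            # BiLSTM layer: extracts contextual word features in both directions.
            self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True,
                                  batch_first=True)
            # Linear projection from BiLSTM features to per-token tag scores.
            self.hidden2tag = nn.Linear(hidden_dim, num_tags)
            # CRF layer: models tag-to-tag transitions and decodes tag sequences.
            self.crf = CRF(num_tags, batch_first=True)

        def loss(self, token_ids, tags, mask):
            emissions = self.hidden2tag(self.bilstm(self.embedding(token_ids))[0])
            return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

        def predict(self, token_ids, mask):
            emissions = self.hidden2tag(self.bilstm(self.embedding(token_ids))[0])
            return self.crf.decode(emissions, mask=mask)  # best tag sequence per sentence

In such a sketch, predict() returns the Viterbi-best POS tag sequence for each sentence, while loss() supplies the training objective.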
At the embedding layer, a combination of embeddings was used: the Word2Vec embeddings of the Malay Wikipedia [11] and the Word2Vec embeddings of the POS-annotated Malay tweet corpus. Both embeddings were trained with 300 dimensions. The BiLSTM layer acts as a word feature extractor: it captures contextual information and produces tag scores for each input token. Finally, the CRF layer assigns the POS tag of each word. The CRF score function is used to find the POS sequence with the highest score and to compute the probability distribution over all POS sequences. The CRF layer benefits from modelling which tag should follow which tag; such transition information is difficult for the neural network layers to learn, especially on small datasets [6]. With the help of the CRF layer, however, even a model trained on a small dataset can classify well.
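The score function is not written out in the paper; for exposition, the standard linear-chain CRF formulation used by BiLSTM-CRF taggers such as Zhang et al. [12] is, in LaTeX notation,

    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},
    \qquad
    p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{y'} \exp\big(s(X, y')\big)},

where P_{i, y_i} is the BiLSTM emission score of tag y_i for the i-th word, A is the learned tag-transition matrix (with y_0 and y_{n+1} as start and end states), and the predicted sequence y^{*} = \arg\max_{y} s(X, y) is found with the Viterbi algorithm.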
Using the prepared dataset, we evaluated our BiLSTM-CRF model. In evaluating the model, an evaluation similar to that of [12], based on training loss and validation loss, was used to ensure that our BiLSTM-CRF model was neither underfitted nor overfitted. As Figure 2 shows, the small gap between the training-loss and validation-loss learning curves indicates that the model was neither underfitted nor overfitted.

Figure 2: Learning curves of the BiLSTM-CRF Malay POS model

Afterwards, we ran several training runs over a set of hyperparameters. The best hyperparameters for our BiLSTM-CRF Malay POS model were: the Adam learning algorithm, a dropout rate of 0, a batch size of 6, and 30 epochs. With this setting, our BiLSTM-CRF model achieved 94% precision, recall, and F1-score.

3. Result and discussion

To evaluate our POS model, we compared it against four well-known traditional classifier models: Support Vector Machine (SVM) [7], Naïve Bayes (NB) [9], Decision Tree (DT) [8], and K-nearest neighbour (KNN) [10]. All of the traditional models were trained on word-position features. Ten types of features were extracted, among them: the preceding and following words; the prefix of each word (limited to the first three characters); the suffix of each word (limited to the last three characters); the length of the word; and the presence of a digit in the word. To train and evaluate the traditional classifier models, the same Malay tweet dataset used to train and evaluate the BiLSTM-CRF model was used, divided into 70% (1,253 tweets) as training data and 30% (538 tweets) as test data, with 10-fold cross-validation.
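As an illustration of these word-position features, the sketch below maps each token of a tweet to a feature dictionary that a traditional classifier can consume. The exact feature set and toolkit used in the paper are not specified, so the feature names, the boundary markers, the example tweet, and the scikit-learn pipeline are assumptions made for illustration only.

    # Illustrative sketch of word-position features; scikit-learn is assumed.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def word_features(tokens, i):
        """Build a feature dict for the i-th token of a tokenized tweet."""
        word = tokens[i]
        return {
            "word": word.lower(),
            "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",              # preceding word
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",  # following word
            "prefix3": word[:3],                           # first three characters
            "suffix3": word[-3:],                          # last three characters
            "length": len(word),                           # word length
            "has_digit": any(ch.isdigit() for ch in word), # presence of a digit
        }

    # Hypothetical tokenized tweet, using words from the tagset examples in Appendix A.
    tweet = ["hang", "pi", "sini"]
    X = [word_features(tweet, i) for i in range(len(tweet))]

    # One word-level classifier in the spirit of the SVM baseline (illustrative only).
    svm_tagger = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
    # svm_tagger.fit(X, y)  # y would be the gold POS tag of each token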
To compare our BiLSTM-CRF model against the traditional classifiers, the evaluation metrics of precision, recall, and F1-score were used. The findings are shown in Table 1.

Table 1
Result of the BiLSTM-CRF classifier against traditional classifiers on tagging Malay POS

Classifier    Precision  Recall  F1-score
SVM           0.93       0.91    0.92
NB            0.74       0.57    0.60
DT            0.89       0.87    0.89
KNN           0.85       0.88    0.85
BiLSTM-CRF    0.94       0.94    0.94

Although traditional and deep learning classifiers have distinct architectures, evaluating them on the same dataset (see Table 1) gives a general view of the Malay POS BiLSTM-CRF model's performance. The SVM classifier achieved an F1-score of 92% compared to 94% for BiLSTM-CRF, a small difference of 2 percentage points. The results in Table 1 could probably be improved by providing the SVM with more informative features, or the BiLSTM-CRF model would perform better with a larger dataset. However, because BiLSTM-CRF has been shown to perform better with large datasets, strengthening the BiLSTM-CRF model will be the preferable direction for future study.

4. Conclusion

In conclusion, by creating new POS tags tailored to the words found in Malay tweets, we ensure that fewer words are left without a POS label. Training the BiLSTM-CRF model on a combination of task-specific embeddings helps to increase POS tagging performance on tweet text. In other words, our proposed BiLSTM-CRF Malay POS tagging model and the new POS tagset are suitable for tagging POS in Malay tweet text.

5. Acknowledgements

This project is supported by the Malaysia Ministry of Higher Education under research code FRGS/1/2020/ICT02/UKM/02/1.

6. References

[1] K. N. Safiah, F. M. Onn, H. H. Musa, A. H. Mahmood, Tatabahasa Dewan Edisi Ketiga, Dewan Bahasa dan Pustaka, Kuala Lumpur, 2010.
[2] A. Othman, N. S. Karim, Kamus Komprehensif Bahasa Melayu, Penerbit Fajar Bakti, Kuala Lumpur, 2005.
[3] T. A. Le, D. Moeljadi, Y. Miura, T. Ohkuma, Sentiment Analysis for Low Resource Languages: A Study on Informal Indonesian Tweets, in: Proceedings of the 12th Workshop on Asian Language Resources, Osaka, Japan, 2016, pp. 123-131.
[4] S. N. A. N. Ariffin, S. Tiun, Part-of-Speech Tagger for Malay Social Media Texts, GEMA Online Journal of Language Studies, 18 (2018) 124-142.
[5] N. Jamali, Fenomena Penggunaan Bahasa Slanga dalam Kalangan Remaja Felda di Gugusan Felda Taib Andak: Suatu Tinjauan Sosiolinguistik, Jurnal Wacana Sarjana, 2 (2018) 1-1.
[6] L. März, D. Trautmann, B. Roth, Domain Adaptation for Part-of-Speech Tagging of Noisy User-Generated Text, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, 2019.
[7] T. Nakagawa, T. Kudoh, Y. Matsumoto, Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines, in: Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium, 2001, pp. 325-331.
[8] L. Màrquez, H. Rodríguez, Part-of-speech tagging using decision trees, in: Proceedings of the European Conference on Machine Learning, 1998, pp. 25-36.
[9] R. Creţulescu, A. David, D. Morariu, L. Vinţan, Part of speech tagging with Naïve Bayes methods, in: Proceedings of the 18th International Conference on System Theory, Control and Computing (ICSTCC), 2014, pp. 446-451.
[10] S. N. Gaber, A. M. Z. Nazri, N. Omar, S. Abdullah, Part-of-Speech (POS) Tagger for Malay Language using Naïve Bayes and K-Nearest Neighbor Model, International Journal of Psychosocial Rehabilitation, 24 (2020) 5468-5476.
[11] Malaya toolkit, Natural Language Toolkit library for Bahasa Melayu, 2020. URL: https://malaya.readthedocs.io/en/4.0/load-wordvector.html
[12] Y. Zhang, X. Wang, Z. Hou, J. Li, Clinical named entity recognition from Chinese electronic health records via machine learning methods, JMIR Medical Informatics, 6 (2018) e50.

Appendix A

The proposed POS tagset for informal Malay text, which combines the POS tagsets of Safiah et al. [1], Othman and Karim [2], Le et al. [3], and Ariffin and Tiun [4] with the newly proposed POS tags. The POS tags in bold are the newly proposed ones.

POS tag    Description
KN         Noun (e.g. kereta 'car')
KN-LD      Noun, dialect (e.g. gerek 'bike')
KN-KEP     Noun, abbreviation (e.g. keta 'car')
GT         Pronoun preposition (e.g. sini 'here')
GT-KEP     Pronoun preposition, abbreviation (e.g. ni 'here')
GDT-KTY    Pronoun, question (e.g. siapa 'who')
GN1        1st person personal pronoun (e.g. saya 'me')
GN1-LD     1st person personal pronoun, dialect (e.g. cheq 'me')
GN2        2nd person personal pronoun (e.g. awak 'you')
GN2-LD     2nd person personal pronoun, dialect (e.g. hang 'you')
GN3        3rd person personal pronoun (e.g. mereka 'they')
GN3-LD     3rd person personal pronoun, dialect (e.g. depa 'they')
KK         Verb (e.g. lari 'run')
KA         Adjective (e.g. dekat 'near')
KA-KEP     Adjective, abbreviation (e.g. kat 'near')
KH         Conjunction (e.g. dan 'and')
KH-KEP     Conjunction, abbreviation (e.g. tapi 'but')
KSR        Interjection (e.g. wah)
KTY        Question word (e.g. bila 'when')
KPE        Command word (e.g. sila 'please')
KB         Auxiliary verb (e.g. akan 'will')
KB-KEP     Auxiliary verb, abbreviation (e.g. dah)
KP         Intensifier, kata penguat (e.g. sangat 'very', paling 'most')
KPN        Emphatic word, kata penegas (e.g. juga 'too')
KNF        Negation word (e.g. tidak 'no')
KNF-KEP    Negation word, abbreviation (e.g. tak 'no')
KPM        Copula (e.g. ialah 'is')
KS         Noun preposition (e.g. di 'at')
KPB        Affirmation word (e.g. ya 'yes')
KBIL       Cardinal number (e.g. ribu 'thousand')
KAR        Preposition (e.g. atas 'above')
KAD        Adverb (e.g. sekarang 'now')
KAD-KEP    Adverb, abbreviation (e.g. dulu)
FOR        Foreign word (e.g. I 'I')
FOR-KEP    Foreign word, abbreviation (e.g. iols 'we')
FOR-NEG    Foreign word, negative (e.g. b*shit)
NEG        Negative word (e.g. bodoh 'stupid')
KD         Preposition (e.g. di 'at')
MW         Currency (e.g. MYR)
LD         Dialect
SL         Slang (e.g. pastu 'after that')
KEP        Abbreviation (e.g. pi for pergi 'go')
GL         Title (e.g. hang)
BY         Sound (non-speech)
AWL        Prefix (e.g. anti-)