=Paper=
{{Paper
|id=Vol-3226/paper4
|storemode=property
|title=TAMS: Text Augmentation using Most Similar Synonyms for SMS Spam Filtering
|pdfUrl=https://ceur-ws.org/Vol-3226/paper4.pdf
|volume=Vol-3226
|authors=Mohammad Qussai Jouban,Zakarya Farou
|dblpUrl=https://dblp.org/rec/conf/itat/JoubanF22
}}
==TAMS: Text Augmentation using Most Similar Synonyms for SMS Spam Filtering==
TAMS: Text Augmentation using Most Similar Synonyms for SMS Spam Filtering

Mohammad Qussai Jouban, Zakarya Farou
ELTE Eötvös Loránd University, Department of Data Science and Engineering, Institute of Industry-Academia Innovation, Budapest, Hungary

Abstract

Spam filtering is a non-standard derivative data science problem aiming to catch unsolicited and undesirable messages and prevent those messages from reaching a user's inbox. To solve this problem, we propose a text augmentation approach using the most similar synonyms, called TAMS. We used random forest and bidirectional LSTM classification models for the experimental part to assess the proposed approach. The results indicate that training the classifiers with synthesized spam messages generated by TAMS reduces the influence of the imbalance problem present by nature in the dataset and improves the overall performance of the classification models. Hence, this study shows the potential of using TAMS to enhance classification performance on textual data where an imbalance scenario is present.

Keywords

Spam filtering, Text classification, Text augmentation, Imbalanced learning

1. Introduction

Spam filtering is a non-standard derivative data science problem [1]. Derivative, because it is an extension of core problems, i.e., classification problems. Non-standard, since the data has an unusual distribution on the target variable; such problems belong to the family of imbalance problems. Spam filtering is also one of the most common problems in the Natural Language Processing (NLP) domain. The main target of this problem is to identify spam messages and filter them out from the legitimate messages. Solving this problem will be very beneficial for telecommunication companies, where text messaging is the most common non-voice use of a mobile phone. In fact, according to security firm Cloudmark, about 30 million spam messages are sent to cell phone users across North America, Europe, and the U.K.

This study aims to improve SMS spam filtering by solving the most common problem in the available datasets, which is the imbalance problem: most of the samples belong to the legitimate class, i.e., legitimate messages, the so-called majority class C-, and only a small proportion of the samples belongs to the spam class, i.e., spam messages, the so-called minority class C+. Dealing with an imbalanced dataset is one of the main challenges in machine learning, especially in classification problems, where most well-known classification models tend to be biased toward the majority class and fail to identify the minority class. The most common solution to the imbalanced-dataset problem is generating new samples belonging to the minority class to make the dataset balanced.

The paper is organized as follows: Section 2 describes the imbalance problem, data augmentation, the machine learning models used, related terminology, and related work. Section 3 introduces the proposed text augmentation approach TAMS with a detailed explanation and practical examples. The experimental results, including the dataset, evaluation metrics, and results with discussion, are presented in Section 4. Lastly, Section 5 outlines the conclusions of the conducted research and potential research directions to improve learning and classification on similar problems.

2. Background

Data science techniques are used to solve many problems. These methods can learn and likely extract hidden patterns from the data used as input. Regardless of data modality (e.g., textual, visual, tabular), we classify data science problems into standard and non-standard problems. Standard problems mainly concern supervised learning (predictive problems) and unsupervised learning (descriptive problems). However, there exist more complex (non-standard) problems than the cited ones.
These complex problems are derived or hybridized from the standard, i.e., core, problems.

ITAT'22: Information Technologies - Applications and Theory, September 23-27, 2022, Zuberec, Slovakia
* Corresponding author: Zakarya Farou
f7rg3j@inf.elte.hu (M. Q. Jouban); zakaryafarou@inf.elte.hu (Z. Farou)
ORCID: 0000-0003-3996-2656 (Z. Farou)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2.1. Imbalance problem

The imbalance problem occurs when the target class has very few samples compared to the other classes; it is classified as a non-standard derivative problem. Technically, we call a dataset imbalanced, regardless of its data modality, when there is a disproportion among the number of instances of each class, leaving some classes under-represented. Therefore, traditional machine learning (ML) algorithms have complications defining the target class's decision boundaries.

As various real-world applications and diverse domains fall under the imbalance problem, research on imbalanced data classification has expanded and gained more interest [2]. Mainly, we face the imbalance problem in credit card fraud detection [3], anomaly detection [4], e-mail foldering [5], medical diagnosis [6][7][8], particle identification [9], face recognition [10], fault diagnosis [11][12], text classification [13][14], and many others.

2.2. E-mail spam filtering

Spam emails, also known as junk emails, are messages transmitted by spammers via email. Users confront several issues such as abuse of traffic, limited storage space, wasted computational power, waste of users' time, and threats to user security. Therefore, appropriate email filtering is essential to provide more security and increase the efficacy of end users. Data scientists have conducted several lines of research on email filtering; some achieved good accuracy, and some are ongoing. For instance, in [15], the authors developed mobile SMS spam filtering for Nepali text and used naïve Bayes and support vector machines as classifiers, while in [16], Mohammed et al. present an approach for filtering spam email using machine learning algorithms. At first, they used tokenization to separate spam and ham words from the training data, utilized them to create testing and training tables, and experimented with various data mining algorithms. Furthermore, Singh et al. [17] discussed the solution and classification process of spam filtering and presented a combined classification technique to obtain better spam filtering results. Other studies, such as [18], proposed a method for detecting malicious spam through feature selection, improving the training time and accuracy of a malicious spam detection system. Recently, the authors in [19] presented an improved random forest for text classification that incorporates bootstrapping and random subspace methods simultaneously and tested its performance on an SMS binary-class dataset. The method removes inessential features, adds trees to the forest on each iteration, and monitors the classification performance of the random forest.

Despite the numerous proposals, most anti-spam strategies show some inconsistency between false negatives (missed spam) and false positives (rejected good emails) due to imbalance problems, which act as an obstacle to making anti-spam systems successful. Therefore, an adequate spam-filtering system that addresses imbalance issues is the prime demand for web users.

2.3. Data augmentation for textual data

Obtaining accurate results while training a classifier becomes difficult due to the lack of available, varied, and meaningful data, especially when the imbalance problem is present. Therefore, additional samples should be added to train the classifier more efficiently. However, gathering such data is time-consuming and needs domain experts to examine and annotate the data. As assembling such data is costly, synthesizing new data from the existing data seems a promising approach, specifically if the quality of the generated data is as good as the original. In the data science ecosystem, increasing the training dataset, i.e., generating additional samples from the existing ones, is known as data augmentation. For an imbalanced-class problem, data augmentation helps avoid overfitting, reduces the bias of the classifiers toward the majority class, and improves the generalization ability of the trained models. However, text augmentation is only worthwhile if the generated data contains new linguistic patterns that are pertinent to the task and have not yet been seen in pre-training.

In NLP, there are numerous data augmentation techniques, such as paraphrasing [20], close embeddings [21], swapping [22], inducing spelling mistakes [23], deleting [24], and synonym replacement [25][26][27]. For this study, we mainly focus on local data augmentation, particularly token substitution, because it is a cost-effective and easily accessible yet powerful textual data augmentation method. Token substitution is a popular method that replaces a token in a sentence with one of its synonyms.

2.4. Supervised machine learning

Machine learning [28] is a form of artificial intelligence that enables a system to learn from data rather than through explicit programming. We can use supervised learning algorithms for non-standard derivative problems such as imbalanced learning, as both the data and its desired labels are present. In this paper, we consider only two classifiers: random forest and bidirectional LSTM.

2.4.1. Random forests

According to [29], random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Random forests have hyperparameters similar to those of a decision tree. In addition, they have one very important hyperparameter: the number of estimators, i.e., the number of trees in the forest.

2.4.2. Bidirectional LSTM

Hochreiter and Schmidhuber first proposed LSTM back in 1997 to overcome the vanishing gradient problem of RNNs [30]. Its main idea is to introduce an adaptive gating mechanism, which decides the degree to which the previous state is kept and the extracted features of the current data input are memorized. LSTM models can recognize the relationship between values at the beginning and end of a sequence. For sequence modeling tasks, it is beneficial to have access to both past and future contexts. By the end of 1997, Schuster and Paliwal proposed BiLSTM to extend the unidirectional LSTM by introducing a second hidden layer, where the hidden-to-hidden connections flow in the opposite temporal order. Therefore, the model can exploit information from both the past and the future, which can improve model performance on sequence classification problems. BiLSTM models are primarily used in natural language processing applications like text classification, because BiLSTM is a powerful tool for modeling the sequential dependencies between words and phrases in both directions of a sequence.

Figure 1: Data augmentation based on the proposed TAMS
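The token substitution technique introduced in Section 2.3 can be sketched in a few lines. A minimal sketch in Python, using a hand-made synonym dictionary purely for illustration (the actual approach, described in Section 3, draws candidate synonyms from WordNet and filters them by Word2Vec similarity); the sentence mirrors the example later shown in Table 2:

```python
import itertools

# Hypothetical synonym dictionary, for illustration only; TAMS builds this
# mapping from WordNet and keeps only similarity-filtered synonyms.
SYNONYMS = {"defer": ["postpone"], "admission": ["admittance"]}

def substitute_tokens(tokens):
    """Yield every sentence variant obtained by swapping tokens with synonyms."""
    # For each token, the candidate set is the token itself plus its synonyms.
    options = [[tok] + SYNONYMS.get(tok, []) for tok in tokens]
    for combo in itertools.product(*options):
        yield " ".join(combo)

variants = list(substitute_tokens(["defer", "admission", "till", "next", "year"]))
# 2 x 2 = 4 variants: the original sentence plus three new ones
```

With two substitutable tokens of one synonym each, the method produces four combinations, three of which are new messages, matching the count in Table 2.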
3. Text augmentation using most similar synonyms

The diagram displayed in Fig. 1 summarizes the integration between the proposed text augmentation approach TAMS and the standard supervised learning training and evaluation process. It starts with data cleaning and common NLP preprocessing steps. After that, the training set is augmented using TAMS: TAMS generates synonyms for each word, filters them, and keeps the most similar synonyms to generate new messages. The augmented data is then used to train the chosen classifiers defined in Section 2.4, and their performance is evaluated on the test set.

3.1. Data cleaning and preprocessing

To prepare the textual data for model building, we performed the following text preprocessing steps:

* Duplicates removal: duplicate samples are problematic because when the same sample appears more than once, it receives a disproportionate weight during the training phase. Thus, models that succeed on recurring instances will look like they perform well, while in reality this is not the case. Additionally, duplicate samples can ruin the split between train, validation, and test sets when identical entries are not all in the same set, leading to biased performance estimates and to disappointing models in the prediction phase.
* Tokenization: this step splits each message into a list of words. This is necessary for two reasons: it is required to recognize the stop words and remove them in the next step, and it is also a requirement for using Word2Vec to compute a continuous vector representation for each word in the message.
* Removal of stop words: stop words are the most frequent words in any language, such as articles, prepositions, pronouns, and conjunctions. They do not add much information to the text. Examples of stop words in English are words like the, a, an, so, and what. Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from the text to focus on the critical information.
* Message representation: this step computes a numerical representation for each message, simply by taking the average of the vectors representing each word in that message, where these vectors are computed using the Word2Vec model. The resulting vector is the feature vector of the message. Word2Vec [31] is a model that computes continuous vector representations of words from large datasets. These word representations help establish the relationship between a word and other words of similar meaning through the created vector representations. Word2Vec models produce real-valued vectors, which allow machine learning algorithms to deal with textual data; at the same time, these vectors preserve the semantic meaning of the represented words, so that words with similar meanings are closer in space, which indicates their semantic similarity.

3.2. Proposed text augmentation method

The proposed text augmentation method aims to generate new spam messages based on the original ones by replacing some words in the message with their most similar synonyms. Fig. 2 summarizes the proposed TAMS approach: it starts from the preprocessed message tokens and ends by generating a set of semantically similar messages. In the following subsections, we explain each step in detail.

* Synonyms extraction: extracting all possible synonyms for each word in the sentence is done with the help of the WordNet database. WordNet is a lexical database of semantic relations between words introduced by [32]. It links words into semantic relations, including synonyms, antonyms, hyponyms, and other morphological relations. Fig. 3 shows the synonyms of the word car in a tree-like structure, where the tree's root is the main word and each node of the first level represents a synonym. Each synonym is connected to the root via an edge whose weight represents its similarity.

Figure 3: Synonyms extraction and filtering process

* Finding most similar synonyms: to choose the most similar synonyms, every synonym is represented using its Word2Vec representation, and the cosine distance is used to measure the similarity between the word and its synonyms.
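The similarity-based filtering of this step can be sketched compactly. A minimal sketch, where the 3-dimensional "embeddings" and the synonym list for car are invented for illustration; in TAMS the synonyms come from WordNet and the vectors from a trained Word2Vec model:

```python
from math import sqrt

# Hypothetical word vectors, for illustration only.
EMB = {
    "car": [0.9, 0.3, 0.1],
    "auto": [0.8, 0.4, 0.2],
    "cable_car": [0.1, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def most_similar_synonyms(word, synonyms, t_s):
    """Keep only the synonyms whose similarity to the word is >= threshold t_s."""
    return [s for s in synonyms if s in EMB and cosine(EMB[word], EMB[s]) >= t_s]

kept = most_similar_synonyms("car", ["auto", "cable_car"], t_s=0.5)
# "auto" passes the 0.5 threshold, "cable_car" is filtered out
```

With a threshold of 0.5, only synonyms that are close to the main word in embedding space survive, which is exactly the filtering illustrated in Fig. 3.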
The most similar synonyms are the synonyms whose similarity is greater than or equal to a predefined similarity threshold t_s. Fig. 3 shows the most similar synonyms for the word car in the case t_s = 0.5.

Optimizing t_s is essential, as it explicitly impacts classification performance. A grid-search-like process is used to discover the optimal t_s by training multiple models on different augmented data according to candidate similarity thresholds t_s = [0.625, 0.65, 0.675, 0.7, 0.725, 0.75]. These candidates are selected according to the augmented data's spam percentage S_p. For example, with t_s = 0.625, TAMS generates data with S_p = 55.98%; in this case, the augmented data is approximately balanced. However, lower values (t_s < 0.625) would yield an imbalanced data situation. For the highest candidate value, t_s = 0.75, the corresponding spam percentage is 20.10%, and for higher values (t_s > 0.75) the augmented data would be almost the same as the original data, with high imbalance. To determine the best threshold t_s, we calculate a rank R for each candidate, as shown in Eq. 1:

R = (r(SC) + r(BH) + r(MCC) + r(F1)) / 4    (1)

R is the mean of the ranks of Spam Caught (SC), Blocked Ham (BH), Matthews Correlation Coefficient (MCC), and F1 score (F1), i.e., the sum of the ranks divided by the number of metrics (these metrics are defined in Section 4.3). For each metric, we rank the scores in descending order between 6 and 1 (6 being the number of candidate similarity thresholds): we give 6 to the best metric score for a specific t_s and 1 to the worst. Table 1 shows the results of six random forest models and six bidirectional LSTM models; each model was trained using a different augmented training set according to the candidate similarity thresholds specified above. The t_s with the best R for each model is used in the experimental part.

Table 1: Random forest and bidirectional LSTM models with similarity threshold grid search results.

  Classification model   t_s     S_p (%)   MCC      SC       BH     F1       R
  Random forest          0.625   55.98     0.8602   0.8321   0.99   0.8755   5
  Random forest          0.65    46.43     0.8358   0.7939   0.99   0.8525   3
  Random forest          0.675   40.63     0.8496   0.8015   0.77   0.8642   4.25
  Random forest          0.7     32.09     0.8539   0.7939   0.55   0.8667   4.75
  Random forest          0.725   24.12     0.8295   0.7481   0.44   0.8412   3
  Random forest          0.75    20.10     0.7998   0.7023   0.44   0.8105   2.25
  Bidirectional LSTM     0.625   55.98     0.9296   0.9313   0.77   0.9384   3.5
  Bidirectional LSTM     0.65    46.43     0.9336   0.9237   0.55   0.9416   5.25
  Bidirectional LSTM     0.675   40.63     0.9332   0.9008   0.22   0.9402   4.5
  Bidirectional LSTM     0.7     32.09     0.9196   0.8626   0.0    0.9262   2.25
  Bidirectional LSTM     0.725   24.12     0.9286   0.8931   0.22   0.936    2.75
  Bidirectional LSTM     0.75    20.10     0.9293   0.9237   0.66   0.938    3.5

* Text augmentation: after specifying the most similar synonyms for each word in a given message, text augmentation is done by generating all possible combinations of word-synonym replacements and applying each replacement to the original message, where each replacement generates a new message semantically similar to the original one. Table 2 shows an example of the TAMS approach with a threshold t_s = 0.6, where three new messages are generated.

Table 2: Example of text augmentation with a message.

  Original text    Defer admission till next year
  Augmented text   Postpone admission till next year
                   Defer admittance till next year
                   Postpone admittance till next year

* Expanding the training set size: once the text augmentation is done, we append the generated textual data to the original training set and use the extended training set to train the classifiers.

Figure 2: Summary of TAMS approach.

4. Experiments and results

The code for the experimental part was written in the Google Colab environment and is available via this link: https://colab.research.google.com/drive/12LGx9j6OtEIadERJXJtmqaBaUaqVjz9S?usp=sharing

4.1. Dataset description

We used the SMS Spam Collection dataset [33] for the experimental part. The dataset has 5574 SMS messages, distributed as follows: 4825 ham messages with a percentage of 86.6% and 747 spam messages with a percentage of 13.4%, which means that the SMS Spam Collection dataset has an imbalance ratio of IR = 6.46 (4825/747 ≈ 6.46). IR [34] for binary classification problems is computed by Eq. 2:

IR = Csize- / Csize+,  where IR >= 1    (2)

where Csize- and Csize+ represent the majority and minority class sizes, respectively.

4.2. Train-test split

To make the evaluation process accurate and realistic, the dataset was split in a stratified way, i.e., the spam message percentage S_p and the ham message percentage H_p are approximately the same in both the training set and the testing set. The train set and test set statistics are shown in Table 3. The test set contains 1034 messages, 20% of the original dataset, while the train set contains 4135 messages.

Table 3: The train set and test set statistics.

  Set        S_p %    H_p %    Number of SMS
  Training   12.6 %   87.4 %   4135
  Testing    12.7 %   87.3 %   1034

4.3. Evaluation metrics

The confusion matrix, Matthews Correlation Coefficient, Spam Caught, Blocked Hams, and F1 score are used to evaluate and compare the proposed text augmentation method and to measure the performance of the SMS spam filters.

4.3.1. Confusion matrix

The confusion matrix (CM) is a technique for summarizing the prediction results of a classification model; Table 4 defines the CM for binary classification problems.

Table 4: Confusion matrix.

              Predicted C+           Predicted C-
  Actual C+   True Positives (TP)    False Negatives (FN)
  Actual C-   False Positives (FP)   True Negatives (TN)

4.3.2. F1 score

The F1 score is the precision-recall trade-off, i.e., it combines the precision and recall metrics into a single metric, and it is calculated using Eq. 3. The F1 score has been designed to work well on imbalanced data.

F1 = 2 * (Precision * Recall) / (Precision + Recall)    (3)

where Precision is computed by Eq. 4:

Precision = TP / (TP + FP)    (4)

while Recall is computed by Eq. 5:

Recall = TP / (TP + FN)    (5)

4.3.3. Matthews Correlation Coefficient

MCC measures the quality of binary classifications and takes values in the range [-1, +1], where +1 means a perfect prediction, 0 indicates a random prediction, and -1 means an inverse prediction. It is given by Eq. 6:

MCC = (TP * TN - FP * FN) / sqrt(S)    (6)

where S = (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN).

4.3.4. Spam Caught

Spam Caught is equivalent to the True Positive Rate (TPR), or recall: it is the number of spam messages detected by the spam filter over the number of all spam messages, i.e., the measure of correctly identified true positives, and it is defined by Eq. 7:

SC = TP / (TP + FN)    (7)

4.3.5. Blocked Hams

Blocked Hams is equivalent to the False Positive Rate (FPR). A score close to 0 is preferred, as it reflects few false predictions. FPR is the number of legitimate messages classified as spam by the spam filter over the number of all legitimate messages, and it is defined in Eq. 8:

BH = FP / (FP + TN)    (8)

4.4. Experimental results

4.4.1. Random forest

In this experiment, a random forest classifier with hyperparameters n_estimators = 200, min_samples_split = 20, max_features = 25, and the Gini criterion is used to implement the spam filter. The first part of the experiment depends only on the original, i.e., imbalanced, data. Fig. 4a shows the confusion matrix summarizing the trained model's predictions on the test set, and Fig. 4b shows the results of 10-fold cross-validation (10-CV) applied to the original training set, while Table 5 shows the resulting evaluation measures based on the test set.

Figure 4: Confusion matrices of RF trained on original training data. (a) Confusion matrix of the RF model tested on the test data. (b) CM of the RF model with 10-CV applied to the training data.

In the second part of this experiment, a random forest model with the same hyperparameters is trained using the augmented data produced by the proposed TAMS. As previously discussed, the optimal threshold t_s is determined based on Table 1; therefore, we chose t_s = 0.625, as it has the highest rank R. Results of TAMS-RF are displayed in Fig. 5a, which shows the confusion matrix obtained using the test set, and Fig. 5b, which shows the results of 10-fold cross-validation applied to the augmented training set.

Figure 5: Confusion matrices of TAMS-RF. (a) Confusion matrix of the TAMS-RF model tested on the test data. (b) CM of the TAMS-RF model with 10-CV applied to the training data.

As Fig. 4, Fig. 5, and Table 5 show, the model trained on the augmented data using the TAMS approach (TAMS-RF) performs better than the model trained on the original data alone on most of the evaluation metrics used. TAMS improved the MCC by 12.36%, SC by 30.96%, and F1 score by 13.67%, which means that the proposed text augmentation improved spam detection: we reduced the bias of RF toward the ham class and improved the generalization ability of the models. We can conclude that TAMS generated data with new linguistic patterns that are pertinent to the task and had not yet been seen in pre-training. However, the RF trained on the original data has a better BH score than TAMS-RF, but that does not mean it is the better model. On the contrary, it means that the first model is biased toward the ham class and makes fewer spam predictions, i.e., it could not detect the spam messages properly.

4.4.2. Bidirectional LSTM

In this experiment, a bidirectional LSTM model with the Adam optimizer and a sigmoid activation function at the output layer is used to implement the spam filter. It was trained with a batch size of 10 for ten epochs. Similarly to the RF, the first part of the experiment depends only on the original data; Fig. 6a shows the confusion matrix with FP = 6 and FN = 13, and Table 5 shows the resulting values of the evaluation metrics. In the second part, a BiLSTM model with the same structure is trained using the augmented data generated by TAMS. For TAMS-BiLSTM, we used t_s = 0.65, as it has the highest rank R (see Table 1).

Figure 6: Confusion matrices of two BiLSTM models; the first model is trained using the original data and the second model using the augmented data. (a) CM of the BiLSTM model trained using the non-augmented dataset. (b) CM of the BiLSTM model trained using the augmented data according to TAMS.

The confusion matrix in Fig. 6b shows a decrease in false predictions, with FP = 4 and FN = 11, while Table 5 shows that the TAMS-BiLSTM model outperforms the BiLSTM according to all the suggested evaluation metrics; the improvements are as follows: MCC by 2%, SC by 1.7%, BH by 33%, and F1 score by 1.76%.

Both experiments proved that augmenting the training set using the proposed TAMS approach helped the classification models increase their ability to detect spam messages without becoming biased toward the majority class, and that synthesizing new data from the existing data is indeed an excellent alternative to data collection and annotation.

Table 5: Summary of the experimental results.

  Classification model   Train set   MCC      SC       BH     F1
  Random forest          OD          0.7697   0.6412   0.22   0.7742
  Random forest          TAMS        0.8648   0.8397   0.99   0.88
  Bidirectional LSTM     OD          0.9155   0.9007   0.66   0.9255
  Bidirectional LSTM     TAMS        0.9334   0.916    0.44   0.9418

5. Conclusion

Nowadays, the spam filtering task is still a real challenge, because most of the available datasets are imbalanced. Dealing with such non-standard derivative datasets is a common problem in classification tasks, especially in the spam filtering case. We proposed TAMS, a text augmentation approach based on most-similar-synonym replacement, to enhance the quality of supervised learning models and solve the spam filtering problem. Experimental results showed that generating additional samples from the existing ones using TAMS added new linguistic patterns pertinent to the task and helped improve the classification performance of traditional classifiers like the random forest and deep learning models like the bidirectional LSTM. We can deduce that TAMS increased the ability of both models to identify spam messages, reduced the bias toward the majority class, and improved the trained models' generalization ability.

In future work, we aim to improve TAMS further and enhance the quality of its generated data. Furthermore, we have to test our method on other textual datasets and compare it with other text augmentation methods to ensure that the proposed model is generic and not specific to spam filtering exclusively.
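The metric definitions of Section 4.3 can be sanity-checked numerically. A minimal sketch, assuming the test-set composition of Table 3 (about 131 spam and 903 ham messages out of 1034) together with the FP = 6 and FN = 13 reported for the BiLSTM trained on the original data; the resulting TP = 118 and TN = 897 are inferred values, not figures stated in the paper:

```python
from math import sqrt

# Confusion-matrix cells for the BiLSTM trained on the original data.
# FP and FN are reported in the paper; TP and TN are inferred (assumption)
# from the test set of ~131 spam / 903 ham messages.
TP, FN, FP, TN = 118, 13, 6, 897

SC = TP / (TP + FN)                           # Spam Caught (recall), Eq. 7
BH = FP / (FP + TN)                           # Blocked Hams (FPR), Eq. 8
precision = TP / (TP + FP)                    # Eq. 4
F1 = 2 * precision * SC / (precision + SC)    # Eq. 3
S = (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
MCC = (TP * TN - FP * FN) / sqrt(S)           # Eq. 6
```

The computed scores land within rounding distance of the OD row for the bidirectional LSTM in Table 5 (SC 0.9007, F1 0.9255, MCC 0.9155), with BH reported there as a percentage (0.66, i.e., 0.66%).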
Acknowledgments

This research is supported by the ÚNKP-21-3 New National Excellence Program of the Ministry for Innovation and Technology from the source of the National Research, Development and Innovation Fund.

References

[1] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, F. Herrera, Learning from imbalanced data sets, volume 11, Springer, 2018.
[2] F. Thabtah, S. Hammoud, F. Kamalov, A. Gonsalves, Data imbalance in classification: Experimental evaluation, Information Sciences 513 (2020) 429-441.
[3] N. Malave, A. V. Nimkar, A survey on effects of class imbalance in data pre-processing stage of classification problem, International Journal of Computational Systems Engineering 6 (2020) 63-75.
[4] Q. Chen, A. Zhang, T. Huang, Q. He, Y. Song, Imbalanced dataset-based echo state networks for anomaly detection, Neural Computing and Applications 32 (2020) 3685-3694.
[5] P. Bermejo, J. A. Gámez, J. M. Puerta, Improving the performance of naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets, Expert Systems with Applications 38 (2011) 2072-2080.
[6] D. Gan, J. Shen, B. An, M. Xu, N. Liu, Integrating TANBN with cost sensitive classification algorithm for imbalanced data in medical diagnosis, Computers & Industrial Engineering 140 (2020) 106266.
[7] M. Kinal, M. Woźniak, Data preprocessing for DES-kNN and its application to imbalanced medical data classification, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2020, pp. 589-599.
[8] Z. Farou, N. Mouhoub, T. Horváth, Data generation using gene expression generator, in: C. Analide, P. Novais, D. Camacho, H. Yin (Eds.), Intelligent Data Engineering and Automated Learning - IDEAL 2020, Springer International Publishing, Cham, 2020, pp. 54-65.
[9] Z. Farou, S. Ouaari, B. Domian, T. Horváth, Directed undersampling using active learning for particle identification, in: Recent Innovations in Computing, Springer, 2022, pp. 149-162.
[10] X. Bai, Y. Hu, P. Zhou, F. Shang, S. Shen, Data augmentation imbalance for imbalanced attribute classification, arXiv preprint arXiv:2004.13628 (2020).
[11] W. Zhang, X. Li, X.-D. Jia, H. Ma, Z. Luo, X. Li, Machinery fault diagnosis with imbalanced data using deep generative adversarial networks, Measurement 152 (2020) 107377.
[12] W. Hao, F. Liu, Imbalanced data fault diagnosis based on an evolutionary online sequential extreme learning machine, Symmetry 12 (2020) 1204.
[13] Y. Liu, H. T. Loh, A. Sun, Imbalanced text classification: A term weighting approach, Expert Systems with Applications 36 (2009) 690-701.
[14] J. Jang, Y. Kim, K. Choi, S. Suh, Sequential targeting: an incremental learning approach for data imbalance in text classification, arXiv preprint arXiv:2011.10216 (2020).
[15] T. B. Shahi, A. Yadav, et al., Mobile SMS spam filtering for Nepali text using naïve Bayesian and support vector machine, International Journal of Intelligence Science 4 (2014) 24-28.
[16] S. Mohammed, O. Mohammed, J. Fiaidhi, S. Fong, T. H. Kim, Classifying unsolicited bulk email (UBE) using Python machine learning techniques, International Journal of Hybrid Information Technology 6 (2013) 43-56.
[17] V. K. Singh, S. Bhardwaj, Spam mail detection using classification techniques and global training set, in: Intelligent Computing and Information and Communication, Springer, 2018, pp. 623-632.
[18] U. K. Sah, N. Parmar, An approach for malicious spam detection in email with comparison of different classifiers, International Research Journal of Engineering and Technology (IRJET) 4 (2017) 2238-2242.
[19] N. Jalal, A. Mehmood, G. S. Choi, I. Ashraf, A novel improved random forest for text classification using feature ranking and optimal number of trees, Journal of King Saud University - Computer and Information Sciences (2022).
[20] C. Mi, L. Xie, Y. Zhang, Improving data augmentation for low resource speech-to-text translation with diverse paraphrasing, Neural Networks 148 (2022) 194-205.
[21] M. Kim, P. Kang, Text embedding augmentation based on retraining with pseudo-labeled adversarial embedding, IEEE Access 10 (2022) 8363-8376.
[22] S. Bonthu, A. Dayal, M. Lakshmi, S. Rama Sree, Effective text augmentation strategy for NLP models, in: Proceedings of Third International Conference on Sustainable Computing, Springer, 2022, pp. 521-531.
[23] C. Coulombe, Text data augmentation made simple by leveraging NLP cloud APIs, CoRR abs/1812.04718 (2018). URL: http://arxiv.org/abs/1812.04718.
[24] S. Qiu, B. Xu, J. Zhang, Y. Wang, X. Shen, G. de Melo, C. Long, X. Li, EasyAug: An automatic textual data augmentation platform for classification tasks, Association for Computing Machinery, New York, NY, USA, 2020, pp. 249-252. URL: https://doi.org/10.1145/3366424.3383552.
[25] Z. Feng, H. Zhou, Z. Zhu, K. Mao, Tailored text augmentation for sentiment analysis, Expert Systems with Applications (2022) 117605.
[26] R. Xiang, E. Chersoni, Q. Lu, C.-R. Huang, W. Li, Y. Long, Lexical data augmentation for sentiment analysis, Journal of the Association for Information Science and Technology 72 (2021) 1432-1447.
[27] D. T. Vu, G. Yu, C. Lee, J. Kim, Text data augmentation for the Korean language, Applied Sciences 12 (2022) 3425.
[28] A. Taan, Z. Farou, Supervised learning methods for skin segmentation based on pixel color classification, Central-European Journal of New Technologies in Research, Education and Practice (2021).
[29] L. Breiman, Random forests, Machine Learning 45 (2001) 5-32.
[30] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).
[31] K. W. Church, Word2Vec, Natural Language Engineering 23 (2017) 155-162.
[32] C. Fellbaum, WordNet, in: Theory and Applications of Ontology: Computer Applications, Springer, 2010, pp. 231-243.
[33] J. M. G. Hidalgo, T. A. Almeida, A. Yamakami, On the validity of a new SMS spam collection, in: 2012 11th International Conference on Machine Learning and Applications, volume 2, IEEE, 2012, pp. 240-245.
[34] I. Cordón, S. García, A. Fernández, F. Herrera, Imbalance: oversampling algorithms for imbalanced classification in R, Knowledge-Based Systems 161 (2018) 329-341.