=Paper=
{{Paper
|id=Vol-3226/paper4
|storemode=property
|title=TAMS: Text Augmentation using Most Similar Synonyms for SMS Spam Filtering
|pdfUrl=https://ceur-ws.org/Vol-3226/paper4.pdf
|volume=Vol-3226
|authors=Mohammad Qussai Jouban,Zakarya Farou
|dblpUrl=https://dblp.org/rec/conf/itat/JoubanF22
}}
==TAMS: Text Augmentation using Most Similar Synonyms for SMS Spam Filtering==
<pdf width="1500px">https://ceur-ws.org/Vol-3226/paper4.pdf</pdf>
<pre>
TAMS: Text Augmentation using Most Similar Synonyms for
SMS Spam Filtering
Mohammad Qussai Jouban, Zakarya Farou
ELTE Eötvös Loránd University, Department of Data Science and Engineering, Institute of Industry - Academia Innovation, Budapest, Hungary


                                        Abstract
                                        Spam filtering is a non-standard derivative data science problem aiming to catch unsolicited and undesirable messages and
                                        prevent those messages from reaching a user’s inbox. To solve the abovementioned problem, we propose a text augmentation
                                        approach using the most similar synonyms called TAMS. We used Random forest and Bidirectional LSTM classification
                                        models for the experimental part to assess the proposed approach. The results indicate that training the classifiers with
                                        synthesized spam messages generated by TAMS reduces the influence of the imbalance problem present by nature in the
                                        dataset and improves the overall performance of the classification models. Hence, this study shows the potential of using
                                        TAMS to enhance the classification performance on textual data where the imbalance scenario is present.

                                        Keywords
                                        Spam filtering, Text classification, Text Augmentation, Imbalance learning


1. Introduction                                              minority class. The most common solution to the im-
                                                             balanced datasets problem is generating new samples
Spam filtering is a non-standard derivative data science belonging to the minority class to make the dataset bal-
problem [1]. Derivative because it is an extension of anced.
core problems, i.e., classification problems. Non-standard,    The paper is organized as follows: Section 2 describes
since the data has an unusual distribution on the target the imbalance problem, data augmentation, used ma-
variable, such problems belong to the imbalance problems chine learning models, some related terminologies and
family. Spam filtering is also one of the most common related works. Section 3 introduces the proposed text
problems in the Natural Language Processing domain. augmentation approach TAMS with a detailed explana-
The main target of this problem is to identify the spam tion and practical examples. The experimental results,
messages and filter them out from the legitimate mes- including the dataset, evaluation metrics, and results with
sages. Solving this problem will be very beneficial for discussion, are presented in Section 4. Lastly, Section 5
telecommunication companies, where text messaging is outlines the conclusion about the conducted research and
the most common non-voice use of a mobile phone. In the potential research direction to improve learning and
fact, according to security firm Cloudmark, about 30 mil- classification of similar problems.
lion spam messages are sent to cell phone users across
North America, Europe, and the U.K.
   This study aims to improve the SMS spam filtering 2. Background
by solving the most common problem in the available
datasets, which is the imbalance problem, where most of Data science techniques are used to solve many prob-
the samples belong to the legitimate class, i.e., legitimate lems. These methods can learn and likely extract hidden
messages, the so-called majority class 𝐶 − , and a small patterns from the data used as input. Regardless of data
proportion of the samples belongs to the spam class, i.e., modality (e.g., textual, visual, tabular), we classify data
spam messages, so-called minority class 𝐶 + . Dealing science problems into standard and non-standard prob-
with an imbalanced dataset is one of the main challenges lems. Standards problems mainly concern supervised
in machine learning, especially in classification problems, learning (predictive problems) and unsupervised learn-
where most well-known classification models tend to be ing (descriptive problems). However, there exist more
biased toward the majority class and fail to identify the complex (non-standard) problems than the cited ones.
                                                             These complex problems are derived or hybridized from
                                                             the standard, i.e., core problems.
ITAT’22: Information technologies – Applications and Theory, Septem-
ber 23–27, 2022, Zuberec, Slovakia
*
  Zakarya Farou.                                                                                           2.1. Imbalance problem
$ f7rg3j@@inf.elte.hu (M. Q. Jouban); zakaryafarou@inf.elte.hu
(Z. Farou)                                                                                                                       Imbalance problem occurs when the target class has very
 0000-0003-3996-2656 (Z. Farou)                                                                                                 few samples opposed to the other classes, It is classified
                                    © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                    Attribution 4.0 International (CC BY 4.0).                                                   as a non-standard derivative problem.
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073
                                    CEUR Workshop Proceedings (CEUR-WS.org)
   Technically, we call a dataset imbalanced regardless of      and monitors the classification performance of RF.
its data modality when there is a disproportion among
the number of instances of each class, making classes           2.3. Data augmentation for textual data
under-represented. Therefore, traditional machine learn-
ing (ML) algorithms have complications defining the tar-        Obtaining accurate results while training a classifier be-
get class’s decision boundaries.                                comes difficult due to the lack of available, varied, and
   As various real-world applications and diverse do-           meaningful data, especially when the imbalance prob-
mains fall under the imbalance problem, researches on           lem is present. Therefore, additional samples should be
imbalanced data classification have expanded and gained         added to train the classifier more efficiently. However,
more interest [2]. Mainly, we face the imbalance problem        gathering such data is time-consuming and needs domain
in credit card fraud detection [3], anomaly detection [4],      experts that examine and annotate the data.
e-mail foldering [5], medical diagnosis [6] [7] [8], parti-        As assembling such data is costly, synthesizing new
cles identification [9], face recognition [10], fault diag-     data from the existing ones seems to be a promising ap-
nosis [11] [12], text classification [13] [14], and many        proach, specifically if the quality of the generated data is
others.                                                         as good as the original one. In the data science ecosys-
                                                                tem, increasing the training dataset, i.e., generating addi-
                                                                tional samples from the existing ones, is known as data
2.2. E-Mail Spam Filtering
                                                                augmentation. For an imbalanced class problem, data
Spam emails, also known as junk emails, are messages            augmentation would help avoid overfitting, reduce the
transmitted by spammers via email. Users are con-               bias of the classifiers toward the majority class, and im-
fronting several issues such as the abuse of traffic, limited   proving the generalization ability of the trained models.
storage space, computational power, waste of users’ time,       However, text augmentation would only be worthwhile
and threat to user security. Therefore, appropriate email       if the generated data has new linguistic patterns that
filtering is essential to provide more security and increase    are pertinent to the task and have not yet been seen in
the efficacy of end users. Data Scientists conducted sev-       pre-training.
eral types of research on email filtering; some achieved           In NLP, there are numerous data augmentation tech-
good accuracy, and some continued. For instance, in [15],       niques, such as paraphrasing [20], close embeddings [21],
the authors developed a mobile SMS spam filtering for           swapping [22], inducing spelling mistakes [23], delet-
Nepali text and used Naïve bayesian and support vec-            ing [24], and synonyms replacement [25][26][27].
tor machines as classifiers, while in [16], Mohammed et            For this study, we mainly focus on local data augmen-
al. present an approach for filtering spam email using          tation, particularly token substitution, because it is a
machine learning algorithms. At first, they used the to-        cost-effective and easily accessible yet powerful textual
kenization method to filter spam and ham words from             data augmentation method. Token substitution is a pop-
the training data and utilized them to create testing and       ular method that replaces a token in the sentence with
training tables and experienced with various data min-          its synonym.
ing algorithms. Furthermore, Singh et al. [17] discussed
the solution and classification process of spam filtering       2.4. Supervised machine learning
and presented a combining classification technique to get
better spam filtering results. Other studies such as [18]       Machine learning [28] is a form of artificial intelligence
proposed a method for detecting malicious spam through          that enables a system to learn from data rather than
feature selection and improving the training time and           through explicit programming. We can use supervised
accuracy of a malicious spam detection system.                  learning algorithms for non-standard derivative problems
    Despite the numerous proposals, most anti-spam              such as imbalance learning, as both data and its desired
strategies have some inconsistency between false nega-          label are present. In this paper, we are considering only
tives (missed spam) and false positives (rejecting good         two classifiers, random forest and bidirectional LSTM.
emails) due to imbalance problems that act as an obsta-
cle for most systems to make anti-spam systems suc-             2.4.1. Random forests
cessful. Therefore, an adequate spam-filtering system
that addresses imbalance issues is the prime demand for         according to [29], random forests are a combination of
web users. Recently, the authors in [19] presented an           tree predictors such that each tree depends on the val-
improved random forest for text classification that in-         ues of a random vector sampled independently and with
corporates bootstrapping and random subspace methods            the same distribution for all trees in the forest. Random
simultaneously and tested its performance on the SMS            forests have similar hyperparameters to the Decision
binary class dataset. The method removes inessential            Tree. In addition, they have a very important hyper-
features, adds some trees in the forest on each iteration,
parameter, which is the number of estimators, i.e., the
number of trees in the forest.

2.4.2. Bidirectional LSTM
Hochreiter and Schmidhuber firstly proposed LSTM back
in 1997 to overcome the gradient vanishing problem of
RNN [30]. Its main idea is to introduce an adaptive gat-
ing mechanism, which decides the degree to keep the
previous state and memorize the extracted features of
the current data input. LSTM models can recognize the
relationship between values at the beginning and end
of a sequence. For the sequence modeling tasks, it is
beneficial to have access to the past and future contexts.
By the end of 1997, Schussed and Palatal proposed BiL-
STM to extend the unidirectional LSTM by introducing a
second hidden layer, where the hidden to hidden connec-
tions flow in the opposite temporal order. Therefore, the
model can exploit information from both the past and
the future, which can improve model performance on
sequence classification problems. BiLSTM models are pri-
marily used in natural language processing applications
like text classification because BiLSTM is a powerful tool
for modeling the sequential dependencies between the Figure 1: Data augmentation based on the proposed TAMS
words and phrases in both directions of the sequence.


3. Text augmentation using most                                  leading to biased performance estimates leading
   similar synonyms                                              to disappointing models in the prediction phase.
                                                               • Tokenization: this step aims to split each mes-
The diagram displayed in Fig. 1 summarizes the integra-          sage into a list of words, and this is necessary
tion between the proposed text augmentation approach             for two reasons, it is required to recognize the
TAMS and the standard supervised learning training and           stop word and remove them in the next step, and
evaluation process. It starts with data cleaning and com-        it is also a requirement to use the Word2Vec to
mon NLP preprocessing steps. After that, the training            compute a continuous vector representation for
set is augmented by using TAMS. TAMS generates syn-              each word in the message.
onyms for each word, then filters them, and keeps the          • Removal of stop words: stop words are the
most similar synonyms to generate new messages. Then             most frequent words in any language, such as ar-
the augmented data will be used to train the chosen clas-        ticles, prepositions, pronouns, and conjunctions.
sifiers defined in Section 2.4 and evaluate their perfor-        They do not add much information to the text.
mance on the test set.                                           Examples of stop words in English are words like
                                                                 the, a, an, so, what, and many more. Stop words
3.1. Data cleaning and preprocessing                             are available in abundance in any human lan-
                                                                 guage. By removing these words, we remove the
To prepare the textual data for the model building we            low-level information from the text to focus on
performed the following text preprocessing steps:                critical ones.
     • Duplicates removal: duplicate samples are prob-         • Message representation: this step aims to com-
       lematic as when the same sample appears more              pute a numerical representation for each message
       than once; it receives a disproportionate weight          by computing a vector that represents the mes-
       during the training phase. Thus models that suc-          sage simply by taking the average of vectors rep-
       ceed in recurring instances will look like they           resenting each word in that message, where these
       perform well, while in reality, this is not the case.     vectors are computed using the Word2Vec model.
       Additionally, duplicate samples can ruin the split        The resulting vector will be the feature vector of
       between train, validation, and test sets in cases         the message. Word2Vec [31] is a model that com-
       where identical entries are not all in the same set,      putes continuous vector representations of words
       from large data sets. These word representations
       help establish the relationship between a word
       and the other similar meaning words through the
       created vector representations. Word2Vec models
       produce real-valued vectors, which allow the ma-
       chine learning algorithm to deal with the textual
       data, and at the same time, these vectors keep
       the semantic meaning of the represented words,
       where the similar meaning words are closer in           Figure 3: Synonyms extraction and filtering process
       space, which indicates their semantic similarity.

3.2. Proposed text augmentation method                         main word, and each node of the first level of
The proposed text augmentation method aims to gener-           the tree represents synonyms. Each synonym is
ate new spam messages based on the original ones by            connected to the root via an edge with a weight
replacing some words in the message with their most sim-       representing its similarity.
ilar synonyms. Fig. 2 summarizes the proposed TAMS           • Finding most similar synonyms: In order to
approach, starting from the preprocessed message text          choose the most similar synonyms, every syn-
tokens and it ends by generating a set of semantically         onym is represented using the Word2Vec rep-
similar messages. In the following subsections, we will        resentation, and the cosine distance is used to
explain each step in detail.                                   measure the similarity between the word and its
                                                               synonyms.
                                                               The most similar synonyms are the synonyms
                                                               that have a similitude greater than or equal to a
                                                               predefined similarity threshold 𝑆𝑇 . Fig. 3 shows
                                                               the most similar synonyms for the word car in
                                                               case 𝑆𝑇 = 0.5.
                                                               Optimizing 𝑆𝑇 is essential as it explicitly im-
                                                               pacts the classification performances. A grid
                                                               search-like process is done to discover the opti-
                                                               mal 𝑆𝑇 by training multiple models using diverse
                                                               augmented data according to candidate similar-
                                                               ity thresholds. Candidate similarity thresholds
                                                               are 𝑆𝑇 = [0.625, 0.65, 0.675, 0.7, 0.725, 0.75].
                                                               These candidates are selected according to the
                                                               augmented data’s spam percentage 𝑆𝑃 . For ex-
                                                               ample, by using 𝑆𝑇 = 0.625, TAMS will gen-
                                                               erate data with an 𝑆𝑃 = 55.98%, in this case,
                                                               the augmented data is approximately balanced.
                                                               However, lower values (𝑆𝑇 < 0.625) will yield
                                                               an imbalanced data situation. For the highest
                                                               candidate value i.e., 𝑆𝑇 = 0.75, the correspond-
                                                               ing spam percentage is 20.10%, and for higher
Figure 2: Summary of TAMS approach.                            values(𝑆𝑇 > 0.75), the augmented data will be
                                                               the same as the original data with a high imbal-
                                                               ance. To determine the best threshold 𝑆𝑇 , We
     • Synonyms extraction: extracting all possible            calculate a rank 𝑅 for each candidate as shown
       synonyms for each word in the sentence is done          in Eq 1:
       with the help of the WordNet database. WordNet
       is a lexical database of semantic relations between             𝑟(𝑆𝐶) + 𝑟(𝐵𝐻) + 𝑟(𝑀 𝐶𝐶) + 𝑟(𝐹1 )
       words introduced by [32]. It links words into se-        𝑅=
                                                                                        4
       mantic relations, including synonyms, antonyms,                                                       (1)
       hyponyms, and other morphological relations.            𝑅 is the mean average of Spam Caught (𝑆𝐶),
       Fig. 3 shows synonyms of the word Car in a              Blocked Ham (𝐵𝐻), Matthews Correlation Coef-
       tree-like structure where the tree’s root is the        ficient (𝑀 𝐶𝐶), and F1 Score (𝐹1 ) ranks divided
Table 1
Random forest and Bidirectional LSTM models with similarity threshold grid search results.

                   classification model    𝑆𝑇      𝑆𝑃        MCC           SC      BH       𝐹1        R
                                          0.625    55.98     0.8602      0.8321   0.99    0.8755     5
                                          0.65     46.43     0.8358      0.7939   0.99    0.8525      3
                                          0.675    40.63     0.8496      0.8015   0.77    0.8642    4.25
                      Random forest
                                           0.7     32.09     0.8539      0.7939   0.55    0.8667    4.75
                                          0.725    24.12     0.8295      0.7481   0.44    0.8412      3
                                          0.75     20.10     0.7998      0.7023   0.44    0.8105    2.25
                                          0.625    55.98     0.9296      0.9313    0.77   0.9384     3.5
                                          0.65     46.43     0.9336      0.9237    0.55   0.9416    5.25
                                          0.675    40.63     0.9332      0.9008    0.22   0.9402     4.5
                   Bidirectional LSTM
                                           0.7     32.09     0.9196      0.8626    0.0    0.9262    2.25
                                          0.725    24.12     0.9286      0.8931    0.22    0.936    2.75
                                          0.75     20.10     0.9293      0.9237    0.66    0.938     3.5


       by the number of metrics (these metrics are de-         4. Experiments and results
       fined in Section 4.3). We rank the scores in de-
       scending order for each metric between 6 and 1 (6       The code for the experimental part is done using Google
       is the number of candidate similarity thresholds ).     Colab environment and is available via this link 1
       We give 6 for the best metric score for a specific
       𝑆𝑇 and 1 for the worst metric score.                    4.1. Dataset description
       Table 1 show the result of six random forest mod-
       els and six Bidirectional LSTM models. Each             We used the SMS Spam Collection dataset [33] for the
       model was trained using a different augmented           experimental part. The dataset has 5574 SMS messages
       training set according to the candidate similarity      distributed as the following: 4825 Ham messages with a
       thresholds specified above. The best 𝑅 for each         percentage of 86.6 and 747 Spam messages with a percent-
       model is used in the experimental part.                 age of 13.4, which means that the SMS Spam Collection
     • Text augmentation: After specifying the most            dataset has an imbalanced ratio of 𝐼𝑅 = 6.46. 𝐼𝑅 [34]
       similar synonyms for each word in a given mes-          for binary classification problems is computed by Eq 2.
       sage, text augmentation is done by generating all                                 −
                                                                                        𝐶𝑠𝑖𝑧𝑒
       the possible combinations of the word-synonym                            𝐼𝑅 =     + , 𝑤ℎ𝑒𝑟𝑒 𝐼𝑅 ≥ 1                  (2)
       replacement and applying the replacement to the                                  𝐶𝑠𝑖𝑧𝑒
       original message, where each replacement gen-           where, 𝐶𝑠𝑖𝑧𝑒
                                                                         −
                                                                              and 𝐶𝑠𝑖𝑧𝑒
                                                                                    +
                                                                                         represents majority and minority
       erates a new message semantically similar to the        class sizes respectively.
       original message. Table 2 shows an example of
       the TAMS approach with a threshold 𝑆𝑇 = 0.6,
       where three new messages are generated.
                                                               4.2. Train test split
                                                               In order to to make the evaluation process accurate and
   Table 2                                                     realistic, the dataset was split in a stratified way, i.e., the
   Example of text augmentation with a message.                spam messages percentage 𝑆𝑝 and the ham messages
                                                               parentage 𝐻𝑝 are approximately the same in both the
   Original text        Defer admission till next year         training set and the testing set. The train set and test
                      Postpone admission till next year        set statistics are shown in Table 3. The test set contains
  Augmented text       Defer admittance till next year         1034 messages with a percentage of 20% from the original
                     Postpone admittance till next year        dataset, while the train set contains 4135 messages.

     • Expending the training set size: Once the text          4.3. Evaluation Metrics
       augmentation is done, we append the generated
       textual data with the original training set and use     Confusion matrix, Matthews Correlation Coefficient,
       the extended training set to train the classifiers.     Spam Caught, Blocked Hams, and F1 score are used to
                                                               1
                                                                   https://colab.research.google.com/drive/
                                                                   12LGx9j6OtEIadERJXJtmqaBaUaqVjz9S?usp=sharing
Table 3                                                               4.3.4. Spam Caught
The train set and test set statistics.
                                                                      is equivalent to the True Positive Rate (TPR) or Recall,
          Set        𝑆𝑝 %      𝐻𝑝 %        number of SMS              and it means the number of the spam messages which
       Training     12.6 %     87.4 %            4135                 are detected by the spam filter over the number of all
       Testing      12.7 %     87.3 %            1034                 the spam messages,i.e. it is the measure of correctly
                                                                      identifying True Positives by the model, and it is defined
                                                                      by the Eq. 7
evaluate and compare the proposed text augmentation                                                 𝑇𝑃
method and measure the performance of SMS spam fil-                                       𝑆𝐶 =                                  (7)
                                                                                                  𝑇𝑃 + 𝐹𝑁
ters.
                                                                      4.3.5. Blocked Hams
Table 4
Confusion Matrix.                                                     is equivalent to the False Positive Rate (FPR). A low score
                                                                      close to 0 is preferred as it reflects that we have few false
                     C+                          C−
                                                                      predictions. Furthermore, FPR is the number of legitimate
     C+      True Positives (TP)         False Negatives (FN)         messages which are classified as spam by the spam filter
     C−      False Positives (FP)        True Negatives (TN)          over the number of all the legitimate messages, and it is
                                                                      defined in Eq. 8
                                                                                                    𝐹𝑃
4.3.1. Confusion Matrix                                                                  𝐵𝐻 =                                   (8)
                                                                                                  𝐹𝑃 + 𝑇𝑁
is a technique for summarizing the prediction results of a
classification model, Table. 4 well defines the Confusion             4.4. Experimental results
Matrix (CM) for binary classification problems.
                                                                      4.4.1. Random Forest

4.3.2. F1 score                                                       In this experiment, a random forest classifier
                                                                      with its hyperparameters: 𝑛_𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑜𝑟𝑠 = 200,
is Precision-Recall trade-Off i.e. it combines the precision          𝑚𝑖𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠_𝑠𝑝𝑙𝑖𝑡 = 20, 𝑚𝑎𝑥_𝑓 𝑒𝑎𝑡𝑢𝑟𝑒𝑠 = 25, and
and recall metrics into a single metric, and it is calculated         𝐺𝑖𝑛𝑖 criterion is used to implement the spam filter.
using Eq. 3, F1 score has been designed to work well on                  The first part of the experiment depends only on the
imbalanced data.                                                      original data i.e. the imbalanced data, Fig. 4a shows the
                    𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑅𝑒𝑐𝑎𝑙𝑙                               confusion matrix which summaries the trained model’s
                𝐹1 = 2 ×                                        (3)   predictions on the test set, and Fig. 4b shows the results
                    𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
                                                                      of the 10-folds cross-validation (10-CV) applied on the
  Where 𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 is computed by Eq 4:                               original training set, while Table 5 shows the resulting
                                      𝑇𝑃                              evaluation measures based on the test set.
                  𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =                                  (4)
                                    𝑇𝑃 + 𝐹𝑃
                                                                      Figure 4: Confusion matrices of RF trained on original train-
While the 𝑅𝑒𝑐𝑎𝑙𝑙 is computed by Eq 5                                  ing data.

                                  𝑇𝑃
                    𝑅𝑒𝑐𝑎𝑙𝑙 =                                    (5)
                                𝑇𝑃 + 𝐹𝑁

4.3.3. Matthews Correlation Coefficient
is used to measure to the quality of the binary classifica-
tions, and this measure takes values in the range [-1,+1],
where +1 means a perfect predication, 0 indicates the
random prediction and -1 means an inverse prediction, (a) confusion matrix of RF                       (b) CM of RF model with 10-
and it is given by Eq. 6                                    model tested on the test                       CV applied on the train-
                                                                         data.                             ing data.
                (𝑇 𝑃 × 𝑇 𝑁 ) − (𝐹 𝑃 × 𝐹 𝑁 )
         𝑀 𝐶𝐶 =            √                                    (6)
                             𝑊                    In the second part of this experiment, a random for-
Where: 𝑊 = (𝑇 𝑃 + 𝐹 𝑃 ) × (𝑇 𝑃 + 𝐹 𝑁 ) × (𝑇 𝑁 + est model with the same hyperparameters is trained us-
𝐹 𝑃 ) × (𝑇 𝑁 + 𝐹 𝑁 ).                           ing the augmented data based on the proposed TAMS.
As previously discussed, the optimal threshold 𝑆𝑇 is Figure 6: Confusion matrices of two BiLSTM models, the first
determined based on Table 1. Therefore, we choose model is trained using the original data, and the second model
𝑆𝑇 = 0.625 as it has the highest rank 𝑅. Results of using the augmented data.
TAMS-RF are displayed in Fig. 5a, which shows the confu-
sion matrix obtained using the test set solely, and Fig. 5b
shows the results of the 10-folds cross-validation applied
on the augmented training set.

Figure 5: Confusion matrices of TAMS-RF


                                                            (a) CM of BiLSTM model          (b) CM of BiLSTM model
                                                                trained using the none-         trained by augmented
                                                                augmented dataset.              data according to TAMS


                                                                The confusion matrix in Fig. 6b shows a decrease in
                                                             false predictions as 𝐹 𝑃 = 4, and 𝐹 𝑁 = 11. While
(a) confusion matrix of            (b) CM of TAMS-RF model
                                                             Table 5 shows that the TAMS-BiLSTM model overcomes
    TAMS-RF model tested               with 10-CV applied on
    on the test data.                  the training data.
                                                             BiLSTM according to all the suggested evaluation metrics,
                                                             and the improvements are as follows: MCC by 2%, Sc by
                                                             1.7 %, BH by 33%, F1-score by 1.76%.
   As Fig 4, Fig 5, and Table 5 shows, the model trained
                                                                Both experiments proved that augmenting the training
on the augmented data using the TAMS approach (TAMS-
                                                             set by using the proposed TAMS approach helped the
RF) performs better than the model trained on the original
                                                             classification models to increase their ability to detect
data solely in most used evaluation metrics. TAMS im-
                                                             spam messages and not get biased toward the majority
proved the MCC by 12.36%, SC by 30.96%, and 𝐹1 -score
                                                             class and that synthesizing new data from the existing
by 13.67%, which means that the proposed text augmenta-
                                                             ones is indeed an excellent alternative to data collection
tion improved spam detection meaning that we reduced
                                                             and annotation.
the bias of RF toward the Ham class and improved the
generalization ability of the models. We can conclude
that TAMS generated data that has new linguistic pat- 5. Conclusion
terns that are pertinent to the task and have not yet been
seen in pre-training. However, the trained RF with the Nowadays, the spam filtering task is still a real challenge
original data has a better BH score than TAMS-RF, but because most of the available datasets are imbalanced.
that does not mean it is better than the second one. On Dealing with such non-standard derivative datasets is a
the contrary, it means that the first model is biased to- common problem in classification tasks, especially in the
ward the ham class and has fewer spam predictions, i.e.; spam filtering case. We proposed TAMS, a text augmen-
it could not detect the spam messages properly.              tation based on the most similar synonyms replacement
                                                             to enhance the quality of supervised learning models and
4.4.2. Bidirectional LSTM                                    solve the spam filtering problem. Experimental results
                                                             showed that generating additional samples from the ex-
In this experiment, a Bidirectional LSTM model with isting ones using TAMS added new linguistic patterns
𝐴𝑑𝑎𝑚 optimizer and 𝑆𝑖𝑔𝑚𝑜𝑖𝑑 activation function at pertinent to the task and helped in improving the clas-
the output layer is used to implement the spam filter. It sification performance of traditional classifiers like the
was trained with a Batch size of 10 for ten epochs. f        random forest and deep learning models like the Bidirec-
   Similarly to RF, the first part of the experiment depends tional LSTM. We can deduce that TAMS increased the
only on the original data, Fig. 6a shows the confusion ability of both used models to identify spam messages,
matrix with 𝐹 𝑃 = 6 and 𝐹 𝑁 = 13, and Table 5 shows reduce the bias toward the majority class, and improve
the resulting values of evaluation metrics. While in the the trained models’ generalization ability.
second part, a BiLSTM model with the same structure is          In future work, we aim to improve TAMS further and
trained using the augmented data generated by TAMS. enhance the quality of its generated data. Furthermore,
For TAMS-BiLSTM, we used 𝑆𝑇 = 0.65 as it has the we have to test our method on other textual datasets
highest rank 𝑅 (see Table 1).                                and compare it with other text augmentation methods
                                                             to ensure that the proposed model is generic and not
Table 5
Summary of the experimental results.

                        Classification model   Train set     MCC        SC        BH       𝐹1
                                                 OD          0.7697   0.6412     0.22     0.7742
                          Random forest
                                                TAMS         0.8648   0.8397     0.99      0.88
                                                 OD          0.9155   0.9007     0.66     0.9255
                        Bidirectional LSTM
                                                TAMS         0.9334   0.916      0.44     0.9418


specific to spam filtering exclusively.                            ing – IDEAL 2020, Springer International Publish-
                                                                   ing, Cham, 2020, pp. 54–65.
                                                               [9] Z. Farou, S. Ouaari, B. Domian, T. Horváth, Directed
Acknowledgments                                                    undersampling using active learning for particle
                                                                   identification, in: Recent Innovations in Comput-
This research is supported by the ÚNKP-21-3 New Na-
                                                                   ing, Springer, 2022, pp. 149–162.
tional Excellence Program of the Ministry for Innovation
                                                              [10] X. Bai, Y. Hu, P. Zhou, F. Shang, S. Shen, Data aug-
and Technology from the source of the National Research,
                                                                   mentation imbalance for imbalanced attribute clas-
Development and Innovation Fund.
                                                                   sification, arXiv preprint arXiv:2004.13628 (2020).
                                                              [11] W. Zhang, X. Li, X.-D. Jia, H. Ma, Z. Luo, X. Li, Ma-
References                                                         chinery fault diagnosis with imbalanced data using
                                                                   deep generative adversarial networks, Measure-
 [1] A. Fernández, S. García, M. Galar, R. C. Prati,               ment 152 (2020) 107377.
     B. Krawczyk, F. Herrera, Learning from imbalanced        [12] W. Hao, F. Liu, Imbalanced data fault diagnosis
     data sets, volume 11, Springer, 2018.                         based on an evolutionary online sequential extreme
 [2] F. Thabtah, S. Hammoud, F. Kamalov, A. Gonsalves,             learning machine, Symmetry 12 (2020) 1204.
     Data imbalance in classification: Experimental eval-     [13] Y. Liu, H. T. Loh, A. Sun, Imbalanced text classifica-
     uation, Information Sciences 513 (2020) 429–441.              tion: A term weighting approach, Expert systems
 [3] N. Malave, A. V. Nimkar, A survey on effects of               with Applications 36 (2009) 690–701.
     class imbalance in data pre-processing stage of clas-    [14] J. Jang, Y. Kim, K. Choi, S. Suh, Sequential tar-
     sification problem, International Journal of Com-             geting: an incremental learning approach for data
     putational Systems Engineering 6 (2020) 63–75.                imbalance in text classification, arXiv preprint
 [4] Q. Chen, A. Zhang, T. Huang, Q. He, Y. Song, Im-              arXiv:2011.10216 (2020).
     balanced dataset-based echo state networks for           [15] T. B. Shahi, A. Yadav, et al., Mobile sms spam filter-
     anomaly detection, Neural Computing and Ap-                   ing for nepali text using naïve bayesian and support
     plications 32 (2020) 3685–3694.                               vector machine, International Journal of Intelli-
 [5] P. Bermejo, J. A. Gámez, J. M. Puerta, Improving the          gence Science 4 (2014) 24–28.
     performance of naive bayes multinomial in e-mail         [16] S. Mohammed, O. Mohammed, J. Fiaidhi, S. Fong,
     foldering by introducing distribution-based balance           T. H. Kim, Classifying unsolicited bulk email (ube)
     of datasets, Expert Systems with Applications 38              using python machine learning techniques, Inter-
     (2011) 2072–2080.                                             national Journal of Hybrid Information Technology
 [6] D. Gan, J. Shen, B. An, M. Xu, N. Liu, Integrating            6 (2013) 43–56.
     tanbn with cost sensitive classification algorithm       [17] V. K. Singh, S. Bhardwaj, Spam mail detection us-
     for imbalanced data in medical diagnosis, Comput-             ing classification techniques and global training
     ers & Industrial Engineering 140 (2020) 106266.               set, in: Intelligent Computing and Information and
 [7] M. Kinal, M. Woźniak, Data preprocessing for des-             Communication, Springer, 2018, pp. 623–632.
     knn and its application to imbalanced medical data       [18] U. K. Sah, N. Parmar, An approach for malicious
     classification, in: Asian Conference on Intelligent           spam detection in email with comparison of differ-
     Information and Database Systems, Springer, 2020,             ent classifiers, International Research Journal of
     pp. 589–599.                                                  Engineering and Technology (IRJET) 4 (2017) 2238–
 [8] Z. Farou, N. Mouhoub, T. Horváth, Data gen-                   2242.
     eration using gene expression generator, in:             [19] N. Jalal, A. Mehmood, G. S. Choi, I. Ashraf, A novel
     C. Analide, P. Novais, D. Camacho, H. Yin (Eds.),             improved random forest for text classification us-
     Intelligent Data Engineering and Automated Learn-             ing feature ranking and optimal number of trees,
     Journal of King Saud University-Computer and In-         classification in r, Knowledge-Based Systems 161
     formation Sciences (2022).                               (2018) 329–341.
[20] C. Mi, L. Xie, Y. Zhang, Improving data augmen-
     tation for low resource speech-to-text translation
     with diverse paraphrasing, Neural Networks 148
     (2022) 194–205.
[21] M. Kim, P. Kang, Text embedding augmentation
     based on retraining with pseudo-labeled adversarial
     embedding, IEEE Access 10 (2022) 8363–8376.
[22] S. Bonthu, A. Dayal, M. Lakshmi, S. Rama Sree,
     Effective text augmentation strategy for nlp models,
     in: Proceedings of Third International Conference
     on Sustainable Computing, Springer, 2022, pp. 521–
     531.
[23] C. Coulombe, Text data augmentation made
     simple by leveraging NLP cloud apis, CoRR
     abs/1812.04718 (2018). URL: http://arxiv.org/abs/
     1812.04718. arXiv:1812.04718.
[24] S. Qiu, B. Xu, J. Zhang, Y. Wang, X. Shen, G. de Melo,
     C. Long, X. Li, EasyAug: An Automatic Textual Data
     Augmentation Platform for Classification Tasks, As-
     sociation for Computing Machinery, New York, NY,
     USA, 2020, p. 249–252. URL: https://doi.org/10.1145/
     3366424.3383552.
[25] Z. Feng, H. Zhou, Z. Zhu, K. Mao, Tailored text aug-
     mentation for sentiment analysis, Expert Systems
     with Applications (2022) 117605.
[26] R. Xiang, E. Chersoni, Q. Lu, C.-R. Huang, W. Li,
     Y. Long, Lexical data augmentation for sentiment
     analysis, Journal of the Association for Information
     Science and Technology 72 (2021) 1432–1447.
[27] D. T. Vu, G. Yu, C. Lee, J. Kim, Text data augmenta-
     tion for the korean language, Applied Sciences 12
     (2022) 3425.
[28] A. Taan, Z. Farou, Supervised learning methods for
     skin segmentation based on pixel color classifica-
     tion, Central-European Journal of New Technolo-
     gies in Research, Education and Practice (2021).
[29] L. Breiman, Random forests, Machine learning 45
     (2001) 5–32.
[30] Z. Huang, W. Xu, K. Yu, Bidirectional lstm-crf
     models for sequence tagging, arXiv preprint
     arXiv:1508.01991 (2015).
[31] K. W. Church, Word2vec, Natural Language Engi-
     neering 23 (2017) 155–162.
[32] C. Fellbaum, Wordnet, in: Theory and applications
     of ontology: computer applications, Springer, 2010,
     pp. 231–243.
[33] J. M. G. Hidalgo, T. A. Almeida, A. Yamakami, On
     the validity of a new sms spam collection, in: 2012
     11th International Conference on Machine Learning
     and Applications, volume 2, IEEE, 2012, pp. 240–
     245.
[34] I. Cordón, S. García, A. Fernández, F. Herrera, Im-
     balance: oversampling algorithms for imbalanced

</pre>