ITCG’s Participation at MEX-A3T 2020: Aggressive
Identification and Fake News Detection Based on
Textual Features for Mexican Spanish
Diego Zaizar-Gutiérreza , Daniel Fajardo-Delgadoa and Miguel
Á. Álvarez-Carmonab,c
a
  Tecnológico Nacional de México / Campus Ciudad Guzmán, Mexico
b
  Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Mexico
c
  Consejo Nacional de Ciencia y Tecnología (CONACYT), Mexico


Abstract
This paper describes our approach to aggressiveness identification and fake news classification in the 2020 MEX-A3T shared task. Both tasks are posed as binary classification problems (aggressive vs. non-aggressive, and fake vs. non-fake news). We approached them using simple, basic methods of feature selection and term weighting, and trained a set of machine learning algorithms. Our best run for aggressiveness identification achieved an accuracy of 0.81, where the best result obtained 0.88. For fake news detection, our accuracy was 0.78, where the best result was 0.85.

Keywords
Aggressiveness Identification, Fake News Classification, Natural Language Processing




1. Introduction
Nowadays, technology plays a significant role in the way people communicate with each other, giving rise to new services such as social networks. Social networks present several challenges to keeping communication channels open to the free sharing of ideas. For example, it is easy for some people to share aggressive speech, affecting the experience of other users or of people interested in becoming part of the communities and their conversations.
   The number of messages sent daily makes the moderation of communication channels difficult to handle by conventional means, and as people increasingly communicate online, the need for automated abusive language classifiers becomes much more pressing [1].
   Another drawback of the free flow of ideas on social networks is the sharing of news without any entity that verifies the veracity of the information. In [2], fake news is defined as fabricated information that mimics news media content in form but not in organizational process or intent.


Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: diego15460155@itcg.edu.mx (D. Zaizar-Gutiérrez); dfajardo@itcg.edu.mx (D. Fajardo-Delgado);
malvarezc@cicese.mx (M.Á. Álvarez-Carmona)
orcid: 0000-0002-4486-2003 (D. Zaizar-Gutiérrez); 0000-0001-8215-5927 (D. Fajardo-Delgado);
0000-0003-4421-5575 (M.Á. Álvarez-Carmona)
                                       © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   The problem is that fake news overlaps with other information environments, such as misinformation (false or misleading information), disinformation (false information that is purposely spread to deceive people), and real news [2]. This makes discerning between real and fake news a difficult task. For this reason, it is essential to design and develop methods capable of classifying such news.
   One of the goals of the third edition of MEX-A3T [3] is to tackle these problems and further the research on this critical NLP task: the detection of aggressive tweets and fake news in Mexican Spanish. In this work, we evaluate previously proposed strategies, such as binary, TF, and TF-IDF weighted representations, combined with different preprocessing approaches that measure the importance of stop-word removal and stemming, to observe the efficacy of these simple textual representations.


2. State of the art
MEX-A3T is an evaluation forum with natural language processing tasks. MEX-A3T 2020 is the third edition; the first one was held in 2018 [3].
   The 2018 edition of the MEX-A3T shared task represented the first attempt to organize an evaluation forum for the analysis of social media content in Mexican Spanish. Participants proposed a variety of methods, comprising content-based features (bag of words, word n-grams, term vectors, dictionary words), stylistic features (frequencies, punctuation, POS, Twitter-specific elements, slang words), and approaches based on neural networks (CNN, LSTM, and others) [4].
   For the first edition, the organizers proposed two tasks: author profiling and aggressiveness identification. In both tasks, the baseline results were outperformed by most participants. For author profiling, the best results were obtained with an approach that emphasized the value of personal information for building the text representation. In the case of aggressiveness identification, the winning team proposed an approach based on MicroTC and EvoMSA. MicroTC is a minimalistic text classifier independent of domain and language. EvoMSA is another text classifier that combines models (such as MicroTC) with genetic programming [4].
   For the second edition, the author profiling task added image information for each profile in the collection, whereas the aggressiveness identification task remained unchanged [5].
   The participants proposed a variety of methodologies in the 2019 edition, from traditional
supervised methods to deep learning approaches. For author profiling, the best results were
obtained with an approach based on dimensionality reduction in text. However, their results
did not overcome the best results from the 2018 edition. For aggressiveness identification,
the top-ranked approach proposed two main kinds of features: character n-grams and word
embeddings. Their results matched those of the previous year's winner while employing a simpler approach [3].
   For the 2020 edition, the proposed tasks are aggressiveness identification (for this year, the organizers re-labeled the collection) and fake news detection. Again, these tasks are proposed for Mexican Spanish [3, 6].
   Since the simple approaches gave good results in the last edition, we propose to apply basic methods of feature selection and term weighting to observe the scope of these simple but effective approaches in various tasks.


3. Methodology
Our methodology consists of three phases: data preprocessing, word weighting, and classification. Preprocessing comprises techniques to prepare the data sets by removing elements that do not provide meaningful information for building the classification models. In this phase, we first transform all letters into lowercase and then tokenize the text on non-letter separators. After that, we remove stop words such as “el”, “la”, “los”, and “con”, which do not provide valuable information in a sentence (commonly, words with fewer than two letters). Additionally, to discard most alignment errors and dismiss most of the less frequent words (which would cause many false hits), we also delete all words that appear fewer than five times. Finally, we perform stemming of common words that could be meaningful for data interpretation. To test the effect of the last two steps (stemming and stop-word deletion) on the performance of the proposed models, we combined them in different ways throughout the experiments. We used the Natural Language Toolkit (NLTK) for tokenization and stop-word deletion, and the Snowball stemmer for the stemming step.
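The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the stop-word set here is a small hand-picked subset (the paper uses NLTK's full Spanish list), and the function name `preprocess` is hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word subset; the paper uses NLTK's full Spanish list.
STOP_WORDS = {"el", "la", "los", "las", "con", "de", "en", "es", "y"}

def preprocess(texts, remove_stopwords=True, stem=True, min_freq=5):
    """Lowercase, tokenize on non-letter separators, optionally drop stop
    words and stem, then delete words occurring fewer than min_freq times
    across the whole collection."""
    stemmer = None
    if stem:
        # Snowball stemmer for Spanish, as in the paper (requires nltk).
        from nltk.stem.snowball import SnowballStemmer
        stemmer = SnowballStemmer("spanish")
    tokenized = []
    for text in texts:
        tokens = [t for t in re.split(r"[^a-záéíóúüñ]+", text.lower()) if t]
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        if stemmer is not None:
            tokens = [stemmer.stem(t) for t in tokens]
        tokenized.append(tokens)
    # Keep only words seen at least min_freq times in the corpus.
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_freq] for doc in tokenized]
```

The corpus-level frequency filter is what removes most misspellings, abbreviations, and emoticons, since such tokens rarely repeat five or more times.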
   The word-weighting phase measures the relevance of each word with respect to the others. We used the following three feature-weighting methods: binary occurrence (BO) [7], term frequency (TF) [8], and term frequency-inverse document frequency (TF-IDF) [9]. Binary occurrence, also called Boolean weighting, is a basic technique that represents a word's presence using only two values: 0 and 1. We used this technique as a first approach because it is elementary and easy to implement. On the other hand, the TF approach weighs each feature based on the frequency with which a word (or term) appears. Finally, the TF-IDF approach assigns a lower weight to a term that appears in many documents than to one occurring in few documents. Both the TF and TF-IDF feature-weighting methods were implemented via scikit-learn, a Python library for data mining and data analysis.
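A minimal sketch of the three weighting schemes with scikit-learn, on a hypothetical toy corpus (the paper does not publish its exact vectorizer settings; note that the `binary=True` flag of `CountVectorizer` reproduces the BO scheme):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; the real input is the preprocessed tweet/news text.
corpus = ["esta noticia es falsa", "esta noticia es real real"]

# Binary occurrence (BO): 1 if the term appears in the document, 0 otherwise.
bo_vec = CountVectorizer(binary=True)
bo = bo_vec.fit_transform(corpus)

# Term frequency (TF): raw count of each term per document.
tf_vec = CountVectorizer()
tf = tf_vec.fit_transform(corpus)

# TF-IDF: counts re-weighted so that terms occurring in many documents
# receive a lower weight than terms occurring in few documents.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)
```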
   After the data was cleaned and weighted, we used well-known learning algorithms to build
the classification models for the data sets. The learning algorithms we used were: support
vector machine (SVM) [10] [11] [12], naive Bayes (NB), k-nearest neighbors (KNN) [13], and
classification and regression trees (CART). We used the implementation of all these algorithms
included in scikit-learn.
   Finally, for each of the selected learning algorithms, we trained our models using 10-fold cross-validation, where each fold contains around 700 sentences. We present our results in the next section.
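The training setup can be sketched as follows. The texts and labels are hypothetical stand-ins for the MEX-A3T collections, and the hyperparameters are scikit-learn defaults rather than the exact ones used in the experiments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 20 "aggressive" and 20 "non-aggressive" texts.
docs = ["texto agresivo ofensivo"] * 20 + ["texto amable neutral"] * 20
labels = [1] * 20 + [0] * 20

X = TfidfVectorizer().fit_transform(docs)

# The four families of learning algorithms used in the paper.
models = {
    "SVM": SVC(),
    "NB": MultinomialNB(),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "CART": DecisionTreeClassifier(),
}

# 10-fold cross-validation, as in the paper (each real fold holds ~700 sentences).
for name, model in models.items():
    scores = cross_val_score(model, X, labels, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```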


4. Results and discussions
Table 1 shows the results for the fake news data set [6] obtained by combining the proposed classification models, feature-weighting methods, and preprocessing techniques. Concerning the classification models, we achieved the best result with the SVM model, followed by the decision tree and the KNN (with 𝑘 = 3) models. Regarding the feature-weighting methods, it is not easy to see which one provides better results. However, the poor results obtained by a particular combination, the BO method with the stemming technique and the stop words included, are noteworthy. We also note that the use of stemming and the deletion of stop words do not have a significant impact on the results in most cases. The best result, an accuracy of 83%, is achieved by the SVM model with the TF method and the stemming technique, without removing the stop words.
   On the other hand, Table 2 shows the results for the aggressiveness data set using the same combinations as in Table 1. Given these combinations, the SVM model always provides the best results. In this regard, the decision tree and the KNN with 𝑘 = 5 obtained the second and third places. Concerning the feature-weighting methods, we observe that the TF and TF-IDF methods significantly outperform BO in all cases. Using TF-IDF, the SVM model achieves an accuracy of 84% when the stop words are deleted. For this best result, the stemming technique does not have a significant effect.
   Note that the aggressiveness results table contains rows with only zeros except in the accuracy column; this is because the model could not learn properly and always predicted the majority class. The results shown in both tables correspond only to class 1, as it is the most important class.
   We decided to delete words that appear fewer than five times because this setting delivered the best results in our tests. In doing so, we also handled spelling mistakes, abbreviations, emoticons, and similar tokens, since most of them do not repeat more than five times.


5. Conclusions
In this work, we address the tasks of fake news detection and aggressiveness identification for the 2020 MEX-A3T contest. For this aim, we built classification models using well-known machine learning algorithms such as SVM, NB, KNN, and decision trees. We conducted a comparative experimental procedure to study the impact of the proposed models using two data preprocessing techniques (stop-word removal and stemming) and three feature-weighting methods (BO, TF, and TF-IDF). Experimental results indicate that, in general, the best classification model is SVM. They also show that the efficiency of the classification models was mainly influenced by the combination of data preprocessing methods rather than by the feature-weighting methods. We noticed that processing all the sentences took a long time, so we decided to use 10-fold cross-validation to reduce the processing time. A limitation of this work is that the applied methods rely solely on a bag-of-words approach and not on other textual representations. An interesting prospect for future work is to explore more advanced techniques, which hopefully will allow us to achieve a better place in the next MEX-A3T contests.


References
 [1] M. Casavantes, R. López, L. C. González, UACH at MEX-A3T 2019: Preliminary results on
     detecting aggressive tweets by adding author information via an unsupervised strategy, in:
     Proceedings of the First Workshop for Iberian Languages Evaluation Forum (IberLEF 2019),
     CEUR Workshop Proceedings, 2019.



 [2] D. M. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J.
     Metzger, B. Nyhan, G. Pennycook, D. Rothschild, et al., The science of fake news, Science
     359 (2018) 1094–1096.
 [3] M. E. Aragón, H. Jarquín, M. Montes-y Gómez, H. J. Escalante, L. Villaseñor-Pineda,
     H. Gómez-Adorno, G. Bel-Enguix, J.-P. Posadas-Durán, Overview of MEX-A3T at IberLEF
     2020: Fake news and aggressiveness analysis in Mexican Spanish, in: Notebook Papers of
     2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain,
     September 2020.
 [4] M. Á. Álvarez-Carmona, E. Guzmán-Falcón, M. Montes-y Gómez, H. J. Escalante,
     L. Villasenor-Pineda, V. Reyes-Meza, A. Rico-Sulayes, Overview of MEX-A3T at IberEval
     2018: Authorship and aggressiveness analysis in Mexican Spanish tweets, in: Notebook
     Papers of 3rd SEPLN Workshop on Evaluation of Human Language Technologies for
     Iberian Languages (IberEval), Seville, Spain, volume 6, 2018.
 [5] M. E. Aragón, M. Á. Á. Carmona, M. Montes-y Gómez, H. J. Escalante, L. V. Pineda,
     D. Moctezuma, Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness
     analysis in Mexican Spanish tweets, in: IberLEF@SEPLN, 2019, pp. 478–494.
 [6] J.-P. Posadas-Durán, H. Gómez-Adorno, G. Sidorov, J. J. M. Escobar, Detection of fake
     news in a new corpus for the Spanish language, Journal of Intelligent & Fuzzy Systems 36
     (2019) 4869–4876.
 [7] H. Wang, P. Yin, J. Yao, J. N. Liu, Text feature selection for sentiment classification of
     Chinese online reviews, Journal of Experimental & Theoretical Artificial Intelligence 25
     (2013) 425–439. doi:10.1080/0952813X.2012.721139.
 [8] Q. Ye, Z. Zhang, R. Law, Sentiment classification of online reviews to travel destinations
     by supervised machine learning approaches, Expert Systems with Applications 36 (2009)
     6527–6535. doi:10.1016/j.eswa.2008.07.035.
 [9] F. Bravo-Marquez, M. Mendoza, B. Poblete, Meta-level sentiment models for big social
     data analysis, Knowledge-Based Systems 69 (2014) 86–99. doi:10.1016/j.knosys.2014.05.016.
[10] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers,
     in: Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp.
     144–152.
[11] V. Vapnik, The nature of statistical learning theory, Springer science & business media,
     2013.
[12] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (1995) 273–297.
[13] L. E. Peterson, K-nearest neighbor, Scholarpedia 4 (2009) 1883.
     doi:10.4249/scholarpedia.1883, revision #137311.




Table 1
Results for the data set of fake news by using different classifier models, feature-weighting methods,
and preprocessing techniques.
                    Model        Weighting   Stopwords   Stemming   Precision   Recall   F1-score   Accuracy

                     SVM           BO       Without       No         0.84       0.75      0.80     0.81
                       NB           BO       Without       No         0.56       0.85      0.67     0.59
                  KNN (K=5)        BO       Without       No         0.52       0.34      0.41     0.81
                  KNN (K=3)        BO       Without       No         0.52       0.34      0.41     0.81
                  KNN (K=1)        BO       Without       No         0.52       0.35      0.42     0.51
                 Decision Tree     BO       Without       No         0.66       0.62      0.64     0.65

                     SVM           BO         With        No         0.49       0.49      0.49     0.49
                      NB           BO         With        No         0.49       0.27      0.35     0.50
                  KNN (K=5)        BO         With        No         0.49       0.27      0.35     0.50
                  KNN (K=3)        BO         With        No         0.50       0.28      0.36     0.50
                  KNN (K=1)        BO         With        No         0.48       0.46      0.47     0.48
                 Decision Tree     BO         With        No         0.48       0.43      0.45     0.48

                     SVM           BO       Without       Yes        0.81       0.74      0.77     0.78
                       NB           BO       Without       Yes        0.55       0.80      0.65     0.58
                  KNN (K=5)        BO       Without       Yes        0.10       0.01      0.02     0.51
                  KNN (K=3)        BO       Without       Yes        1.00       0.01      0.03     0.51
                   KNN (K=1)        BO       Without       Yes        0.50       0.33      0.39     0.50
                 Decision Tree     BO       Without       Yes        0.67       0.62      0.65     0.66

                     SVM           BO         With        Yes        0.79       0.74      0.77     0.77
                       NB           BO         With        Yes        0.55       0.80      0.65     0.58
                  KNN (K=5)        BO         With        Yes        1.00       0.01      0.02     0.51
                  KNN (K=3)        BO         With        Yes        1.00       0.02      0.03     0.51
                  KNN (K=1)        BO         With        Yes        0.49       0.32      0.39     0.49
                 Decision Tree     BO         With        Yes        0.67       0.64      0.66     0.66

                     SVM           TF       Without       No         0.82       0.81      0.82     0.82
                      NB           TF       Without       No         0.50       0.66      0.57     0.50
                  KNN (K=5)        TF       Without       No         0.82       0.62      0.70     0.74
                  KNN (K=3)        TF       Without       No         0.81       0.65      0.72     0.75
                  KNN (K=1)        TF       Without       No         0.73       0.63      0.67     0.70
                 Decision Tree     TF       Without       No         0.70       0.68      0.69     0.69

                     SVM           TF         With        No         0.83       0.81      0.82     0.82
                      NB           TF         With        No         0.51       0.66      0.57     0.51
                  KNN (K=5)        TF         With        No         0.83       0.61      0.70     0.74
                  KNN (K=3)        TF         With        No         0.80       0.64      0.71     0.74
                  KNN (K=1)        TF         With        No         0.73       0.63      0.68     0.70
                 Decision Tree     TF         With        No         0.67       0.69      0.68     0.67

                     SVM           TF       Without       Yes        0.86       0.74      0.80     0.81
                      NB           TF       Without       Yes        0.52       0.66      0.58     0.52
                  KNN (K=5)        TF       Without       Yes        0.94       0.22      0.35     0.60
                  KNN (K=3)        TF       Without       Yes        0.92       0.24      0.38     0.61
                  KNN (K=1)        TF       Without       Yes        0.73       0.27      0.39     0.58
                 Decision Tree     TF       Without       Yes        0.64       0.64      0.64     0.64

                     SVM           TF         With        Yes        0.85       0.81      0.83     0.83
                      NB           TF         With        Yes        0.52       0.67      0.58     0.52
                  KNN (K=5)        TF         With        Yes        0.79       0.62      0.69     0.73
                  KNN (K=3)        TF         With        Yes        0.78       0.64      0.70     0.73
                  KNN (K=1)        TF         With        Yes        0.73       0.64      0.68     0.70
                 Decision Tree     TF         With        Yes        0.68       0.68      0.68     0.68

                     SVM          TF-IDF    Without       No         0.82       0.69      0.75     0.77
                      NB          TF-IDF    Without       No         0.50       0.66      0.57     0.49
                  KNN (K=5)       TF-IDF    Without       No         0.58       0.09      0.16     0.51
                  KNN (K=3)       TF-IDF    Without       No         0.49       0.24      0.32     0.49
                  KNN (K=1)       TF-IDF    Without       No         0.47       0.75      0.58     0.45
                 Decision Tree    TF-IDF    Without       No         0.71       0.68      0.69     0.70

                     SVM          TF-IDF      With        No         0.82       0.81      0.81     0.81
                      NB          TF-IDF      With        No         0.50       0.66      0.57     0.50
                  KNN (K=5)       TF-IDF      With        No         0.66       0.55      0.60     0.63
                  KNN (K=3)       TF-IDF      With        No         0.62       0.56      0.59     0.74
                  KNN (K=1)       TF-IDF      With        No         0.50       0.59      0.54     0.50
                 Decision Tree    TF-IDF      With        No         0.72       0.71      0.72     0.72

                     SVM          TF-IDF    Without       Yes        0.80       0.72      0.76     0.77
                      NB          TF-IDF    Without       Yes        0.51       0.66      0.58     0.51
                  KNN (K=5)       TF-IDF    Without       Yes        0.57       0.05      0.09     0.51
                  KNN (K=3)       TF-IDF    Without       Yes        0.41       0.08      0.14     0.48
                  KNN (K=1)       TF-IDF    Without       Yes        0.43       0.22      0.29     0.47
                 Decision Tree    TF-IDF    Without       Yes        0.62       0.62      0.62     0.62

                     SVM          TF-IDF      With        Yes        0.80       0.78      0.79     0.79
                      NB          TF-IDF      With        Yes        0.51       0.67      0.58     0.52
                  KNN (K=5)       TF-IDF      With        Yes        0.81       0.28      0.41     0.61
                  KNN (K=3)       TF-IDF      With        Yes        0.71       0.31      0.43     0.59
                  KNN (K=1)       TF-IDF      With        Yes        0.56       0.39      0.46     0.54
                 Decision Tree    TF-IDF      With        Yes        0.64       0.66      0.65     0.64


Table 2
Results for the data set of aggressiveness using different classifier models, feature-weighting methods,
and preprocessing techniques
                    Model        Weighting   Stopwords   Stemming   Precision   Recall   F1-score   Accuracy

                     SVM           BO       Without       No         0.00       0.00      0.00     0.71
                      NB           BO       Without       No         0.29       0.90      0.44     0.33
                  KNN (K=5)        BO       Without       No         0.00       0.00      0.00     0.71
                  KNN (K=3)        BO       Without       No         0.29       0.40      0.34     0.54
                  KNN (K=1)        BO       Without       No         0.29       0.90      0.44     0.33
                 Decision Tree     BO       Without       No         0.00       0.00      0.00     0.71

                     SVM           BO         With        No         0.00       0.00      0.00     0.71
                      NB           BO         With        No         0.29       0.90      0.44     0.33
                  KNN (K=5)        BO         With        No         0.00       0.00      0.00     0.71
                  KNN (K=3)        BO         With        No         0.29       0.70      0.41     0.42
                  KNN (K=1)        BO         With        No         0.00       0.00      0.00     0.71
                 Decision Tree     BO         With        No         0.00       0.00      0.00     0.71

                     SVM           BO       Without       Yes        0.00       0.00      0.00     0.71
                      NB           BO       Without       Yes        0.29       0.90      0.44     0.33
                  KNN (K=5)        BO       Without       Yes        0.28       0.10      0.14     0.67
                  KNN (K=3)        BO       Without       Yes        0.29       0.40      0.34     0.54
                  KNN (K=1)        BO       Without       Yes        0.28       0.10      0.14     0.67
                 Decision Tree     BO       Without       Yes        0.00       0.00      0.00     0.71

                     SVM           BO         With        Yes        0.00       0.00      0.00     0.71
                      NB           BO         With        Yes        0.29       0.90      0.44     0.33
                  KNN (K=5)        BO         With        Yes        0.28       0.10      0.14     0.67
                  KNN (K=3)        BO         With        Yes        0.29       0.40      0.34     0.54
                  KNN (K=1)        BO         With        Yes        0.28       0.10      0.14     0.67
                 Decision Tree     BO         With        Yes        0.00       0.00      0.00     0.71

                     SVM           TF       Without       No         0.72       0.64      0.68     0.82
                      NB           TF       Without       No         0.33       0.82      0.47     0.46
                  KNN (K=5)        TF       Without       No         0.66       0.29      0.40     0.75
                  KNN (K=3)        TF       Without       No         0.62       0.33      0.43     0.75
                  KNN (K=1)        TF       Without       No         0.52       0.39      0.45     0.72
                 Decision Tree     TF       Without       No         0.62       0.57      0.60     0.78

                     SVM           TF         With        No         0.71       0.64      0.67     0.82
                       NB           TF         With        No         0.33       0.83      0.47     0.46
                  KNN (K=5)        TF         With        No         0.65       0.27      0.38     0.75
                  KNN (K=3)        TF         With        No         0.62       0.30      0.41     0.75
                  KNN (K=1)        TF         With        No         0.52       0.37      0.43     0.72
                 Decision Tree     TF         With        No         0.63       0.62      0.63     0.79

                     SVM           TF       Without       Yes        0.70       0.61      0.65     0.81
                      NB           TF       Without       Yes        0.32       0.86      0.46     0.43
                  KNN (K=5)        TF       Without       Yes        0.75       0.27      0.40     0.76
                  KNN (K=3)        TF       Without       Yes        0.65       0.32      0.43     0.76
                  KNN (K=1)        TF       Without       Yes        0.57       0.42      0.48     0.74
                 Decision Tree     TF       Without       Yes        0.61       0.60      0.60     0.78

                     SVM           TF         With        Yes        0.71       0.64      0.68     0.82
                      NB           TF         With        Yes        0.31       0.86      0.46     0.42
                  KNN (K=5)        TF         With        Yes        0.68       0.27      0.39     0.75
                  KNN (K=3)        TF         With        Yes        0.60       0.30      0.40     0.74
                  KNN (K=1)        TF         With        Yes        0.53       0.37      0.44     0.72
                 Decision Tree     TF         With        Yes        0.62       0.61      0.62     0.78

                     SVM          TF-IDF    Without       No         0.82       0.53      0.64     0.83
                      NB          TF-IDF    Without       No         0.33       0.82      0.47     0.47
                  KNN (K=5)       TF-IDF    Without       No         0.69       0.24      0.36     0.75
                  KNN (K=3)       TF-IDF    Without       No         0.61       0.30      0.40     0.74
                  KNN (K=1)       TF-IDF    Without       No         0.50       0.38      0.43     0.71
                 Decision Tree    TF-IDF    Without       No         0.61       0.59      0.60     0.77

                     SVM          TF-IDF      With        No         0.82       0.55      0.66     0.84
                      NB          TF-IDF      With        No         0.33       0.82      0.47     0.47
                  KNN (K=5)       TF-IDF      With        No         0.70       0.18      0.28     0.74
                  KNN (K=3)       TF-IDF      With        No         0.62       0.25      0.35     0.74
                  KNN (K=1)       TF-IDF      With        No         0.50       0.31      0.38     0.71
                 Decision Tree    TF-IDF      With        No         0.61       0.59      0.60     0.77

                     SVM          TF-IDF    Without       Yes        0.79       0.54      0.64     0.83
                      NB          TF-IDF    Without       Yes        0.32       0.85      0.46     0.43
                  KNN (K=5)       TF-IDF    Without       Yes        0.71       0.15      0.25     0.74
                  KNN (K=3)       TF-IDF    Without       Yes        0.64       0.23      0.34     0.74
                  KNN (K=1)       TF-IDF    Without       Yes        0.52       0.37      0.43     0.72
                 Decision Tree    TF-IDF    Without       Yes        0.61       0.59      0.60     0.78

                     SVM          TF-IDF      With        Yes        0.81       0.58      0.67     0.84
                      NB          TF-IDF      With        Yes        0.32       0.85      0.46     0.42
                  KNN (K=5)       TF-IDF      With        Yes        0.82       0.12      0.22     0.74
                  KNN (K=3)       TF-IDF      With        Yes        0.71       0.17      0.27     0.74
                  KNN (K=1)       TF-IDF      With        Yes        0.54       0.26      0.35     0.72
                 Decision Tree    TF-IDF      With        Yes        0.61       0.58      0.60     0.77

