Sexism Identification in Social Networks using TF-IDF
                         Embeddings, PreProccessing, Feature Selection,
                         Word/Char N-Grams and Various Machine Learning
                         Models In Spanish and English
                         Ron Keinan1,*,†
                         1
                             Department of Computer Science, Jerusalem College of Technology, Lev Academic Center.


                                        Abstract
                                        In this paper, we describe our submission to the EXIST-2024 contest. We tackled Task 1 - “Sexism Identification
                                        in tweets" in English and Spanish. To classify the tweets as texts containing sexism, we created different set up of
                                        models, changing the ML classifier, the feature type(word/char), the feature amount and the preprocessing of
                                        the text. With this set up, we vectorized the text data using tf-idf embedding technique. After training all these
                                        set-ups on the training dataset, we chose the best models according to their accuracy and F1-score on the dev set,
                                        and used them to predict the test labels. The best model got a F1 score of 72.23 and the rank of 39 out of 70.

                                        Keywords
                                        Sexism identification, machine learning, TF-IDF, feature selection, char based n-grams,


                         1. Introduction
                         Sexism identification in social networks has emerged as a significant challenge within the field of
                         Natural Language Processing. This task involves detecting and classifying sexist content within social
                         media posts, which is crucial for maintaining respectful and inclusive online environments. The
                         identification of sexist remarks is not only important for individual platforms to manage content
                         but also for broader societal implications, such as monitoring and mitigating the spread of harmful
                         stereotypes and promoting gender equality[1].
                            Social networks have become the primary platforms for social complaints, activism, and widespread
                         movements such as MeToo, 8M, and Time’sUp. These movements have gained momentum quickly, with
                         countless women around the world sharing their experiences of abuse, discrimination, and other forms
                         of sexism encountered in their daily lives. While social networks play a crucial role in amplifying voices
                         against injustice, they also serve as conduits for the transmission of sexism and other disrespectful and
                         hateful behaviors[2].
                            In this context, the development of automatic tools for sexism identification is essential. These
                         tools can aid in detecting and flagging sexist behaviors, providing real-time alerts to help manage
                         and moderate online content. Furthermore, they enable the estimation of the prevalence of sexist and
                         abusive situations on social media platforms. By analyzing patterns and forms of sexism, these tools
                         can offer insights into how sexism is expressed and propagated in these digital spaces.
                            The significance of this task lies in its potential to enhance the safety and inclusivity of social media
                         environments. Effective sexism identification tools can not only assist in immediate content moderation
                         but also contribute to long-term strategies for reducing the spread of harmful stereotypes and fostering
                         a more respectful online community. The efforts in this area, including the contributions from this
                         lab, are pivotal in developing robust applications aimed at detecting and mitigating sexism in social
                         networks.
                         CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                         *
                           Corresponding author.
                         †
                           These authors contributed equally.
                         $ ronke21@gmail.com (R. Keinan)
                          0009-0006-3122-6143 (R. Keinan)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
   In this paper, we describe our participation in the EXIST-2024 contest[3][4], specifically addressing
Task 1 - "Sexism Identification in tweets" in English and Spanish. The approach to solving this task
involved creating multiple models by varying several key components: the machine learning classifier
used, the type of features (word-level or character-level), the number of features, and the preprocessing
techniques applied to the text data. Subsequently, we vectorized the text data using the Term Frequency-
Inverse Document Frequency (TF-IDF) embedding technique.
   The importance of this task is underscored by the growing volume of user-generated content on
social media platforms, where the rapid identification and mitigation of sexist content can significantly
impact user experience and safety. By leveraging a combination of preprocessing, feature selection, and
various machine learning models, our approach contributes to the ongoing efforts in developing robust
automated systems for sexism detection.


2. Theoretical Review
2.1. Feature Selection
Feature selection is a critical process in text classification tasks, significantly impacting model perfor-
mance by identifying the most informative attributes from the text data. In our approach to sexism
identification, we meticulously focused on selecting features based on two primary types: word n-grams
and character n-grams.

2.1.1. Word N-grams
Word n-grams represent contiguous sequences of words within the text, capturing contextual rela-
tionships and syntactic structures. By considering sequences of words, n-grams facilitate the model’s
understanding of semantic meaning conveyed through word combinations. For instance, in a bigram
model, pairs of consecutive words are considered, while a trigram model examines sequences of three
words. Using the sentence “The quick brown fox jumps over the lazy dog” as an example, the bigrams
include “The quick”, “quick brown”, “brown fox”, “fox jumps”, “jumps over”, “over the”, “the lazy”, and
“lazy dog”. Trigrams, on the other hand, include sequences like “The quick brown”, “quick brown fox”,
“brown fox jumps”, “fox jumps over”, “jumps over the”, “over the lazy”, and “the lazy dog”. This granular
approach helps in capturing the syntactic structure and semantic nuances of word combinations, which
are pivotal for understanding context-dependent expressions of sexism.
   However, word n-grams have limitations, particularly when dealing with sparse data and out-of-
vocabulary words, which are prevalent in social media texts. To mitigate these issues, we implemented
techniques such as TF-IDF weighting to emphasize the importance of rare but informative n-grams and
reduce the impact of common but less informative ones.

2.1.2. Character N-grams
Character n-grams, especially those with word boundaries (char-wb), segment the text into sequences
of characters while respecting word boundaries. This method is adept at capturing morphological
patterns and handling variations such as typos, slang, and informal language, which are ubiquitous in
social media. For instance, character n-grams of length six in the word “identification” might include
“identi”, “dentif”, “entifi”, and so on. By incorporating word boundaries, char-wb n-grams can maintain
the integrity of individual words while allowing the model to learn from character-level patterns.
   Our experiments demonstrated that character n-grams, particularly of medium length (around six
characters), consistently outperformed word n-grams. This indicates their superior ability to capture
the nuanced morphological features and informal linguistic variations typical in sexist language. The
flexibility of character n-grams in handling different morphological structures and idiomatic expressions
was particularly beneficial in our dataset, which included diverse and colloquial expressions of sexism.
2.1.3. Comparative Analysis
Through extensive experimentation, we observed that models utilizing character n-grams with word
boundaries achieved higher accuracy and F1 scores compared to those relying solely on word n-
grams. This suggests that character n-grams provide a richer and more robust feature set for sexism
identification in tweets, capable of capturing subtle and context-dependent expressions of sexism that
may be overlooked by word n-grams alone.

2.1.4. TF-IDF Embeddings
To optimize the feature selection process, we employed the Term Frequency-Inverse Document Fre-
quency (TF-IDF) technique. TF-IDF helps in quantifying the importance of each n-gram by balancing
its frequency within a document against its frequency across all documents in the dataset. By doing so,
it highlights the most informative features that are likely to contribute to the classification task.

2.2. Text Embeddings
Text embeddings are representations of textual data in a continuous vector space, enabling algorithms to
process and analyze text effectively. These embeddings capture both semantic and syntactic similarities
between words or documents, facilitating various Natural Language Processing (NLP) tasks such as
sentiment analysis, document classification, and information retrieval.

2.2.1. Types of Text Embeddings
There are several types of text embeddings, each with its unique characteristics and applications:

Word Embeddings Word embeddings, such as Word2Vec and GloVe, map each word to a high-
dimensional vector, capturing semantic relationships based on the context in which words appear. For
instance, words with similar meanings (e.g., "king" and "queen") are located close to each other in the
vector space, while unrelated words are far apart. Word embeddings are particularly useful for tasks
that require understanding word semantics, such as word analogy tasks and semantic similarity.

Contextualized Word Embeddings Contextualized word embeddings, such as those generated by
models like ELMo, BERT, and GPT, provide representations that vary depending on the word’s context
in a sentence. Unlike static word embeddings, these embeddings can capture the polysemy of words (i.e.,
words with multiple meanings). For example, the word "bank" will have different embeddings in the
sentences "I sat on the bank of the river" and "I deposited money in the bank." This context-awareness
significantly improves performance in tasks like named entity recognition, question answering, and
machine translation.

Document Embeddings Document embeddings extend the concept of word embeddings to larger
text units, such as sentences, paragraphs, or entire documents. Techniques like Doc2Vec and Universal
Sentence Encoder create fixed-length vectors that represent the overall meaning of a text segment.
These embeddings are valuable for tasks such as document classification, clustering, and information
retrieval, where the goal is to compare and analyze entire documents rather than individual words.

2.2.2. Significance in NLP
The use of text embeddings represents a significant advancement in NLP, as they provide a dense and
continuous representation of text that traditional bag-of-words models cannot achieve. Embeddings
allow for the efficient handling of large vocabularies and capture intricate relationships between words
and phrases. This has led to substantial improvements in various NLP tasks, making embeddings a
crucial component of modern NLP systems.
2.2.3. TF-IDF Embeddings
In our study, we utilized Term Frequency-Inverse Document Frequency (TF-IDF), as an embedding
method. [5]
   TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the
importance of a word in a document relative to a collection of documents. It calculates a weight for each
word based on its frequency in the document and its inverse frequency across all documents. Words
with high TF-IDF scores are considered more informative for distinguishing documents (Ramos, 2003).
   The TF-IDF (Term Frequency-Inverse Document Frequency) score is calculated as follows:

                                 TF-IDF(𝑡, 𝑑, 𝐷) = TF(𝑡, 𝑑) × IDF(𝑡, 𝐷)                                (1)
  Where:
                              Number of times term 𝑡 appears in document 𝑑
                    TF(𝑡, 𝑑) =
                                  (︂Total number of terms in document 𝑑
                                    Total number of documents in the corpus |𝐷|
                                                                                )︂
                  IDF(𝑡, 𝐷) = log
                                       Number of documents containing term 𝑡

  By employing these diverse embedding techniques, we aimed to capture the rich semantic and
syntactic features of the text, enhancing the performance of our models in identifying and classifying
sexist content in social media posts.

2.3. Machine Learning Classifiers
In the approach to sexism identification, we experimented with a variety of machine learning classi-
fiers to determine the most effective model for our task. Each classifier brings unique strengths and
characteristics, making them suitable for different aspects of the classification problem. The classifiers
where chosen from highest accuray models from Lazy Predict. Below, we describe the key classifiers
we employed:

   1. Random Forest Classifier (RandomForestClassifier):
         • The Random Forest Classifier is another ensemble learning method that constructs mul-
           tiple decision trees during training and outputs the mode of the classes for classification
           tasks[6][7]. By averaging the results from multiple trees, it enhances predictive accuracy
           and controls over fitting. Random forests are particularly effective for datasets with a large
           number of features and complex, non-linear relationships.
   2. Extra Trees Classifier (ExtraTreesClassifier):
         • The Extra Trees Classifier is an ensemble learning method that aggregates the results of
           multiple unpruned decision trees, generated from random subsets of the training data and
           features[8]. This technique enhances the model’s robustness and accuracy by reducing
           variance and preventing overfitting. It is known for its high performance in handling large
           datasets and capturing complex interactions among features.
   3. LightGBM Classifier (LGBMClassifier):
         • LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It
           is designed to be efficient and scalable, making it Ill-suited for large datasets and high-
           dimensional data[9]. LightGBM incorporates techniques such as leaf-wise tree growth
           and histogram-based decision tree learning, which improve speed and accuracy while
           maintaining low memory usage. It excels in handling categorical features and complex data
           structures.
   4. AdaBoost Classifier (AdaBoostClassifier):
          • AdaBoost, short for Adaptive Boosting, combines the predictions of several ’ak classifiers to
            create a strong classifier[10]. It works by sequentially training classifiers, each focusing on
            the errors made by the previous ones. This iterative approach allows AdaBoost to improve
            the model’s performance by emphasizing the difficult-to-classify instances. It is versatile and
            can be used with various base learners, though it is most commonly paired with decision
            trees.
   5. Bernoulli Naive Bayes (BernoulliNB):
          • The Bernoulli Naive Bayes classifier is based on Bayes’ theorem and assumes that features
            follow a Bernoulli distribution (binary or boolean values). It is especially suited for bina-
            ry/boolean features and is effective for tasks where the presence or absence of a feature
            is more important than its frequency. This classifier is simple, fast, and performs Ill on
            high-dimensional sparse datasets.[11]
   6. Support Vector Classifier (SVC):
          • The Support Vector Classifier is a powerful and versatile classifier that constructs a hy-
            perplane or set of hyperplanes in a high-dimensional space to separate different classes.
            It is particularly effective in high-dimensional spaces and for cases where the number of
            dimensions exceeds the number of samples. SVC is robust to overfitting, especially in high-
            dimensional space, and can be extended to handle non-linear classification using kernel
            functions[12][13].

  By evaluating these classifiers using LazyPredict, we was able to quickly identify which models
performed best on our dataset. This informed our decision-making process and guided us in selecting
and fine-tuning the models that ultimately provided the highest accuracy and F1 scores for sexism
identification in social networks.


3. EXIST 2024 Contest and Task 1 Overview
3.1. EXIST 2024
The EXIST 2024 competition[3] focuses on the identification of sexism in social media, with a particular
emphasis on analyzing tweets. The primary task within this competition is a binary classification
problem, where systems must determine whether a given tweet contains sexist expressions or behaviors.
This includes tweets that are sexist themselves, describe a sexist situation, or criticize sexist behavior.
  For instance, the following tweets illustrate examples of sexist and non-sexist messages:

Sexist:
    • “Mujer al volante, tenga cuidado!”
    • “People really try to convince women with little to no ass that they should go out and buy a body.
      Like bih, I don’t need a fat ass to get a man. Never have.”

Not Sexist:
    • “Alguien me explica que zorra hace la gente en el cajero que se demora tanto.”
    • "@messyworldorder it’s honestly so embarrassing to watch and they’ll be like ’not all white
      women are like that’"

3.2. Task 1
In Task 1, participants are required to develop models that can accurately classify tweets into these two
categories. The challenge lies in creating a system that can effectively discern the subtle nuances of
language and context that indicate sexism. The objective is to build models that are not only precise
in detecting overtly sexist remarks but also adept at identifying more covert and context-dependent
expressions of sexism.
   The development and evaluation of these models involve several stages, including data preprocessing,
feature extraction, and the application of various machine learning algorithms. The ultimate goal is to
create robust and reliable tools that can contribute to the broader effort of mitigating sexism on social
media platforms, thereby promoting a healthier and more respectful online discourse.


4. Sexism Identification Methodology
Our methodology for identifying sexism in social media posts was based on a systematic approach using
training and development datasets exclusively. The primary objective was to train various machine
learning models on the training dataset and then select the best-performing models based on their
accuracy and F1 score, as stipulated by the competition requirements, on the development dataset.
our approach to solving the task was based on a previous study that dealt with a similar sentiment
classification task [14][15] and was based on a comparison of different embedding methods and then a
comparison between different regression classifiers.

4.1. Text Embedding
we began by employing text embedding techniques to represent the textual data in a vectorized format.
Specifically, we utilized the Term Frequency-Inverse Document Frequency (TF-IDF) method for each
language in our dataset. TF-IDF transforms text into numerical vectors based on the frequency of terms
within documents relative to a collection of documents. we experimented with different configurations,
including:

    • Various feature types such as words, characters, and character n-grams (e.g., bigrams, trigrams).
    • Different feature ranges, ranging from single words to sequences of characters of varying lengths.
    • Various amounts of features were chosen, ranging from 1,000 to 20,000, to determine the optimal
      number of features for classification.

4.2. Text PreProcessing
Text preprocessing is a critical step in Natural Language Processing, especially in tasks such as Sexism
Identification. In both general and social media text documents, various types of noise are commonly
present. This noise can include typos, emojis, slang, HTML tags, spelling mistakes, and repetitive letters.
If the text is not properly preprocessed, it can lead to incorrect analysis outcomes and significantly
impact the performance of the models.
   Former researchers[16][17] explored the effects of all possible combinations of six preprocessing
methods on text classification across three different datasets. Their main conclusion emphasized the
importance of systematically applying a variety of preprocessing techniques. By combining these
preprocessing methods with multiple machine learning approaches, the accuracy of text classification
can be substantially improved.
   In our work, we adopted a comprehensive preprocessing strategy to clean and standardize the text
data before applying further analytical techniques. This approach ensured that the models received
high-quality input, thereby enhancing their ability to accurately identify and classify sexist content in
social media posts.

4.3. Lazy Predict
LazyPredict is an open-source Python library designed to streamline the process of building and
comparing multiple machine learning models. It is particularly useful for quickly benchmarking
different algorithms without the need for extensive manual coding. By providing a simple interface,
LazyPredict allows data scientists to efficiently identify the most promising models for their specific
tasks[18].
   In the context of sexism identification task, LazyPredict proved to be a valuable tool during the
initial model selection phase. Given the variety of machine learning classifiers available, we needed a
systematic way to evaluate their performance on the dataset. LazyPredict facilitated this by automatically
training and testing a wide array of models using default hyper parameters, enabling us to gain a broad
understanding of which algorithms might be most effective for our problem.
   LazyPredict compared the following ML classifiers: AdaBoostClassifier, BaggingClassifier,
BernoulliNB, CalibratedClassifierCV, DecisionTreeClassifier, DummyClassifier, ExtraTreeClassifier, Ex-
traTreesClassifier, GaussianNB, KNeighborsClassifier, NuSVC, PassiveAggressiveClassifier, Perceptron,
QuadraticDiscriminantAnalysis, RandomForestClassifier, RidgeClassifier, RidgeClassifierCV, SGDClas-
sifier, SVC, LGBMClassifier. The results of the LazyPredict model on the data is presented in Table 1
(Appendices).

4.4. Model Training and Selection
With the vectorized representations of the text data, we proceeded to train multiple machine learning
models using the training dataset. we explored a diverse range of classifiers, including but not limited
to:

    • Extra Trees Classifier
    • LightGBM Classifier
    • Random Forest Classifier
    • AdaBoost Classifier
    • Bernoulli Naive Bayes
    • Support Vector Classifier (SVC)

  For each model, we evaluated its performance on the development dataset based on accuracy and
F1 score. we experimented with different feature combinations to optimize model performance. The
models that demonstrated the highest accuracy and F1 score on the development dataset Ire selected as
our best-performing models for further evaluation.

4.5. Test Prediction
Finally, we got a list of all the best models. To choose the models that will label the test pool and the
labels that will be accessible to the competition, we chose 3 groups of models. The 10 best models,
the 50 best models, and the 100 best models. we asked each group of models to tag the test database,
for each tweet we chose the majority of tags (yes or no) and created a JSON file that contains all the
answers.


5. Results
Table 2 (Appendices) presents the Accuracy rank and F1 score of the models for Task 1. The table shows
for each language the ideal model we received, feature type, range and amount, whether it performed
pre-processing, which classifier it used, what was the score we received in the dev phase.
   The most prominent classifiers in the best models are the ExtraTreesClassifier, RandomForestClassifier,
LGBMClassifier. They are based on classic machine learning algorithms - Random Forest and boosting,
and Naive Bayes which are recognized as classic classifiers but strong and good in many ML tasks.
Despite the well-known advantages of preprocessing methods in ML tasks, it seems that there is an
overall balance between models that were quicker to preprocess their text and models that worked
better on the raw text. It may be that more advanced preprocessing methods such as stemming or
lemmatization will be more helpful for learning.
   With respect to the type of features, sequences of characters seem to work much better than sequences
of words. And precisely a medium group of about 6 characters was better than low ranges of 3 or high
ranges of 9.
   Regarding the amount of features, it seems that more than 10000 features were often required to
obtain the good models, and low amounts converged to lower accuracy.
   The best model we sent was the combination of the results of the top 50 models and it came in 39th
place in the competition. The second model was a combination of the 100 best models in the competition
and it was ranked 41st. The model of the 10 best models was ranked 47th.


6. Conclusions
In this paper, we described our participation in the EXIST-2024 competition, focusing on the task
of sexism identification in tweets. Our approach involved experimenting with various models, text
preprocessing techniques, feature types, and feature amounts. Through systematic experimentation and
evaluation, we identified the most effective models based on accuracy and F1 score on the development
dataset.
   Our findings revealed several key insights. First, the ExtraTreesClassifier, RandomForestClassifier,
and LGBMClassifier emerged as the top-performing models. These classifiers, based on ensemble
learning techniques such as bagging and boosting, demonstrated strong performance across various
configurations. Additionally, we observed a balance between models that utilized text preprocessing
and those that did not. While preprocessing methods like stemming and lemmatization can potentially
enhance model performance by normalizing text, their impact varied, suggesting the need for more
advanced and context-specific preprocessing techniques.
   Moreover, character sequences generally outperformed word sequences, with character n-grams
of medium length (around six characters) providing better results compared to shorter or longer
sequences. This finding highlights the effectiveness of character n-grams in capturing the nuances
of sexist language. Furthermore, models with more than 10,000 features tended to perform better,
underscoring the importance of a rich feature set for capturing the subtleties in tweets.
   Overall, our study underscores the complexity of sexism identification in social media posts and the
importance of leveraging diverse techniques and models to achieve robust performance. These insights
contribute to the ongoing development of more accurate and reliable models for sexism detection in
online platforms.


7. Future Work
Our current work opens several avenues for future research and improvements. One significant
direction is the investigation of advanced preprocessing techniques, such as stemming, lemmatization,
and context-aware normalization. These sophisticated methods could enhance the robustness and
generalization of our models by better handling linguistic variations and subtleties.
   Additionally, enriching the training dataset with more examples from diverse sources and languages
is essential. This augmentation could improve the models’ ability to generalize across different contexts
and cultural nuances, thereby enhancing their performance.
   Conducting in-depth error analysis is another crucial area for future work. By thoroughly analyzing
recurrent misclassifications and patterns, we can understand the root causes of these errors, such
as sarcasm, irony, and cultural references. This understanding can inform the development of more
accurate and reliable models.
   Exploring additional feature types and combinations is also recommended. This includes investigating
domain-specific features that better capture the nuances of sexist language. Incorporating semantic
and syntactic features, as well as external knowledge sources, could provide a more comprehensive
understanding of the data.
   Lastly, extending our research to include deep learning models, such as BERT and Transformers, for
sexism identification is a promising direction. Addressing the unique challenges posed by different
languages, such as varying morphological structures and idiomatic expressions, will be critical in this
endeavor.
   By addressing these future directions, we aim further to enhance the effectiveness and applicability
of sexism identification models, contributing to the broader goal of combating sexism and promoting
equality in online spaces.


References
 [1] A. Jha, R. Mamidi, When does a compliment become sexist? analysis and classification of
     ambivalent sexism using twitter data, in: Proceedings of the Second Workshop on NLP and
     Computational Social Science, 2017.
 [2] F. Rodríguez-Sánchez, J. C. de Albornoz, L. Plaza, Automatic classification of sexism in social
     networks: An empirical study on twitter data, IEEE Access 8 (2020) 219563–219576.
 [3] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
     R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi-
     cation and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilin-
     guality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of
     the CLEF Association (CLEF 2024), 2024.
 [4] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
     R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi-
     cation and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli,
     N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference
     and Labs of the Evaluation Forum, 2024.
 [5] J. Ramos, Using tf-idf to determine word relevance in document queries, in: Proceedings of the
     First Instructional Conference on Machine Learning, volume 242, 2003.
 [6] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.
 [7] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
 [8] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (2006) 3–42.
 [9] F. Alzamzami, M. Hoda, A. E. Saddik, Light gradient boosting machine for general sentiment
     classification on short texts: A comparative evaluation, IEEE Access 8 (2020) 101840–101858.
[10] R. E. Schapire, Explaining adaboost, in: Empirical Inference: Festschrift in Honor of Vladimir N.
     Vapnik, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 37–52.
[11] S.-B. Kim, K.-S. Han, H.-C. Rim, S. H. Myaeng, Some effective techniques for naive bayes text
     classification, IEEE Transactions on Knowledge and Data Engineering 18 (2006) 1457–1466.
     doi:10.1109/TKDE.2006.180.
[12] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[13] C.-C. Chang, C.-J. Lin, Libsvm: A library for support vector machines, ACM Transactions on
     Intelligent Systems and Technology (TIST) 2 (2011) 1–27.
[14] R. Keinan, Y. HaCohen-Kerner, Jct at semeval-2023 tasks 12a and 12b: Sentiment analysis for tweets
     written in low-resource african languages using various machine learning and deep learning
     methods, resampling, and hyperparameter tuning, in: Proceedings of the 17th International
     Workshop on Semantic Evaluation (SemEval-2023), 2023.
[15] R. Keinan, Text mining at SemEval-2024 task 1: Evaluating semantic textual relatedness in
     low-resource languages using various embedding methods and machine learning regression
     models, in: A. K. Ojha, A. S. Doğruöz, H. Tayyar Madabushi, G. Da San Martino, S. Rosenthal,
     A. Rosá (Eds.), Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-
     2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 420–431. URL:
     https://aclanthology.org/2024.semeval-1.65.
[16] Y. HaCohen-Kerner, Y. Yigal, D. Miller, The impact of preprocessing on classification of mental
     disorders, in: Proceedings of the 19th Industrial Conference on Data Mining (ICDM 2019), New
     York, 2019.
[17] Y. HaCohen-Kerner, D. Miller, Y. Yigal, The influence of preprocessing on text classification using
     a bag-of-words representation, PLOS ONE 15 (2020) e0232525.
[18] M. I. J. Putra, V. Alexander, Comparison of machine learning land use-land cover supervised
     classifiers performance on satellite imagery sentinel 2 using lazy predict library, Indonesian
     Journal of Data and Science 4 (2023) 183–189.
8. Appendices- Result Tables

Table 1
LazyPredict Results
     Model                           Accuracy      Balanced Accuracy   F1 Score      Time Taken
     ExtraTreesClassifier            0.734104046   0.731715653         0.731389883   82.78278661
     LGBMClassifier                  0.726396917   0.724293441         0.724220837   6.884508371
     RandomForestClassifier          0.716763006   0.713832506         0.712489727   29.08332467
     BaggingClassifier               0.706165703   0.703156112         0.701514649   157.2697315
     AdaBoostClassifier              0.695568401   0.692479717         0.690518216   51.01095295
     BernoulliNB                     0.691714836   0.690796903         0.691232741   2.754544497
     SVC                             0.685934489   0.682405123         0.679168563   233.5694647
     NuSVC                           0.681117534   0.678913192         0.678410111   232.2601142
     NearestCentroid                 0.675337187   0.674771167         0.675151581   2.109311104
     DecisionTreeClassifier          0.671483622   0.670094208         0.670357222   33.56897783
     Perceptron                      0.661849711   0.660454248         0.660690278   4.56968379
     ExtraTreeClassifier             0.654142582   0.653128622         0.653515573   2.80695343
     SGDClassifier                   0.655105973   0.651847009         0.648841001   5.900460243
     LogisticRegression              0.650289017   0.648934589         0.649169867   8.320355654
     PassiveAggressiveClassifier     0.647398844   0.646404797         0.646789552   7.302331448
     LinearSVC                       0.628131021   0.62717317          0.627549493   69.49884391
     LinearDiscriminantAnalysis      0.619460501   0.619342328         0.619497601   159.9661644
     RidgeClassifier                 0.619460501   0.619342328         0.619497601   11.12075329
     CalibratedClassifierCV          0.625240848   0.619234598         0.601675594   294.2782121
     RidgeClassifierCV               0.61849711    0.618402479         0.618539615   162.7189815
     GaussianNB                      0.594412331   0.599300871         0.578915593   2.786282539
     QuadraticDiscriminantAnalysis   0.544315992   0.55240869          0.492225354   131.178328
     KNeighborsClassifier            0.523121387   0.511449077         0.388696557   4.114360094
     LabelSpreading                  0.514450867   0.501976285         0.351622593   15.14039254
     LabelPropagation                0.514450867   0.501976285         0.351622593   14.17368817
     DummyClassifier                 0.512524085   0.5                 0.347341163   1.893649578
Table 2
50 Best Results
 Classifier               Type      Range   Amount   Preprocessing                   Accuracy   F1
 ExtraTreesClassifier     char      6       20000    remove_punctuation              0.7649     0.7640
 ExtraTreesClassifier     char      6       10000    remove_spaces                   0.7649     0.7640
 RandomForestClassifier   char      6       10000    remove_punctuation              0.7649     0.7631
 RandomForestClassifier   char      6       17500    remove_punctuation              0.7620     0.7600
 ExtraTreesClassifier     char      6       10000    remove_punctuation              0.7611     0.7600
 RandomForestClassifier   char      6       15000    None                            0.7592     0.7567
 ExtraTreesClassifier     char      6       17500    None                            0.7582     0.7573
 ExtraTreesClassifier     char      6       7500     remove_numerical_punct_spaces   0.7582     0.7572
 ExtraTreesClassifier     char      6       12500    remove_spaces                   0.7582     0.7572
 ExtraTreesClassifier     char      6       7500     remove_spaces                   0.7572     0.7562
 ExtraTreesClassifier     char      6       7500     remove_punctuation              0.7572     0.7562
 ExtraTreesClassifier     char      6       12500    remove_punctuation              0.7563     0.7553
 ExtraTreesClassifier     char      6       15000    None                            0.7563     0.7551
 LGBMClassifier           char      3       17500    None                            0.7563     0.7537
 LGBMClassifier           char      3       17500    remove_punctuation              0.7563     0.7537
 LGBMClassifier           char      3       17500    remove_spaces                   0.7563     0.7537
 LGBMClassifier           char      3       17500    remove_numerical_punct_spaces   0.7563     0.7537
 ExtraTreesClassifier     char      6       10000    remove_numerical_punct_spaces   0.7553     0.7545
 RandomForestClassifier   char      6       15000    remove_numerical_punct_spaces   0.7553     0.7527
 ExtraTreesClassifier     char      6       10000    None                            0.7543     0.7534
 RandomForestClassifier   char      6       12500    remove_punctuation              0.7543     0.7526
 LGBMClassifier           char_wb   3       17500    None                            0.7543     0.7522
 LGBMClassifier           char_wb   3       17500    remove_punctuation              0.7543     0.7522
 LGBMClassifier           char_wb   3       17500    remove_spaces                   0.7543     0.7522
 LGBMClassifier           char_wb   3       17500    remove_numerical_punct_spaces   0.7543     0.7522
 RandomForestClassifier   char      6       17500    None                            0.7543     0.7520
 RandomForestClassifier   char      6       17500    remove_spaces                   0.7543     0.7519
 RandomForestClassifier   char      6       12500    None                            0.7534     0.7515
 LGBMClassifier           char      3       15000    None                            0.7534     0.7512
 LGBMClassifier           char      3       15000    remove_punctuation              0.7534     0.7512
 LGBMClassifier           char      3       15000    remove_spaces                   0.7534     0.7512
 LGBMClassifier           char      3       15000    remove_numerical_punct_spaces   0.7534     0.7512
 LGBMClassifier           char      3       12500    None                            0.7534     0.7511
 LGBMClassifier           char      3       12500    remove_punctuation              0.7534     0.7511
 LGBMClassifier           char      3       12500    remove_spaces                   0.7534     0.7511
 LGBMClassifier           char      3       12500    remove_numerical_punct_spaces   0.7534     0.7511
 RandomForestClassifier   char      6       20000    remove_spaces                   0.7534     0.7509
 ExtraTreesClassifier     char      6       15000    remove_numerical_punct_spaces   0.7524     0.7516
 ExtraTreesClassifier     char      6       17500    remove_spaces                   0.7524     0.7516
 ExtraTreesClassifier     char_wb   6       5000     remove_spaces                   0.7524     0.7507
 RandomForestClassifier   char      6       15000    remove_punctuation              0.7524     0.7505
 RandomForestClassifier   char      6       20000    remove_punctuation              0.7524     0.7500
 LGBMClassifier           char_wb   3       2500     None                            0.7524     0.7494
 LGBMClassifier           char_wb   3       2500     remove_punctuation              0.7524     0.7494
 LGBMClassifier           char_wb   3       2500     remove_spaces                   0.7524     0.7494
 LGBMClassifier           char_wb   3       2500     remove_numerical_punct_spaces   0.7524     0.7494
 ExtraTreesClassifier     char      6       20000    remove_numerical_punct_spaces   0.7514     0.7504
 LGBMClassifier           char_wb   3       5000     None                            0.7514     0.7496
 LGBMClassifier           char_wb   3       5000     remove_punctuation              0.7514     0.7496
 LGBMClassifier           char_wb   3       5000     remove_spaces                   0.7514     0.7496