<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLP_SSN_CSE at HOPE2023@IberLEF: Multilingual Hope Speech Detection using Machine Learning Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Varsha Balaji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aishwarya Kannan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aishwarya Balaji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bharathi Bhagavath Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, Sri SivaSubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Lately, there has been an issue of vulgarity and negative comments on social media platforms like YouTube, Twitter, and Instagram. Offensive comments lead to conflicts among the users. This in turn hinders the reach of the positive aspects of social media to the people. The given task was to classify the data as hope or non-hope speech. YouTube comments and tweets that provided hope, positivity, and equality, and those that did not, were used for the English and Spanish datasets respectively. To classify the data we used several machine learning models, such as BERT (bert-base-multilingual-uncased and bert-base-uncased), Random Forest, SVM, Logistic Regression, and Decision Tree. Out of these, mBERT produced the best results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The goal is to promote positive and hopeful comments while discouraging the negative ones.</p>
      <p>
        In this paper, as part of the shared task described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we address Hope Speech Detection using some of the pre-existing transformer models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We obtained the dataset from tweets and YouTube comments in English and Spanish. We used basic models such as Logistic Regression, SVM, Decision Tree, and Random Forest, as well as multilingual transformers such as BERT. We achieved a good result using the multilingual BERT transformer model, and hence the task was solved using it. An overview of the HOPE2023@IberLEF: Multilingual Hope Speech Detection task is given in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The remaining part of this paper is organized as follows. Section 2 discusses related work on the Hope Speech Detection task. The dataset for this task is described in Section 3. Sections 4 and 5 touch upon the features and methods used for this task. Results are presented in Section 6. Section 7 conveys the conclusion of the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        A great deal of research has been carried out on Hope Speech Detection. The paper PolyHope: Two-level hope speech detection from tweets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] focuses on the classification of hope speech (binary and multi-class). Eight traditional machine learning classifiers, including LR, SVM with Radial Basis Function (RBF) and linear kernels, RFC, XGB, AdaBoost, and CatBoost, were used for the hope speech detection task. All the classifiers were used with default parameters and were trained on the TF-IDF vectors of word uni-grams. Sentence transformer-based hope speech detection for Equality, Diversity, and Inclusion is described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Hope speech detection on multilingual YouTube comments via a transformer-based approach is described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The classification was done for three languages: English, Tamil, and Malayalam. Traditional models like SVM, Logistic Regression, and Naive Bayes as well as transformers like MT5 and BERT were used. Promising results were obtained using multilingual BERT for Tamil and Malayalam, and BERT for English YouTube comments. In Hope Speech Detection using Machine Learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the balanced data was first passed through machine learning classifiers. Further, deep learning techniques like DNN, DNN with embedding (DNN+Emb), CNN, LSTM, and BiLSTM were applied. The RF model achieved the best performance with over-sampled data.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Analysis and Preprocessing</title>
      <p>Emojis, abbreviations, and short words are all permitted in YouTube comments. The data must first be processed before being trained on. Any machine learning solution must include data pre-processing of the training, development, and test data to be successful. Many YouTube comments may contain misspelled words and inconsistent text. Pre-processing removes all HTML tags, hashtags, social media mentions, and URLs in order to clean up the dataset and normalize these abnormalities. Emojis and emoticons, which are crucial in characterizing the speech, must also be annotated. These are taken out of the comment and replaced with the text they stand for. Short forms that may be present in the text data are replaced with the full version of such words. We use a look-up table to change short words into their extended forms, such as "what’s" becoming "what is" and "u" becoming "you." After that, the series of texts is changed to lowercase, and any extraneous white spaces are eliminated.</p>
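      <p>A minimal sketch of these cleaning steps is given below, assuming Python with the standard re module; the look-up table shown holds only a few sample entries, not the full table we used, and the emoji annotation step is omitted.</p>
      <preformat><![CDATA[
import re

# Sample entries only; the real look-up table is much larger.
CONTRACTIONS = {"what's": "what is", "u": "you", "don't": "do not"}

def clean_comment(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    words = [CONTRACTIONS.get(w.lower(), w) for w in text.split()]
    text = " ".join(words).lower()              # expand short forms, lowercase
    return re.sub(r"\s+", " ", text).strip()    # drop extraneous white space

print(clean_comment("What's up @user, see https://t.co/x #hope u rock!"))
# -> "what is up , see you rock!"
]]></preformat>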
      <p>The Natural Language Toolkit (NLTK) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] was used for data preprocessing in the hope speech tasks. For the study of hope speech, it provides a variety of techniques and functions, including tokenization, stemming, lemmatization, and part-of-speech tagging. With the use of NLTK, unstructured text data may be converted into a format that is appropriate for NLP modeling and analysis, giving researchers and developers the ability to efficiently handle and analyze hope speech data. The creation of speech-related applications and NLP research benefit greatly from NLTK’s extensive library of corpora, lexical resources, and algorithms.</p>
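      <p>The snippet below briefly illustrates the NLTK functions mentioned above (tokenization, part-of-speech tagging, stemming, and lemmatization); the sentence is a placeholder, and the download calls fetch the required NLTK resources.</p>
      <preformat><![CDATA[
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ("punkt", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg)                    # fetch tokenizer/tagger resources

tokens = nltk.word_tokenize("communities are celebrating and hoping together")
print(tokens)                             # word tokens
print(nltk.pos_tag(tokens))               # part-of-speech tags
print([PorterStemmer().stem(t) for t in tokens])                    # crude stems
print([WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens])  # verb lemmas
]]></preformat>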
      <sec id="sec-3-1">
        <title>3.1. Acquired Dataset</title>
        <p>Tables 2 and 3 below present a description of the dataset examples, encompassing both hope and non-hope speech instances in English and Spanish [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Table 2 has two instances with the label "Model Label." The first instance is a comment that is supportive of homosexual wild TikTok films and expresses affection for them. The second example is a resume for a network engineer who works as a teacher and expresses a desire to teach at a university. The two instances have nothing to do with the goal of hope speech.</p>
        <p>Additionally, Table 3 includes two samples with the heading "Model Label." In the first illustration, people get together to celebrate and show their support for LGBTQ+ people. The second illustration emphasizes opposition to fascist beliefs and support for free speech. These instances emphasize inclusion and democratic principles while addressing the issue of hope. The aim was to distinguish between the hopeful and the uninspired.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Feature Extraction</title>
      <p>The process of turning raw data into a collection of pertinent characteristics that can be utilized as input for machine learning models is known as feature extraction. This is frequently applied in disciplines like image identification, natural language processing, and others where a numerical representation of the data is required.</p>
      <sec id="sec-4-1">
        <title>4.1. Count Vectorizer</title>
        <p>Text data is transformed into numerical feature vectors using the feature extraction approach known as CountVectorizer in natural language processing. It operates by calculating a sparse matrix from the frequency counts of each word in a document. The resulting matrix can be fed into several machine learning models, making it a simple yet efficient numerical representation of text data. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] uses CountVectorizer in a fake news detection task, where it helps improve the final accuracy of the model to which it is applied.</p>
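        <p>A short sketch of CountVectorizer from scikit-learn, which we assume here for illustration; the two comments are placeholders.</p>
        <preformat><![CDATA[
from sklearn.feature_extraction.text import CountVectorizer

comments = ["hope and equality for all", "no hope in these comments"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)     # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray())                         # per-document word frequency counts
]]></preformat>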
      </sec>
      <sec id="sec-4-2">
        <title>4.2. TF-IDF Vectorizer</title>
        <p>The term frequency-inverse document frequency (TF-IDF) Vectorizer is a feature extraction method frequently applied in natural language processing to transform text data into numerical feature vectors. It operates by evaluating each word’s significance in relation to the corpus of documents as a whole. The resulting matrix gives each word’s frequency and weight in each document in numerical form. The TF-IDF Vectorizer output can be used as input for various machine learning models and is helpful for locating significant words in a document. Settings such as stop word removal and minimum document frequency can both be changed.</p>
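        <p>A TF-IDF counterpart to the sketch above, again assuming scikit-learn; stop_words and min_df correspond to the stop word removal and minimum document frequency settings mentioned, and the comments are placeholders.</p>
        <preformat><![CDATA[
from sklearn.feature_extraction.text import TfidfVectorizer

comments = ["hope and equality for all", "no hope in these comments"]
vectorizer = TfidfVectorizer(stop_words="english", min_df=1)
X = vectorizer.fit_transform(comments)     # sparse TF-IDF matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                         # weights instead of raw counts
]]></preformat>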
      </sec>
      <sec id="sec-4-3">
        <title>4.3. BERT Encoding</title>
        <p>The pre-trained BERT model was fine-tuned on the provided dataset of hope speech in order to apply BERT (Bidirectional Encoder Representations from Transformers) encoding for hope speech classification. This is accomplished by feeding the dataset to the BERT model, which creates contextualized word embeddings for each input sentence. Following that, a classification model, such as a neural network, learns to predict whether a given text contains hope speech or not, using the generated embeddings as input. When compared to simpler feature extraction techniques, BERT’s capacity to capture contextualized meaning can help increase the accuracy of hope speech classification. A model for detecting hope speech in text data can be created by fine-tuning BERT for the particular job of hope speech classification.</p>
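        <p>A hedged sketch of extracting contextualized embeddings with the Hugging Face transformers library, which we assume here for illustration; the [CLS] vector stands in for the sentence representation passed to a downstream classifier.</p>
        <preformat><![CDATA[
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModel.from_pretrained("bert-base-multilingual-uncased")

batch = tokenizer(["hope and equality for all"], return_tensors="pt",
                  padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**batch)
embedding = outputs.last_hidden_state[:, 0]   # [CLS] token as sentence vector
print(embedding.shape)                        # torch.Size([1, 768])
]]></preformat>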
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <sec id="sec-5-1">
        <title>5.1. Random Forest classifier</title>
        <p>A Random Forest classifier is a meta-estimator that employs averaging to increase predictive accuracy and reduce over-fitting after fitting numerous decision tree classifiers to distinct dataset subsamples. Each decision tree in the Random Forest model is built using a subset of features and a subset of data points. Simply described, a data set containing k records is divided into n random records and m features. Individual decision trees are constructed for the samples, generating their specific outputs. The resultant output over the data set samples is generated by averaging. The model trained for the given dataset generated an accuracy of 92.40% with a 90% F1-score and 91% precision.</p>
      </sec>
      <sec id="sec-5-1a">
        <title>5.2. SVM</title>
        <p>SVM (Support Vector Machine) algorithms can be used for both regression and classification problems. The given dataset poses a classification problem. An SVM classifier creates a model that assigns fresh data points to one of the predetermined categories. As a result, it can be thought of as a binary linear non-probabilistic classifier. SVMs are applicable to linear classification tasks. The model trained for the given dataset generated an accuracy of 89.12% with a 90% F1-score and 89% precision for the English language. Using the kernel approach, SVMs may effectively perform non-linear classification in addition to linear classification, which allows the inputs to be automatically mapped into large feature spaces.</p>
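        <p>A sketch of the classical setup used for the Random Forest and SVM runs (and, analogously, for Logistic Regression and Decision Tree below), assuming scikit-learn; train_texts and train_labels are placeholders for the shared-task training split.</p>
        <preformat><![CDATA[
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_texts = ["hope and equality for all", "no hope in these comments"]
train_labels = ["hs", "nhs"]               # hope speech vs. non-hope speech

for clf in (RandomForestClassifier(n_estimators=100), SVC(kernel="linear")):
    pipe = make_pipeline(TfidfVectorizer(), clf)   # vectorize, then classify
    pipe.fit(train_texts, train_labels)
    print(type(clf).__name__, pipe.predict(["comments full of hope"]))
]]></preformat>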
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Logistic Regression</title>
        <p>Machine learning uses the categorization method known as Logistic Regression. The dependent variable is modeled using a logistic function. Because the dependent variable is dichotomous, there are only two conceivable classifications it could belong to (for example, a tumor can either be malignant or not). This method is therefore employed when working with binary data. The sigmoid function is used in logistic regression to convert predicted values to probabilities. This function maps any real value to a value between 0 and 1. The model trained for the given dataset generated an accuracy of 92.10% with a 90% F1-score and 91% precision for the English language.</p>
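        <p>The sigmoid mapping described above, written out as a small worked example; the scores are arbitrary illustrative values.</p>
        <preformat><![CDATA[
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))   # maps any real z into (0, 1)

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3))      # -> 0.018, 0.5, 0.982
]]></preformat>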
      </sec>
      <sec id="sec-5-3">
        <title>5.4. Decision Tree</title>
        <p>The Decision Tree is a non-parametric supervised learning approach used for classification and regression applications. It is organized hierarchically and has a root node, branches, internal nodes, and leaf nodes. By using a greedy search to find the ideal split points inside a tree, decision tree learning uses a divide-and-conquer technique. This splitting procedure is repeated in a top-down, recursive fashion until most or all of the records have been classified under distinct class labels. The model trained for the given dataset generated an accuracy of 89.91% with a 90% F1-score and 90% precision for the English language.</p>
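        <p>A small sketch, assuming scikit-learn, of the greedy recursive splitting described above; export_text prints the split points the tree actually learned on four placeholder comments.</p>
        <preformat><![CDATA[
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

texts = ["hope and equality", "so hopeful today",
         "this is awful", "awful comments here"]
labels = [1, 1, 0, 0]                   # 1 = hope speech, 0 = non-hope speech

vec = CountVectorizer()
X = vec.fit_transform(texts)
tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
]]></preformat>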
      </sec>
      <sec id="sec-5-4">
        <title>5.5. BERT : bert-base-multilingual-uncased and bert-base-uncased</title>
        <p>This is a pre-trained model that was first described in the publication BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It was trained using a masked language modeling (MLM) objective on the 102 languages with the largest Wikipedias. It is a transformer model that was pretrained in a self-supervised fashion on a sizable corpus of multilingual data. BERT employs bi-directional learning to simultaneously understand word context from both the left and the right. The Masked Language Modelling (MLM) technique [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which involves randomly masking 15% of the input’s words before passing it through the model to predict the masked words, is best suited for this bidirectional approach. Additionally, it is optimized with Next Sentence Prediction (NSP), which predicts the relationship between two sentences (whether one follows the other or not). The bert-base-uncased model trained for the given dataset generated an accuracy of 89.91% with a 90% F1-score and 90% precision for the English language.</p>
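        <p>A hedged sketch of the fine-tuning setup, again assuming the Hugging Face transformers library; the batch, label, and single backward pass are placeholders for the full training loop, which is elided.</p>
        <preformat><![CDATA[
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-uncased", num_labels=2)  # hope vs. non-hope

batch = tokenizer(["hope and equality for all"], return_tensors="pt",
                  padding=True, truncation=True)
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)   # returns loss and logits
outputs.loss.backward()                   # one illustrative gradient step
print(outputs.logits.softmax(dim=-1))     # class probabilities
]]></preformat>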
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Observation</title>
      <p>In this section, we examine how well the different supervised machine learning models perform for the two languages (English and Spanish). The weighted F1 score determines the excellence of the models. The tables below present the evaluation results of all the models on the training dataset. The models used on the training dataset include the Random Forest classifier, SVM, Logistic Regression, and Decision Tree among the classical algorithms, and the transformer BERT models bert-base-multilingual-uncased and bert-base-uncased, paired with the feature extraction methods listed in the table below. Among all the models trained, bert-base-multilingual-uncased gave the optimal results for the English and Spanish datasets, with weighted F1 scores of 92.87% and 96.57% respectively. The Logistic Regression and Random Forest classifiers provided similar F1 scores of 92.10% and 92.07% for the English dataset using the TF-IDF vectorizer. However, the results obtained for Spanish were comparatively low.</p>
      <table-wrap id="tab-models">
        <caption>
          <p>Models evaluated and the feature extraction method used with each.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Feature Extraction</th></tr>
          </thead>
          <tbody>
            <tr><td>Random Forest</td><td>Count Vectorizer</td></tr>
            <tr><td>Decision Tree</td><td>Count Vectorizer</td></tr>
            <tr><td>Logistic Regression</td><td>Count Vectorizer</td></tr>
            <tr><td>SVM</td><td>Count Vectorizer</td></tr>
            <tr><td>Random Forest</td><td>TF-IDF Vectorizer</td></tr>
            <tr><td>Decision Tree</td><td>TF-IDF Vectorizer</td></tr>
            <tr><td>Logistic Regression</td><td>TF-IDF Vectorizer</td></tr>
            <tr><td>SVM</td><td>TF-IDF Vectorizer</td></tr>
            <tr><td>bert-base-multilingual-uncased</td><td>Count Vectorizer</td></tr>
            <tr><td>bert-base-uncased</td><td>Count Vectorizer</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The tables given above depict the classification reports of the various classification models, obtained on the training dataset. Tables 4 and 5 represent the results and accuracy obtained for the training dataset. Among all the models tested on the training dataset, the BERT model gave the better results. The tables below present the evaluation results of all the models on the test dataset.</p>
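      <p>The weighted F1 scores reported in these tables can be computed as sketched below, assuming scikit-learn; y_true and y_pred are placeholders for the gold and predicted labels of a split.</p>
      <preformat><![CDATA[
from sklearn.metrics import classification_report, f1_score

y_true = ["hs", "nhs", "hs", "nhs"]     # gold labels (placeholder)
y_pred = ["hs", "nhs", "nhs", "nhs"]    # model predictions (placeholder)

print(classification_report(y_true, y_pred, digits=4))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
]]></preformat>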
      <p>On the test dataset, as shown in Table 6, the model generated an F1 score of 0.5913 for Spanish and 0.4937 for English. This model can be further improved to deal with data in multiple languages in the future. We secured the eighth and fifth positions on the leaderboard for the Spanish and English datasets, respectively. The F1 score for the Spanish dataset was 59.14%, wherein the highest was 91.61%, and the F1 score for the English dataset was 49.37%, wherein the highest was 50.12%.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The requirement for Hope Speech Detection in social media is increasing. Nearly 75% of connections today happen to be online. Hope Speech Detection seeks to differentiate between hope and non-hope speech to promote a good atmosphere and shape human minds, instead of making them feel bad and low about themselves. Hope Speech Detection models, although important, have had inadequate amounts of research done on them. In this paper, pre-trained multilingual transformer models are used to detect Hope Speech in two languages, namely English and Spanish.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Hopeedi: A multilingual hope speech detection dataset for equality, diversity, and inclusion</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Á</surname>
          </string-name>
          .
          <string-name>
            <surname>García-Cumbreras</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>García-Baena</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          <string-name>
            <surname>García-Díaz</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Valencia-García</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Ureña-López</surname>
          </string-name>
          , Overview of HOPE at IberLEF 2023:
          <article-title>Multilingual Hope Speech Detection</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Montes-y Gómez, Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023), co-located with the 39th Conference of the Spanish Society for Natural Language Processing (SEPLN 2023), CEUR-WS</article-title>
          .org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Polyhope: Two-level hope speech detection from tweets</article-title>
          ,
          <source>Expert Systems with Applications</source>
          (
          <year>2023</year>
          )
          <fpage>120078</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Shorey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Rand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Feldman</surname>
          </string-name>
          , Hope theory, measurements, and applications to school psychology.,
          <source>School psychology quarterly 18</source>
          (
          <year>2003</year>
          )
          <fpage>122</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Á. García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valencia-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the shared task on hope speech detection for equality, diversity, and inclusion</article-title>
          , in:
          <source>Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>378</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>B.</given-names> <surname>Bharathi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Srinivasan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Varsha</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Durairaj</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Senthilkumar</surname></string-name>,
          <article-title>Ssncse_nlp@lt-edi-acl2022: Hope speech detection for equality, diversity and inclusion using sentence transformers</article-title>, in:
          <source>LTEDI</source>,
          <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>S.</given-names> <surname>Arunima</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramakrishnan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Balaji</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Thenmozhi</surname></string-name>, et al.,
          <article-title>ssn_dibertsity@lt-edi-eacl2021: Hope speech detection on multilingual youtube comments via transformer based approach</article-title>, in:
          <source>Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion</source>,
          <year>2021</year>, pp.
          <fpage>92</fpage>-<lpage>97</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>P.</given-names> <surname>Roy</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bhawal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kumar</surname></string-name>,
          <string-name><given-names>B. R.</given-names> <surname>Chakravarthi</surname></string-name>,
          <article-title>Iiitsurat@lt-edi-acl2022: Hope speech detection using machine learning</article-title>, in:
          <source>Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion</source>,
          <year>2022</year>, pp.
          <fpage>120</fpage>-<lpage>126</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>E.</given-names> <surname>Loper</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bird</surname></string-name>,
          <article-title>Nltk: The natural language toolkit</article-title>,
          <source>arXiv preprint cs/0205028</source>
          (<year>2002</year>).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>D.</given-names> <surname>García-Baena</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>García-Cumbreras</surname></string-name>,
          <string-name><given-names>S. M.</given-names> <surname>Jiménez-Zafra</surname></string-name>,
          <string-name><given-names>J. A.</given-names> <surname>García-Díaz</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Valencia-García</surname></string-name>,
          <article-title>Hope speech detection in spanish: The LGBT case</article-title>,
          <source>Language Resources and Evaluation</source>
          (<year>2023</year>)
          <fpage>1</fpage>-<lpage>28</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A.</given-names> <surname>Patel</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Meehan</surname></string-name>,
          <article-title>Fake news detection on reddit utilising countvectorizer and term frequency-inverse document frequency with logistic regression, MultinomialNB and support vector machine</article-title>, in:
          <source>2021 32nd Irish Signals and Systems Conference (ISSC)</source>, IEEE,
          <year>2021</year>, pp.
          <fpage>1</fpage>-<lpage>6</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>F.</given-names> <surname>Souza</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Nogueira</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Lotufo</surname></string-name>,
          <article-title>Bertimbau: Pretrained bert models for brazilian portuguese</article-title>, in:
          <source>Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20-23, 2020, Proceedings, Part I 9</source>, Springer,
          <year>2020</year>, pp.
          <fpage>403</fpage>-<lpage>417</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>T.</given-names> <surname>Pires</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Schlinger</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Garrette</surname></string-name>,
          <article-title>How multilingual is multilingual bert?</article-title>,
          <source>arXiv preprint arXiv:1906.01502</source>
          (<year>2019</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>