<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection on YouTube using Ensemble Machine Learning Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sunil Saumya</string-name>
          <email>sunil.saumya@iiitdwd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanshita Jha</string-name>
          <email>vanshitajha@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shankar Biradar</string-name>
          <email>shankar@iiitdwd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Central University of Rajasthan</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Information Technology Dharwad</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Internet users frequently express themselves through posts, comments, and articles, and the analysis of such content has recently attracted the attention of the research community. Sentiment analysis and the identification of homophobic comments are two key research areas in this field. Sentiment analysis determines whether a comment expresses positive, negative, or mixed feelings about a certain topic or article. Homophobia refers to a wide range of negative attitudes and feelings toward people who identify as homosexual, transgender, lesbian, gay, or queer. To encourage research in this direction, the organisers of the Dravidian LangTech shared task at FIRE 2022 set two tasks. Task A is a message-level polarity detection problem in which a system has to classify a given YouTube comment as expressing positive, negative, or mixed emotions. Task B involves detecting transphobic and homophobic YouTube comments. Our team participated in both tasks. For Task A, we worked on the Kannada dataset, and our best-performing model secured 11th place among the participating teams. For Task B, we participated in all four languages (Tamil, English, Malayalam, and Tanglish) and ranked 6th, 6th, 2nd, and 4th, respectively. In our proposed approach, we employed several machine learning models, ensemble methods, and deep learning models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social media websites, blogs, and microblogging sites have become very prominent in today’s
world, allowing people to share their thoughts and opinions on real-time events with ease.
These websites have also become a source of all kinds of information. Naturally, the same comments,
posts, and articles mean different things to different people across the world: a comment that
is welcome to some people may not be in the best interest of others. Hence, a single topic,
post, or issue can evoke a variety of emotions. These sentiments can be classified
as Positive, Negative, Mixed feelings, or Unknown state. Assigning each comment, post, or
article to one of these categories is known as sentiment analysis. Nowadays, sentiment analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
has become very important in fields such as marketing, the film industry, the gaming industry,
and e-commerce [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It helps companies gauge how people feel about a
particular product, identify customer needs, and understand the feedback that customers provide.
Sentiment analysis is applicable in almost all industries, where it can be used to
understand consumers’ sentiments and act accordingly.
      </p>
      <p>
        The LGBTQ+ community [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] refers to the group of people who identify as lesbian,
gay, bisexual, transgender, or queer, with the "+" covering all gender identities and sexual
orientations that are not specifically covered by the other five initials. Homophobia refers to
negative attitudes toward people who identify as homosexual, transgender, or queer. As a result of
homophobia and transphobia, LGBTQ+ people may face considerable psychological stress, which can
prevent them from participating in normal social activities and may potentially result in major
mental illness. Quick and effective detection and screening of homophobia and transphobia
on the Internet will therefore help to clean up cyberspace, create a pleasant and healthy online
community, and raise awareness of the unfair treatment of LGBTQ+ groups [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Several studies on sentiment analysis have been undertaken in recent years; however, most
of them have focused on high-resource languages such as English [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Relatively few
have addressed regional South Indian languages [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. To encourage
research on this topic, the DravidianLangTech organisers published data in South Indian languages
such as Kannada, Tamil, and Malayalam as part of FIRE 2022 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The shared task
comprised two subtasks: Task A focuses on sentiment analysis of Kannada
YouTube comments, and Task B focuses on recognising homophobic comments on social
media. Our team participated in both subtasks, ranking 11th on the Kannada sentiment task and
between 2nd and 6th on the four homophobia detection datasets. This
article provides the working notes for our proposed models.
      </p>
      <p>The rest of the article is organised as follows. Section 2 gives a brief overview of
existing work. Section 3 provides the details of the tasks and the dataset statistics.
Section 4 describes the models used for experimentation, Section 5 explains the results,
and Section 6 concludes the article.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background study</title>
      <p>Several studies on sentiment analysis and the moderation of homophobic content on social
media networks have been conducted; however, the majority have focused on high-resource
languages such as English. To organise the related work, we divide the background study into
two parts: Section 2.1 briefly describes models proposed for sentiment analysis,
and Section 2.2 describes models proposed for homophobic content moderation.</p>
      <sec id="sec-2-0">
        <title>2.1. Models proposed for Sentiment analysis</title>
        <p>[10] developed a novel framework for assessing the rating of internet reviews. The suggested
method detects polarity in online reviews by combining text processing and feature extraction
methods. The authors claim that their proposed strategy outperforms existing deep learning
methods. [11] used code-mixed text data from social media to identify sentiment. Their study
made use of two code-mixed datasets: English-Bengali and English-Hindi. They grouped the data
by the polarity of each statement: positive, negative, or neutral. A translation-
and transliteration-based transformer model was developed by [12] to detect hateful comments
on social media networks [13, 14, 15]. [16] presented a novel framework for predicting
discrepancies between Google App text comments and ratings using deep learning approaches. The
framework is divided into two phases. In the first phase, the polarity of reviews is predicted
using a sentiment analysis algorithm. In the second phase, star ratings are predicted from the
review text after deep learning models have been trained on the ground truth obtained in
the first phase.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Models proposed for Homophobic content detection</title>
        <p>To extract homophobic information from social media data, [17] first convert code-mixed text
to monolingual text using a data augmentation and transliteration-based approach. [18] used the
transformer-based XLM-RoBERTa to identify homophobic and transphobic content. A TF-IDF vectoriser
combined with an SVM model is used by [19] to identify homophobic content. A number of
monolingual and multilingual transformer models were evaluated together with data augmentation
by [20] for homophobia detection.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task and data description</title>
      <p>
        DravidianLangTech organised the shared task on sentiment analysis and homophobia
identification in YouTube comments [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [21]. The shared task included two different subtasks. Task A is
sentiment analysis in Kannada, Malayalam, and Tamil, for which we participated on the Kannada
dataset. Task B is the detection of homophobic texts in English, Tamil, Tamil-English (Tanglish), and
Malayalam. The aim of Task A was to classify code-mixed comments as expressing positive,
negative, or mixed feelings, or as not being in the intended language. The goal of Task B was
to classify code-mixed comments as homophobic, transphobic, or non-anti-LGBTQ+
content.
      </p>
      <p>The datasets for the competition were made available in phases. The training
and validation datasets for Task A and Task B were released first; the test data was made available
later. The data was collected from comments on popular YouTube channels. Each dataset contains two
fields: Text and Label. The complete statistics of the data we investigated in our work are
presented in Tables 1 and 2.</p>
      <sec id="sec-3-0">
        <title>4. Methodology</title>
        <p>We treated both sentiment analysis and homophobic and transphobic text detection as
multi-class classification problems. Several conventional machine learning models
and ensemble methods were used for this purpose. A detailed description of all the methods
is presented in the subsections below.</p>
      </sec>
      <sec id="sec-3-1">
        <title>4.1. Data cleaning and pre-processing</title>
        <p>The datasets were preprocessed before being fed into the models. The preprocessing was carried
out on the Text field. Numbers, punctuation, and symbols were removed from the text
because they do not help predict the label. We also removed extra white space and, finally,
lowercased the text to avoid redundant tokens. The cleaned texts were then tokenised and
encoded as sequences of token indices. All of this preprocessing was done with the
NLTK toolkit for Python. Furthermore, TF-IDF vectorisation over n-grams
was performed, and the vectorised data was used as input to the different models. We also applied
the Synthetic Minority Over-sampling Technique (SMOTE) to the vectorised data to balance the class
distribution.</p>
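        <p>The cleaning steps described above can be sketched as follows. This is a minimal illustration rather than the submitted implementation: it uses only the Python standard library instead of NLTK, tokenises on white space, and identifies numbers, punctuation, and symbols by their Unicode category so that the combining marks of Dravidian scripts are preserved. The function names and the example comments are hypothetical.</p>
        <preformat>
```python
import unicodedata


def clean_text(text):
    """Lowercase text and replace digits, punctuation, and symbols with spaces."""
    kept = []
    for ch in text:
        # First letter of the Unicode category: N = number, P = punctuation, S = symbol.
        if unicodedata.category(ch)[0] in ("N", "P", "S"):
            kept.append(" ")
        else:
            # Letters and combining marks are kept so Dravidian scripts stay intact.
            kept.append(ch)
    # split() with no argument collapses any run of white space.
    return " ".join("".join(kept).lower().split())


def build_vocabulary(cleaned_texts):
    """Map each distinct token to a positive index; 0 is reserved for unseen tokens."""
    vocabulary = {}
    for text in cleaned_texts:
        for token in text.split():
            vocabulary.setdefault(token, len(vocabulary) + 1)
    return vocabulary


def encode(text, vocabulary):
    """Turn a cleaned text into a list of token indices."""
    return [vocabulary.get(token, 0) for token in text.split()]


if __name__ == "__main__":
    # Made-up comments for illustration only.
    comments = ["Super!!! Movie 2022 :)", "movie   semma mass..."]
    cleaned = [clean_text(c) for c in comments]
    vocab = build_vocabulary(cleaned)
    print(cleaned)
    print([encode(c, vocab) for c in cleaned])
```
        </preformat>
        <p>Replacing removed characters with a space, rather than deleting them, keeps words separated when a comment contains punctuation without surrounding spaces.</p>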
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Classification Models</title>
        <p>We used different ensemble techniques and traditional machine learning classifiers in the
proposed approach to predict the labels. The following sections provide
details of each of these models.</p>
        <sec id="sec-3-2-1">
          <title>4.2.1. Conventional Machine learning classifiers</title>
          <p>Initially, we experimented with different conventional machine learning models, namely Logistic
Regression, the Passive Aggressive classifier, Support Vector Machine (SVM), Random Forest, and
Naïve Bayes, to classify the texts into their respective categories. We trained the models with the
default parameters provided by Python’s scikit-learn library. The input to all these models was
the TF-IDF vectors created from the cleaned text.</p>
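          <p>To make the input representation concrete, the following sketch computes TF-IDF weights over word n-grams using only the Python standard library. The n-gram range (unigrams and bigrams) is an assumption, since the range used in the experiments is not stated here; the smoothed inverse document frequency and L2 normalisation follow the default behaviour of scikit-learn's TfidfVectorizer. The function names and example texts are hypothetical.</p>
          <preformat>
```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams of length n_min to n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(documents, n_min=1, n_max=2):
    """Return (vocabulary, vectors); each vector is a dict mapping term to weight."""
    counts = [Counter(ngrams(doc.split(), n_min, n_max)) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for term_counts in counts:
        doc_freq.update(term_counts.keys())

    # Smoothed IDF, as if one extra document contained every term once.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = []
    for term_counts in counts:
        weights = {t: tf * idf[t] for t, tf in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalise so that comment length does not dominate the weights.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return sorted(idf), vectors


if __name__ == "__main__":
    vocabulary, vectors = tfidf_vectors(["super movie", "bad movie"])  # made-up texts
    print(vocabulary)
    print(vectors[0])
```
          </preformat>
          <p>In this example the term "movie" occurs in both texts and therefore receives a lower weight than "super" or "bad", which each occur in only one.</p>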
        </sec>
        <sec id="sec-3-2-2">
          <title>4.2.2. Ensemble Machine Learning method</title>
          <p>We employed an ensemble setup to improve on the performance of the classic machine
learning models. Three different ensemble approaches were used to classify the text: a gradient
boosting ensemble, a stacking ensemble, and a model selection ensemble. The stacking
ensemble used logistic regression, a k-nearest neighbours classifier, a decision tree classifier,
a Support Vector Machine (SVM), and a naive Bayes classifier as base learners. Logistic regression,
a random forest classifier, and SVM were employed in the model selection and gradient boosting
ensembles. TF-IDF vectors were used as the input to all of these models. The detailed architecture
of the proposed model is illustrated in Fig. 1.</p>
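          <p>A stacking ensemble with the base learners listed above could be assembled with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the submitted code: it uses scikit-learn's StackingClassifier with default parameters, logistic regression as the meta-learner, and unigram and bigram TF-IDF features converted to a dense array because Gaussian naive Bayes does not accept sparse input. The toy comments and labels are made up for illustration and are not taken from the shared-task data.</p>
          <preformat>
```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up comments for illustration only; not taken from the shared-task data.
TOY_TEXTS = [
    "super movie", "nice song", "great acting",
    "lovely video", "good story", "best scene",
    "bad movie", "boring song", "poor acting",
    "worst video", "weak story", "awful scene",
]
TOY_LABELS = ["Positive"] * 6 + ["Negative"] * 6


def build_stacking_model():
    """Five base learners whose out-of-fold predictions feed a logistic regression."""
    base_learners = [
        ("lr", LogisticRegression()),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC()),
        ("nb", GaussianNB()),
    ]
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(),
    )


def train_stacking(texts, labels):
    """Fit a TF-IDF vectoriser and the stacking model on the given texts."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # assumed n-gram range
    # GaussianNB needs dense input, so the sparse TF-IDF matrix is densified.
    features = vectorizer.fit_transform(texts).toarray()
    model = build_stacking_model()
    model.fit(features, labels)
    return vectorizer, model


if __name__ == "__main__":
    vectorizer, model = train_stacking(TOY_TEXTS, TOY_LABELS)
    new_features = vectorizer.transform(["super song"]).toarray()
    print(model.predict(new_features))
```
          </preformat>
          <p>Densifying the TF-IDF matrix is practical only for small vocabularies; on a larger corpus, a base learner that accepts sparse input or a dimensionality reduction step would be needed.</p>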
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>All experiments were conducted in the Keras and scikit-learn environments. We used the pandas
library to read the datasets and prepared them using Keras preprocessing methods
and the NLTK library. We trained our proposed models with K-fold cross-validation on the sentiment
and homophobia data provided by the task organisers; the value K=5 was selected through
experimental trials. Table 3 presents the results of the sentiment analysis
performed on the Kannada dataset, and Table 4 presents the homophobia detection results.</p>
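      <p>The K-fold procedure with K=5 can be illustrated with a short standard-library sketch. Whether the folds were shuffled or stratified by label is not stated above, so the shuffling and the fixed seed used here are assumptions, and the function name is hypothetical.</p>
      <preformat>
```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


if __name__ == "__main__":
    for fold_number, (training, validation) in enumerate(k_fold_indices(23), start=1):
        print(fold_number, len(training), len(validation))
```
      </preformat>
      <p>Each sample appears in exactly one validation fold, so every model is evaluated once on every comment while never being trained on the comments it is evaluated on.</p>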
      <p>For sentiment analysis on the Kannada dataset, the best model was the stacking ensemble, with an
accuracy of 0.515. It used Logistic Regression, a K-Neighbors classifier, a Decision Tree
classifier, SVM, and Gaussian Naive Bayes as base models and Logistic Regression as the
meta-learner. For homophobia detection, the best model differed across datasets. The stacking
ensemble, with the same base learners and meta-learner, produced the best results
on the Tamil dataset, with an accuracy of 0.762. Similarly,
the stacking ensemble gave the best results on the English dataset, with an accuracy
of 0.966. On the other hand, the Passive Aggressive classifier performed best on the Malayalam
dataset, with an accuracy of 0.927. The model chosen for the Tamil-English dataset was
gradient boosting, which achieved an accuracy of 0.891.</p>
      <p>The organisers used the weighted F1 score to evaluate the submitted models. Our
top-performing stacking ensemble model was ranked 11th on the Kannada dataset and 6th on both
the Tamil and English datasets. Similarly, the Passive Aggressive classifier and gradient boosting
performed best on the Malayalam and Tanglish data, ranking second and fourth, respectively.
Table 5 shows the final ranking of our proposed models among the participating teams. It
also includes the best F1 scores achieved among the participating teams.</p>
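      <p>For reference, the weighted F1 score averages the per-class F1 scores, weighting each class by its share of the true labels. The sketch below is a plain-Python illustration of that definition, not the organisers' evaluation script; the function name and the example labels are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its number of true samples."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        true_positive = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        predicted = sum(1 for p in y_pred if p == label)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / count
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (count / total) * f1
    return score


if __name__ == "__main__":
    # Made-up labels for illustration only.
    gold = ["Positive", "Positive", "Positive", "Negative"]
    predicted = ["Positive", "Positive", "Negative", "Negative"]
    print(round(weighted_f1(gold, predicted), 4))
```
      </preformat>
      <p>Because the weights follow the class distribution, the score is dominated by the majority class on imbalanced data, which is one reason accuracy and weighted F1 can lead to different rankings.</p>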
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and Future work</title>
      <p>In this work, we presented the models submitted by our team for sentiment analysis and
homophobic content identification in YouTube comments at the FIRE 2022 shared task. We
evaluated two kinds of models: conventional machine learning classifiers and ensemble setups
with machine learning classifiers as base learners. The experimental findings show that
ensemble models outperformed the individual baseline models on most datasets for both sentiment
analysis and homophobia detection. In future work, the performance of the proposed models could be
improved by using context-aware, domain-specific embeddings.</p>
      <p>[10] G. S. Budhi, R. Chiong, I. Pranata, Z. Hu, Using machine learning to predict the sentiment
of online reviews: a new framework for comparative analysis, Archives of Computational
Methods in Engineering 28 (2021) 2543–2566.</p>
      <p>[11] S. Ghosh, S. Ghosh, D. Das, Sentiment identification in code-mixed social media text, arXiv
preprint arXiv:1707.01184 (2017). doi:https://doi.org/10.48550/arXiv.1707.01184.</p>
      <p>[12] S. Biradar, S. Saumya, et al., Fighting hate speech from bilingual hinglish speaker’s
perspective, a transformer-and translation-based approach, Social Network Analysis and
Mining 12 (2022) 1–10.</p>
      <p>[13] S. Saumya, A. Kumar, J. P. Singh, Offensive language identification in dravidian code
mixed social media text, in: Proceedings of the first workshop on speech and language
technologies for Dravidian languages, 2021, pp. 36–45.</p>
      <p>[14] A. K. Mishra, S. Saumya, A. Kumar, Iiit_dwd@ hasoc 2020: Identifying offensive content
in indo-european languages, in: FIRE (Working Notes), 2020, pp. 139–144.</p>
      <p>[15] A. Kumar, S. Saumya, J. P. Singh, Nitp-ai-nlp@ hasoc-fire2020: Fine tuned bert for the
hate speech and offensive content identification from social media, in: FIRE (Working
Notes), 2020, pp. 266–273.</p>
      <p>[16] S. Sadiq, M. Umer, S. Ullah, S. Mirjalili, V. Rupapara, M. Nappi, Discrepancy detection
between actual user reviews and numeric ratings of google app store using deep learning,
Expert Systems with Applications 181 (2021) 115111.</p>
      <p>[17] B. R. Chakravarthi, A. Hande, R. Ponnusamy, P. K. Kumaresan, R. Priyadharshini, How
can we detect homophobia and transphobia? experiments in a multilingual code-mixed
setting for social media governance, International Journal of Information Management
Data Insights 2 (2022) 100119.</p>
      <p>[18] J. García-Díaz, C. Caparrós-Laiz, R. Valencia-García, Umuteam@ lt-edi-acl2022: Detecting
homophobic and transphobic comments in tamil, in: Proceedings of the Second Workshop
on Language Technology for Equality, Diversity and Inclusion, 2022, pp. 140–144.</p>
      <p>[19] N. Ashraf, M. Taha, A. Abd Elfattah, H. Nayel, Nayel@lt-edi-acl2022:
Homophobia/transphobia detection for equality, diversity, and inclusion using svm, in: Proceedings
of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, 2022, pp.
287–290.</p>
      <p>[20] V. Bhandari, P. Goyal, bitsa_nlp@lt-edi-acl2022: Leveraging pretrained language models
for detecting homophobia and transphobia in social media comments, arXiv preprint
arXiv:2203.14267 (2022).</p>
      <p>[21] K. Shumugavadivel, M. Subramanian, P. K. Kumaresan, B. R. Chakravarthi, B. B, S.
Chinnaudayar Navaneethakrishnan, L. S.K, T. Mandl, R. Ponnusamy, V. Palanikumar, M. Balaji J,
Overview of the Shared Task on Sentiment Analysis and Homophobia Detection of YouTube
Comments in Code-Mixed Dravidian Languages, in: Working Notes of FIRE 2022 - Forum
for Information Retrieval Evaluation, CEUR, 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Medhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Korashy</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis algorithms and applications: A survey</article-title>
          ,
          <source>Ain Shams engineering journal</source>
          <volume>5</volume>
          (
          <year>2014</year>
          )
          <fpage>1093</fpage>
          -
          <lpage>1113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Detection of spam reviews: a sentiment analysis approach</article-title>
          ,
          <source>CSI Transactions on ICT</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Makhmudah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bukhori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Putra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A. B.</given-names>
            <surname>Yudha</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis of indonesian homosexual tweets using support vector machine method</article-title>
          , in: 2019 International Conference on Computer Science, Information Technology, and Electrical Engineering (ICOMITEE), IEEE,
          <year>2019</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Moyano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. del Mar</given-names>
            <surname>Sanchez-Fuentes</surname>
          </string-name>
          ,
          <article-title>Homophobic bullying at schools: A systematic review of research, prevalence, school-related predictors and consequences</article-title>
          ,
          <source>Aggression and violent behavior</source>
          <volume>53</volume>
          (
          <year>2020</year>
          )
          <fpage>101441</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Ramadhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Goo</surname>
          </string-name>
          ,
          <article-title>Twitter sentiment analysis using deep learning methods</article-title>
          , in:
          <source>2017 7th International annual engineering seminar (InAES)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biradar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chauhan</surname>
          </string-name>
          ,
          <article-title>Combating the infodemic: Covid-19 induced fake news recognition in social media networks</article-title>
          ,
          <source>Complex &amp; Intelligent Systems</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biradar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <article-title>Iiitdwd@tamilnlp-acl2022: Transformer-based approach to classify abusive content in dravidian code-mixed text</article-title>
          ,
          in:
          <source>Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <article-title>Overview of the DravidianCodeMix 2021 shared task on sentiment detection in Tamil, Malayalam, and Kannada</article-title>
          ,
          in:
          <source>Forum for Information Retrieval Evaluation</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sampath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thangasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallathambi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Dataset for identification of homophobia and transophobia in multilingual youtube comments</article-title>
          ,
          <source>arXiv preprint arXiv:2109.00227</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>