<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Homophobia, Transphobia Detection in Tamil, Malayalam, English Languages using Logistic Regression and Code-Mixed Data using AWD-LSTM ⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>SIVAPRASATH S</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lavanya Sambath Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sajeetha Thavareesan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eastern University</institution>
          ,
          <country country="LK">Sri Lanka</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SRM Institute of Science and Technology</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the submission of the shared task “Homophobia, Transphobia Detection of YouTube Comments” organized by DravidianLangTech. Our team has participated in Task - B, which tries to identify the comments on youtube are Non-anti LGBTQ+ content or Homophobic or Transphobic in code-mixed(Tamil-English), Tamil, Malayalam, and English. We proposed the AWD-LSTM model for the code-mixed(Tamil-English) language data set and Logistic Regression for Tamil, Malayalam, and English languages. Our AWD-LSTM model achieved a 0.33 macro average F1 score for code-mixed(Tamil-English) language and Logistic Regression achieved a 0.55 macro average F1 score in the Tamil language, 0.98 macro average F1-score in the Malayalam language, 0.91 macro average F1-score in the English language.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Homophobia</kwd>
        <kwd>code-mix</kwd>
        <kwd>AWD-LSTM</kwd>
        <kwd>YouTube</kwd>
        <kwd>Logistic Regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Internet users have access to a vast amount of information that can be of great benefit to them
for a variety of reasons. Online social media platforms have been credited with igniting a
new phase of "misinformation" by sharing wrong or incorrect information with the intention
of misleading users. On the well-known application “YouTube”, users can create their own
accounts, upload videos, and leave comments [17]. In recent decades, as an increasing number
of people have gained access to the internet, it has become highly crucial to track their activity
in order to recognize behavior that is ofensive towards any particular group based on sex, or
sexual orientation to maintain the Internet’s inclusivity, support diversity in ideas and content,
and promote innovation. This task has 4 sub-tasks. First is Homophobia, Transphobia detection
in English Language, Second in Malayalam Language, third in Tamil Language and fourth in
code-mixed(Tamil-English) language. At first, English was the only language that could be
used for communication. We are fortunate to be able to communicate with one another in any
language. Because of this, many individuals now speak a hybrid language that combines English
with their own original language, also known as their mother tongue [13, 18, 19]. The practice
of changing between two or more diferent languages during communication is known as
codemixing. The most prevalent language is Hindi-English, but this challenge included code-mixed
texts Tamil and English [15]. Tamil and Malayalam are members of the Dravidian language
family. One of the most widely used social media sites is YouTube, where anyone may publish
information about anything without limits. Transphobic and homophobic content indicates
statements expressing hatred for lesbian, gay, transgender, and bisexual individuals. It results in
ofensive speech and generates major societal dificulties that can make online platforms toxic
and unpleasant for LGBT+ individuals. This information may contain controversial information
against LGBTQ+ members also. So, in order to prevent this happen again, we need to classify
the text and delete the comments which are transphobic and homophobic automatically. We
must categorize this data whether it is non-LGBTQ+ content or homophobic or Transphobic
since false information spreads among individuals more quickly than accurate news [20, 21].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>T.Mandl et.al has created a model for analyzing hate speech in German, Hindi, and English.
They employed a model called long-term short-term memory. They achieved a macro - F1 score
of 0.78 for English, 0.81 for Hindi, and 0.61 for German [12]. TF-IDF-oriented multi-view SVM
as well as keyword-based approaches. This method is superior to other neural methods in its
efectiveness. It employs multiple view stacked SVM(Support Vector Machines). This model got
0.80 accuracy in the Stormfront data [11].</p>
      <p>In their presentation on the detection of ofensive comments in the Manglish data set, Sara
Renjit and Sumam Mary Idicula use Keras embedding layers with LSTM layers. Ofensive and
non-ofensive comments were classified based on the embedding model. They have got 0.53
weighed average F1 score based on Keras embeddings and 0.48 weighed average F1 score based
on Doc2Vec [13]. A corpus for performing sentiment analysis using the Manglish language
was provided by B. R. Chakravarthi and colleagues. For Manglish(Code-mixed), they presented
a gold standard corpus. This corpus has got Krippendorf’s alpha of more than 0.8. Volunteers
were responsible for annotating the data set [3].</p>
      <p>A data set with Hindi and English(code-mixed) was provided by Prabhu et.al. They have
implemented LSTM architecture representation at the sub-word level. They have collected
comments from Facebook pages that are publicly available and created their dataset. They
achieved an F1-score of 0.658 in their dataset and an F1-score of 0.537 in the SemEval’13 data
set. They were able to attain an accuracy that was 4-5% higher than other methods [9]. Nsrin
Ashraf et.al employed an SVM model to detect homophobia and transphobia, and they used
TF-IDF to vectorize the data. They obtained weighed F1-scores and accuracy of 0.92 and 0.92
for Tamil respectively, 0.88 and 0.90 for Tanglish(code-mixed) respectively, and 0.91 and 0.94 for
English respectively [2].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data set description</title>
      <p>The data set for the YouTube comments in Tamil, Malayalam, English, Code -
Mixed(TamilEnglish) has been given by the organizer. There are three labels in each data set. External data
sets are not used, and labels are not given for any of the test data sets.</p>
      <p>The labels given in these data sets are,
• Non-anti LGBTQ+ content
• Homophobic
• Transphobic</p>
    </sec>
    <sec id="sec-4">
      <title>4. Text cleaning</title>
      <p>Some fundamental pre-processing operations have been done in this phase. The purpose of
this phase is to remove unnecessary information from the raw text. All the stop words, URLs,
symbols $,? @, =, etc, and emojis were removed.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Proposed methods</title>
      <p>For the code-mixed(Tamil-English) data set AWD-LSTM model is used. It is one of the most used
language models. It is used to predict and analyze the sequence, especially in natural language
processing. AWD-LSTM uses DropConnect, along with a few other well-known regularisation
algorithms, including a variation of Average-SGD (NT-ASGD) and average random gradient
descent method, are used. It introduced the variant SGD—ASGD. The ASGD algorithm employs
the same gradient update step as the SGD technique, but it delivers an average value rather
than the weight determined in the most recent iteration.</p>
      <p>For Tamil, Malayalam, and English languages, the Logistic Regression algorithm is used. The
logistic regression classifier builds a frequency dictionary mapping using a list of words as input.
After that, the logistic regression classifier will be trained while the cost function is minimized.
Following training, test data sets can be fed into the system to make predictions.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Implementation</title>
      <p>All the necessary modules and packages like Numpy, pandas, TensorFlow, RegEx, scikit learn
are imported.</p>
      <sec id="sec-6-1">
        <title>6.1. Code-Mixed(Tam-Eng)</title>
        <p>Data set is imported and it is pre-processed. In pre-processing, the upper cases were converted
to lower cases and unwanted columns were removed. We have used fast text word embeddings.
Using a code-mixed sentence piece tokenizer, all the text was tokenized. We have used a
pretrained AWD-LSTM model which is already trained on another dataset. There are seven layers,
they are BatchNorm 1d, Dropout, Linear, ReLU, BatchNorm 1d, Dropout, and Linear. The first
layer batchNorm 1d with a number of features 1200, the second layer Dropout with rate 0.2, the
third layer Linear with in_features 1200, out_features 50, the fourth layer ReLU, the fifth layer
BatchNorm 1d with a number of features 50, sixth layer Dropout with rate 0.1, seventh layer
Linear with in_features 50, out_features 3.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Tamil, Malayalam and English</title>
        <p>All the packages and modules were imported. The data set is imported and emojis and stop words
from the text were removed. All the special characters were removed and uppercase letters were
converted to lowercase letters. All the input texts were tokenized. Tfidf Vectorizer was used for
vectorizing the text. The Logistic Regression model was trained with class_weight=’balanced’,
C =5, and the model used for prediction.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Results and Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Code-Mixed(Tam-Eng)</title>
        <p>The metrics accuracy, precision, recall, and macro average F1-score have been calculated for
Code-mixed(Tam-Eng) languages. The accuracy of 0.87, macro precision of 0.35, recall of 0.34,
and macro average F1-score of 0.33 were recorded for Code-mixed(Tami-Eng) language. Due to
class imbalance in the data set, this model performs comparatively poorly. Macro average F1
scores of other teams with our team are shown in Table 1 for code-mixed(Tam-Eng) language.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Tamil, Malayalam and English</title>
        <p>The metrics accuracy, macro average precision, macro average recall, and macro average
F1score have been calculated for Tamil, Malayalam, and English languages.</p>
        <p>Macro average F1 scores of other teams with our team are shown in Table 2 for the Tamil
language.</p>
        <p>The accuracy of 0.55, macro average precision of 0.37, macro average recall of 0.49, macro
average F1-score of 0.31 were recorded for the Tamil language. Our model performed comparatively
better than other teams for the Tamil Language.</p>
        <p>Macro average F1 scores of other teams with our team are shown in Table 3 for the Malayalam
language.</p>
        <p>The accuracy of 0.98, macro average precision of 0.98, macro average recall of 0.96, and
F1-score of 0.97 were recorded for the Malayalam language. Our model’s performance is better
than other teams for the Malayalam language.</p>
        <p>Macro average F1 scores of other teams with our team are shown in Table 4 for the English
language.</p>
        <p>The accuracy of 0.91, macro average precision of 0.42, macro average recall of 0.42, and
F1-score of 0.42 were recorded for the English language. Our model performed comparatively
better than other teams for the English Language.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>This shared task aims at detecting homophobia, and transphobia detection of YouTube comments
on Tamil, Malayalam, English, and Code-mixed languages. There were total 9,9,10 and 10 teams
that participated in code-mixed, Tamil, Malayalam, and English languages respectively. We
have used the AWD-LSTM model for code-mixed(Tam-Eng) language and achieved an accuracy
of 0.87. For Tamil, Malayalam, and English languages we used Logistic regression and achieved
an accuracy of 0.55, 0.98, and 0.91 respectively. Our team’s macro average F1 score for the
Malayalam language is better compared with other teams. In the future, we will consider
improving the model’s performance by class balancing in the given data.
of Homophobia and Transophobia in Multilingual YouTube Comments. arXiv preprint
arXiv:2109.00227.
[5] Dahiya, A., Battan, N., Shrivastava, M., &amp; Sharma, D. M. (2019, August). Curriculum
Learning Strategies for Hindi-English Code-Mixed Sentiment Analysis. In International Joint
Conference on Artificial Intelligence (pp. 177-189). Springer, Cham.
[6] Fortuna, P., &amp; Nunes, S. (2018). A survey on automatic detection of hate speech in text.</p>
      <p>
        ACM Computing Surveys (CSUR), 51(
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), 1-30.
[7] Gupta, D., Tripathi, S., Ekbal, A., &amp; Bhattacharyya, P. (2017). SMPOST: parts of speech
tagger for code-mixed indic social media text. arXiv preprint arXiv:1702.00167.
[8] Jose, N., Chakravarthi, B. R., Suryawanshi, S., Sherly, E., &amp; McCrae, J. P. (2020, March). A
survey of current datasets for code-switching research. In 2020 6th international conference
on advanced computing and communication systems (ICACCS) (pp. 136-141). IEEE.
[9] Joshi, A., Prabhu, A., Shrivastava, M., &amp; Varma, V. (2016, December). Towards sub-word level
compositions for sentiment analysis of hindi-english code mixed text. In Proceedings of
COLING 2016, the 26th International Conference on Computational Linguistics: Technical
Papers (pp. 2482-2491).
[10] Kedia, K., &amp; Nandy, A. (2021). indicnlp@ kgp at DravidianLangTech-EACL2021: Ofensive
language identification in Dravidian languages. arXiv preprint arXiv:2102.07150.
[11] MacAvaney, S., Yao, H. R., Yang, E., Russell, K., Goharian, N., &amp; Frieder, O. (2019). Hate
speech detection: Challenges and solutions. PloS one, 14(8), e0221152.
[12] Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C., &amp; Patel, A. (2019,
December). Overview of the hasoc track at fire 2019: Hate speech and ofensive content
identification in indo-european languages. In Proceedings of the 11th forum for information
retrieval evaluation (pp. 14-17).
[13] Renjit, S., &amp; Idicula, S. M. (2020). CUSATNLP@ HASOC-Dravidian-CodeMix-FIRE2020:
      </p>
      <p>
        Identifying Ofensive Language from ManglishTweets. arXiv preprint arXiv:2010.08756.
[14] Shumugavadivel, Kogilavani and Subramanian, Malliga and Kumaresan, Prasanna Kumar
and Chakravarthi, Bharathi Raja and B, Bharathi and Chinnaudayar Navaneethakrishnan,
Subalalitha and S.K, Lavanya and Mandl, Thomas and Ponnusamy, Rahul and Palanikumar,
Vasanth and Balaji J, Manoj. Overview of the Shared Task on Sentiment Analysis and
Homophobia Detection of YouTube Comments in Code-Mixed Dravidian Languages. In
proceedings of dravidiancodemix-2022.
[15] Zhang, Y., Riesa, J., Gillick, D., Bakalov, A., Baldridge, J., &amp; Weiss, D. (2018). A fast,
compact, accurate model for language identification of codemixed text. arXiv preprint
arXiv:1810.04142.
[16] Zhu, Y., &amp; Zhou, X. (2020). Zyy1510@ HASOC-Dravidian-CodeMix-FIRE2020: An
Ensemble Model for Ofensive Language Identification. In FIRE (Working Notes) (pp. 397-403).
[17] Subramanian, M., Ponnusamy, R., Benhur, S., Shanmugavadivel, K., Ganesan, A., Ravi,
D., Shanmugasundaram, G.K., Priyadharshini, R. and Chakravarthi, B.R., 2022. Ofensive
language detection in Tamil YouTube comments by adapters and cross-domain knowledge
transfer. Computer Speech &amp; Language, 76, p.101404.
[18] Shanmugavadivel, K., Sampath, S.H., Nandhakumar, P., Mahalingam, P., Subramanian, M.,
Kumaresan, P.K. and Priyadharshini, R., 2022. An analysis of machine learning models for
sentiment analysis of Tamil code-mixed data. Computer Speech &amp; Language, p.101407.
[19] Chakravarthi, B.R., Hande, A., Ponnusamy, R., Kumaresan, P.K. and Priyadharshini, R.,
2022. How can we detect Homophobia and Transphobia? Experiments in a multilingual
code-mixed setting for social media governance. International Journal of Information
Management Data Insights, 2(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), p.100119.
[20] Chakravarthi, B.R., 2022. Hope speech detection in YouTube comments. Social Network
      </p>
      <p>
        Analysis and Mining, 12(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), pp.1-19.
[21] Chakravarthi, B.R., 2022. Multilingual hope speech detection in English and Dravidian
languages. International Journal of Data Science and Analytics, 14(
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), pp.389-406.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Gauravaroin@ HASOC-Dravidian-CodeMix-FIRE2020: pre-training ULMFiT on synthetically generated code-mixed data for hate speech detection</article-title>
          . arXiv preprint arXiv:
          <year>2010</year>
          .
          <year>02094</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ashraf</surname>
            , N.the , Taha,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Abd</given-names>
            <surname>Elfattah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , &amp;
            <surname>Nayel</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.</surname>
          </string-name>
          (
          <year>2022</year>
          , May).
          <article-title>Nayel@ lt-edi-acl2022: Homophobia/transphobia detection for Equaliiversity, and Inclusion using Svm</article-title>
          .
          <source>In Proceedings of the Second Workshop on Language Technology for Equality, Diversit the and Inclusion</source>
          (pp.
          <fpage>287</fpage>
          -
          <lpage>290</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          , Jose, N.,
          <string-name>
            <surname>Suryawanshi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sherly</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English</article-title>
          . arXiv preprint arXiv:
          <year>2006</year>
          .00210.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Priyadharshini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponnusamy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumaresan</surname>
            ,
            <given-names>P.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sampath</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thangasamy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nallathambi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <year>2021</year>
          . Dataset for Identification
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>