<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sentiment Analysis of YouTube comments in Dravidian Code-Mixed Language using Deep Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>N Muhammad Fadil</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lavanya S K</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Intelligence</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Computational Intelligence SRM Institute of Science and Technology</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sentiment analysis is a method for determining whether a block of text is positive, negative, or neutral. It is often used to study public sentiment, for example in support of company growth. This study classifies the sentiment of comments and posts into pre-defined classes for the code-mixed languages Tamil, Malayalam, and Kannada. A Sequential deep learning model is applied to the code-mixed dataset to identify sentiments. The experiment was carried out using the dataset from the CodaLab 2022 competition "Shared Task on Sentiment Analysis and Homophobia Detection of YouTube Comments in Code-Mixed Dravidian Languages", which includes social media comments in code-mixed Tamil, Malayalam, and Kannada.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment</kwd>
        <kwd>Code-Mixed</kwd>
        <kwd>YouTube</kwd>
        <kwd>LSTM</kwd>
        <kwd>Keras</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        For the purpose of protecting social media users from cyberbullying, social media companies
have always been required to fund/contribute to sentiment analysis research. There have been
a number of studies examining models for sentiment analysis. Different fields utilize different
methodologies and models. However, few academic studies have examined the use of Emoji
characters on social media [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. When used out of context, emojis can drastically alter a
message. To successfully classify the comments in this experiment, we remove all emoji from
the dataset provided. Sentiment analysis at the word level studies the orientation of individual
words and phrases and how it influences the overall tone, whereas sentiment analysis at the
sentence level analyses sentences that reflect a single perspective and seeks to discern its
orientation. A lexicon-based method relies on a corpus or list of words with a particular
polarity. Then, an algorithm searches for these words, counts or estimates their weight, and
measures the text's overall polarity [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
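      <p>As a minimal sketch of the lexicon-based approach described above, the following Python snippet looks up each token in a small polarity lexicon, sums the weights, and maps the total to an overall polarity. The lexicon entries, their weights, and the function name lexicon_polarity are illustrative assumptions and are not taken from the cited works.</p>
      <preformat><![CDATA[
```python
# Hypothetical toy lexicon: word -> polarity weight (illustrative values only).
TOY_LEXICON = {
    "good": 1.0,
    "super": 1.0,
    "love": 2.0,
    "bad": -1.0,
    "worst": -2.0,
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the weights of known words and map the total to a polarity label."""
    tokens = text.lower().split()
    # Words that are not in the lexicon contribute nothing to the score.
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label
```
]]></preformat>
      <p>For example, lexicon_polarity("Super movie love it") sums 1.0 for "super" and 2.0 for "love" and labels the comment as positive, while a comment with no lexicon words is labelled neutral.</p>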
      <p>
        The number of studies on sentiment analysis of code-mixed text has grown in recent years [
        <xref ref-type="bibr" rid="ref10 ref5 ref6 ref7 ref8 ref9">5, 6, 7,
8, 9, 10</xref>
        ]. In a nation where multiple languages are spoken, code-mixing becomes widespread.
People in multilingual nations use code-mixed discourse when communicating online and in
person [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Sequence models are machine learning models that accept or produce data
sequences as input or output. Sequential data consists of text streams, audio and video
fragments, time-series data, and other types. Recurrent neural networks (RNNs) are commonly
employed in sequence modeling [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The study of discrete sequential data, including time
series, text phrases, and other sequential data, inspired the development of Sequence Models.
These models are better suited to manage sequential data, whereas Convolutional Neural
Networks are better suited to manage spatial data.
      </p>
      <p>
        In the current study, Dravidian languages are code-mixed with English, giving the language pairs
"Tamil-English", "Malayalam-English", and "Kannada-English" [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]. This dataset is part of
Task A of "Sentiment Analysis and Homophobia detection in YouTube comments." This study
classifies each YouTube comment into one of the following message-level categories:
"Positive", "Negative", "not-Tamil/Malayalam/Kannada", "unknown state", and
"Mixed feelings". On the supplied dataset, the Sequential model achieved
an accuracy of 0.53 for "Malayalam-English".
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <p>
        The dataset utilized in this study is provided by Task A of "Sentiment Analysis and
Homophobia detection in YouTube comments." It includes YouTube comments written in
Tamil, Malayalam, and Kannada (data for all 3 from [
        <xref ref-type="bibr" rid="ref10 ref11 ref8">8, 10, 11</xref>
        ]). The training dataset for Tamil
includes 35656 instances, the validation/development dataset includes 3962 instances, and the
test dataset includes 649 instances. "Positive," "Negative," "unknown state," "Mixed feelings,"
and "not-Tamil" are the classes. There are 15888 instances in the Malayalam training dataset,
1766 instances in the validation/development dataset, and 1962 instances in the test dataset.
"Positive," "Negative," "unknown state," "Mixed feelings," and "not-malayalam" are the
classes. There are 6212 instances in the Kannada training dataset, 691 instances in the
validation/development dataset, and 768 instances in the test dataset. "Positive", "Negative",
"unknown state", "Mixed feelings", and "not-Kannada" are the classes.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In this paper, a multi-class classification model is developed for sentiment analysis of YouTube
comments written in code-mixed Dravidian languages. To train a supervised classifier, each comment
in the dataset must be represented by a numerical feature vector.</p>
      <sec>
        <title>4.1 Task</title>
        <p>The objective is to classify each YouTube comment into one of five classes: "Positive",
"Negative", "unknown state", "Mixed feelings", and "not-Tamil/Malayalam/Kannada".</p>
      </sec>
      <sec id="sec-4-1">
        <title>4.2 Data Preprocessing</title>
        <p>The YouTube dataset is code-mixed and does not follow standard grammar. To
use the dataset effectively, the following preprocessing steps are applied.</p>
        <list list-type="bullet">
          <list-item><p>The texts are first converted to lowercase, and stemming and lemmatization are
performed.</p></list-item>
          <list-item><p>Next, all emojis, special characters, numbers, and punctuation are
removed because they were considered to carry no useful information.</p></list-item>
          <list-item><p>Comments of two letters or fewer were deleted, since they contribute little to the
dataset.</p></list-item>
          <list-item><p>The training dataset was then compiled: the cleaned text was tokenized and
encoded as a sequence of token indices.</p></list-item>
          <list-item><p>Finally, padding was applied so that all texts had equal length.</p></list-item>
        </list>
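        <p>The preprocessing steps above can be sketched with the Python standard library as follows. This is an illustrative sketch under stated assumptions, not the authors' notebook code: all function names are hypothetical, index 0 is assumed to be reserved for padding and index 1 for unknown tokens, and stemming and lemmatization are omitted because they would require language-specific tools.</p>
        <preformat><![CDATA[
```python
import unicodedata


def clean_comment(text):
    """Lowercase and keep only letters (plus marks attached to a letter)."""
    kept = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        if category.startswith("L"):
            kept.append(ch)
        elif category.startswith("M") and kept and kept[-1] != " ":
            # Combining marks (e.g. Tamil vowel signs) stay with their letter.
            kept.append(ch)
        else:
            # Emojis, digits, punctuation and whitespace all become a space.
            kept.append(" ")
    return " ".join("".join(kept).split())


def build_vocab(comments):
    """Assign an integer index to every token; 0 = padding, 1 = unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for comment in comments:
        for token in comment.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(comment, vocab):
    """Map each token to its index, using the unknown index as a fallback."""
    return [vocab.get(token, vocab["<unk>"]) for token in comment.split()]


def pad(sequences, max_len):
    """Truncate or right-pad every sequence with 0 so all have max_len items."""
    return [seq[:max_len] + [0] * (max_len - len(seq)) for seq in sequences]


def preprocess(raw_comments, max_len):
    cleaned = [clean_comment(c) for c in raw_comments]
    # Drop comments that contain two letters or fewer after cleaning.
    cleaned = [c for c in cleaned if len(c.replace(" ", "")) > 2]
    vocab = build_vocab(cleaned)
    encoded = [encode(c, vocab) for c in cleaned]
    return pad(encoded, max_len), vocab
```
]]></preformat>
        <p>In a full pipeline the labels of the dropped comments would have to be removed alongside the comments themselves, and the vocabulary built on the training set would be reused for the development and test sets.</p>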
      </sec>
      <sec id="sec-4-2">
        <title>4.3 Model</title>
        <p>A DNN was built for the sentiment analysis task. The input to the network came from
the embedding vectors. First, a word embedding layer was added to the model. Its input
length, embedding initializer, and embedding regularizer were set to the maximum
sequence length, "orthogonal", and an L2 regularizer, respectively. After adding the LSTM layer, we wrapped it with
Bidirectional; in Keras, bidirectionality is added to a recurrent layer with the
tf.keras.layers.Bidirectional wrapper. Finally, a Dense layer with the 'softmax' activation function
classified the data into five classes.</p>
        <p>The model was then compiled with a loss function, an optimizer, and metrics. We selected
"Categorical Cross-Entropy" as the loss function because the problem involves
multiclass classification. We used the Adam optimizer with a learning rate of 0.01
and tracked 'accuracy', 'precision', 'recall', and 'auc' as
metrics.</p>
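        <p>To make the output layer and the loss concrete, the following pure-Python sketch computes a softmax over five class scores and the categorical cross-entropy against a one-hot target. The function names, the example scores, and the small clipping constant are assumptions made for illustration; in the actual model these quantities are computed internally by Keras.</p>
        <preformat><![CDATA[
```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(
        t * math.log(max(p, eps))  # clip to avoid log(0)
        for t, p in zip(one_hot_target, probabilities)
    )
```
]]></preformat>
        <p>With five equal scores the softmax assigns a probability of 0.2 to each class, so the loss for any true class is ln 5, roughly 1.609. This is the value to expect from a five-class model that has not yet learned anything.</p>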
        <p>The model must then be trained so that its parameters produce the
required outputs for a given input. This is achieved by feeding inputs into the input layer,
obtaining an output, computing the loss from that output, and then updating the
model parameters via backpropagation. In this way, the parameters of the model are
fitted to the data. While fitting the model, the batch size was 256 and the number
of epochs was 2.</p>
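        <p>As a worked example of what these settings imply, the sketch below counts the parameter updates performed during training. It assumes that the full training-set sizes reported in Section 3 are used and that the last, smaller batch of each epoch is kept; the counts would be slightly lower if short comments are removed during preprocessing. The names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math

# Training-set sizes as reported in Section 3.
TRAIN_SIZES = {"Tamil": 35656, "Malayalam": 15888, "Kannada": 6212}
BATCH_SIZE = 256
EPOCHS = 2


def updates_per_run(num_examples, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Return (steps per epoch, total parameter updates) for one training run."""
    # ceil() because the final partial batch is assumed to be kept.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs
```
]]></preformat>
        <p>Under these assumptions the Tamil model receives 140 updates per epoch (280 in total), the Malayalam model 63 (126 in total), and the Kannada model 25 (50 in total).</p>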
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Implementation</title>
      <p>The notebook file imports all necessary modules and packages, such as TensorFlow, pandas,
NumPy, the regular expression module, the Natural Language Toolkit, and scikit-learn. Python's scikit-learn
library is used for feature extraction and model training. Using scikit-learn's
TfidfVectorizer, the text input is turned into TF-IDF feature vectors. The Tamil, Malayalam, and
Kannada training sets are used to train the sequential model, and the development set is used to
determine the accuracy of the model for each of the three languages.
The following table shows the accuracy, precision, recall, and f1-score for all three languages.</p>
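      <table-wrap>
        <caption>
          <p>Accuracy, precision, recall, and f1-score for each language, using the values reported in Section 6.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Language</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1-score</th></tr>
          </thead>
          <tbody>
            <tr><td>Tamil</td><td>0.47</td><td>0.40</td><td>0.515</td><td>0.41</td></tr>
            <tr><td>Malayalam</td><td>0.55</td><td>0.53</td><td>0.62</td><td>0.53</td></tr>
            <tr><td>Kannada</td><td>0.57</td><td>0.50</td><td>0.53</td><td>0.51</td></tr>
          </tbody>
        </table>
      </table-wrap>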
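      <p>The TF-IDF step described in this section can be illustrated with a small pure-Python sketch. It uses a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is assumed here to mirror the default behaviour of scikit-learn's TfidfVectorizer; the whitespace tokenization and the function name tfidf_vectors are simplifying assumptions.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (sorted vocabulary, one L2-normalized TF-IDF row per document)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {
        tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
        for tok in vocabulary
    }
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocabulary]
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocabulary, vectors
```
]]></preformat>
      <p>A token that occurs in every comment receives the lowest weight, while a token that occurs in only one comment is weighted more strongly in that comment's vector.</p>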
    </sec>
    <sec id="sec-6">
      <title>6. Results and Conclusion</title>
      <p>The table presents the weighted "macro" averages of each metric for all three languages.
For Tamil, a precision of 0.40, an accuracy of 0.47, a recall of 0.515, and an f1-score
of 0.41 were observed. Malayalam obtained a precision of 0.53, an accuracy of 0.55, a recall
of 0.62, and an f1-score of 0.53. Kannada obtained a precision of 0.50, an accuracy of 0.57, a
recall of 0.53, and an f1-score of 0.51. The same model was used for all three languages.
The Malayalam dataset had the highest precision, recall, and f1-score, although
the Kannada dataset had slightly higher accuracy than the Malayalam dataset. The Tamil dataset
performed worst with the model. However, it
should be noted that the 'Positive' class in the Tamil dataset has a disproportionately large
number of instances relative to the other classes, and this class imbalance likely reduced the precision.
The larger amount of training and development data relative to the other two languages may have also
contributed to the low performance on these metrics.</p>
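      <p>Because the text does not make clear whether the reported averages are macro or weighted averages, the following sketch computes both for the f1-score and shows how class imbalance separates them. The toy labels and predictions are invented for illustration and are not drawn from the dataset; the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def averaged_f1(y_true, y_pred):
    """Return (macro f1, support-weighted f1)."""
    scores = per_class_scores(y_true, y_pred)
    support = Counter(y_true)
    # Macro: every class counts equally, however rare it is.
    macro = sum(f1 for _, _, f1 in scores.values()) / len(scores)
    # Weighted: each class counts in proportion to its number of true instances.
    weighted = sum(scores[lab][2] * support[lab] for lab in scores) / len(y_true)
    return macro, weighted
```
]]></preformat>
      <p>For a toy set of eight "Positive" and two "Negative" comments, a classifier that always predicts "Positive" reaches an accuracy of 0.8 and a weighted f1-score of about 0.71, but a macro f1-score of only about 0.44, because the minority class is never predicted correctly.</p>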
      <p>In this research, we evaluated a DNN Sequential model on code-mixed text in all three languages.
Because the same model was used without language-specific changes, the method can in principle be applied to other languages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Preisendorfer</surname>
            ,
            <given-names>Matthew.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Social Media Emoji Analysis, Correlations and Trust Modeling</article-title>
          . 10.13140/RG.2.2.25466.18888.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>Rong</given-names>
          </string-name>
          &amp; Chersoni, Emmanuele &amp; Lu, Qin &amp; Huang, Chu-Ren &amp; Li, Wenjie &amp; Long, Yunfei.
          (
          <year>2021</year>
          ).
          <article-title>Lexical data augmentation for sentiment analysis</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          .
          <volume>72</volume>
          . 10.1002/asi.24493.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>Chenhao</given-names>
          </string-name>
          &amp; Lee, Lillian &amp; Tang, Jie &amp; Jiang, Long &amp; Zhou, Ming &amp; Li, Ping.
          (
          <year>2011</year>
          ).
          <article-title>User-level sentiment analysis incorporating social networks</article-title>
          . 10.1145/2020408.2020614.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Azeema</given-names>
            <surname>Sadia</surname>
          </string-name>
          , Fariha Khan and
          <string-name>
            <given-names>Fatima</given-names>
            <surname>Bashir</surname>
          </string-name>
          .
          <article-title>An Overview of Lexicon-Based Approach For Sentiment Analysis</article-title>
          .
          <source>2018 3rd International Electrical Engineering Conference (IEEC 2018)</source>
          , Feb,
          <year>2018</year>
          at IEP Centre, Karachi, Pakistan.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Jhanwar</surname>
            ,
            <given-names>Madan</given-names>
          </string-name>
          &amp; Das,
          <string-name>
            <surname>Arpita.</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>An Ensemble Model for Sentiment Analysis of Hindi-English Code-Mixed Data</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ansari</surname>
            ,
            <given-names>Mohammed Arshad</given-names>
          </string-name>
          &amp; Govilkar, Sharvari.
          (
          <year>2018</year>
          ).
          <article-title>Sentiment Analysis of Mixed Code for The Transliterated Hindi and Marathi Texts</article-title>
          .
          <source>International Journal on Natural Language Computing</source>
          .
          <volume>7</volume>
          . 10.5121/ijnlc.2018.7202.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Tho</surname>
            ,
            <given-names>Cuk</given-names>
          </string-name>
          &amp; Spits Warnars, Harco Leslie Hendric &amp; Soewito, Benfano &amp; Gaol, Ford.
          (
          <year>2020</year>
          ).
          <article-title>Code-Mixed Sentiment Analysis Using Machine Learning Approach - A Systematic Literature Review</article-title>
          . 1-6. 10.1109/ICICoS51170.2020.9299004.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Patra</surname>
            ,
            <given-names>Braja</given-names>
          </string-name>
          &amp; Das, Dipankar &amp; Das, Amitava.
          (
          <year>2018</year>
          ).
          <article-title>Sentiment Analysis of Code-Mixed Indian Languages: An Overview of SAIL_Code-Mixed Shared Task @ICON-2017</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ansari</surname>
            ,
            <given-names>Mohammed Arshad</given-names>
          </string-name>
          &amp; Govilkar, Sharvari.
          (
          <year>2018</year>
          ).
          <article-title>Sentiment Analysis of Mixed Code for the Transliterated Hindi and Marathi Texts</article-title>
          .
          <source>SSRN Electronic Journal</source>
          . 10.2139/ssrn.3429694.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Mishra</surname>
            ,
            <given-names>Pruthwik</given-names>
          </string-name>
          &amp; Danda, Prathyusha &amp; Dhakras,
          <string-name>
            <surname>Pranav.</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Code-Mixed Sentiment Analysis Using Machine Learning and Neural Network Approaches</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>Pradeep</given-names>
          </string-name>
          &amp; Kumar,
          <string-name>
            <surname>Abhinav.</surname>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Sentiment Analysis on Tamil Code-Mixed Text using Bi-LSTM.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Qaddoura</surname>
            ,
            <given-names>Raneem</given-names>
          </string-name>
          &amp; Al-Zoubi, Ala &amp; Faris, Hossam &amp; Almomani, Iman.
          (
          <year>2021</year>
          ).
          <article-title>A Multi-Layer Classification Approach for Intrusion Detection in IoT Networks Based on Deep Learning</article-title>
          .
          <source>Sensors (Basel, Switzerland)</source>
          .
          <volume>21</volume>
          . 10.3390/s21092987.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>Bharathi</given-names>
          </string-name>
          &amp; Stearns, Bernardo &amp; Arčan, Mihael &amp; Zarrouk, Manel &amp; McCrae, John &amp; Priyadharshini, Ruba &amp; Jayapal, Arun &amp; Sridarane, Sridevy.
          (
          <year>2019</year>
          ).
          <article-title>Multilingual Multimodal Machine Translation for Dravidian Languages utilizing Phonetic Transcription</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Mishra</surname>
            ,
            <given-names>Ankit</given-names>
          </string-name>
          &amp; Saumya, Sunil &amp; Kumar,
          <string-name>
            <surname>Abhinav.</surname>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Sentiment Analysis of Dravidian-CodeMix Language</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>Bharathi</given-names>
          </string-name>
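          &amp; Priyadharshini, Ruba &amp; Muralidaran, Vigneshwaran &amp; Suryawanshi, Shardul &amp; Jose, Navya &amp; Elizabeth, Sherly &amp; McCrae, John. (2020). Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text. 21-24. 10.1145/3441501.3441515.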
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>