<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PITS@Dravidian-CodeMix-FIRE2020: Traditional Approach to Noisy Code-Mixed Sentiment Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikita Kanwar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Megha Agarwal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajesh Kumar Mundotiya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology (BHU)</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pratap Institute of Technology &amp; Science</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sentiment Analysis (SA) is the process of characterizing a response or opinion to decide the sentiment polarity of a text. Nowadays, social media is a common platform for conveying opinions, suggestions and much more, either in a user's native language or multilingually, often in the Roman script for ease of typing. In this task, Malayalam-English and Tamil-English code-mixed datasets in the Roman script were provided for SA. To solve this task, we generated syntax-based features and trained a logistic regression model combined with an under-sampling technique. We obtained best F1-scores of 0.71 and 0.62 on the blind test sets of the Malayalam-English and Tamil-English code-mixed datasets, respectively. The code is available on GitHub.</p>
      </abstract>
      <kwd-group>
        <kwd>Malayalam-English code-mixed</kwd>
        <kwd>Tamil-English code-mixed</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Logistic Regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sentiment Analysis (SA) is the process of characterizing a response or opinion, which determines the sentiment polarity of a text. In the last few years, social media platforms such as Facebook, Twitter and YouTube have grown enormously, and in turn produce large volumes of textual data as people express their feelings and opinions in reviews and comments.</p>
      <p>
        A large number of texts on such platforms are written in either the user’s native language,
in English, or in a mixture of both. According to Myers-Scotton (1993), code-mixing is defined as –
“The interchangeable use of linguistic units, like morphs, words, and phrases from one language
to another language while conversation (both speaking and writing)” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Code-mixed text
does not follow formal grammar or even a consistent writing script; the choice of script
depends entirely on the user. Hence, it can be directly stated that traditional approaches to
SA do not provide an effective solution. The problem becomes even more complicated
when code-mixed text is multilingual, which is what most Indian users produce nowadays.
Consequently, the processing of code-mixed text has been gaining increasing attention and
interest in the NLP community [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6">2, 3, 4, 5, 6, 7, 8</xref>
        ].
      </p>
      <p>In this paper, we apply a traditional approach with syntactic features to SA of such code-mixed texts of the Dravidian languages (Malayalam and Tamil) mixed with English.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Over the past few years, social media has seen a surge of code-mixed or code-switched text
after gaining multilingual support. Code-mixed SA has therefore become a significant problem.
It has been addressed by two different families of methods, namely lexicon-based and machine learning-based [9].
Sharma et al. [10], Pravalika et al. [11] and Baccianella et al. [12] used lexicon-based approaches
for SA on code-mixed datasets. In contrast, traditional machine learning approaches such as
Naive Bayes, Support Vector Machines, Decision Trees and many more, with hand-crafted (Sarkar
[13], Baccianella et al. [12]) and syntactic features (Chakravarthi et al. [7, 8], Remmiya Devi
et al. [14], Kouloumpis et al. [15]), provided significant results on code-mixed datasets.</p>
      <p>Nowadays, however, many researchers approach this problem with deep
learning methods, which are also able to capture computational aspects to some extent [9].
Mishra et al. [16] built a multi-layer perceptron and a bidirectional long short-term memory
(LSTM) network with GloVe embeddings to perform SA on Hindi-English and Bengali-English datasets.
Joshi et al. [17] leveraged subword information in a deep learning-based model
on a Hindi-English dataset, which was later extended by Mukherjee [18]. In this extension,
the author used an LSTM followed by a Convolutional Neural Network (CNN) to jointly
learn word- and character-level features. These distributional representations
capture semantic information only to a limited extent, i.e., within the window size of word embeddings or up to a
certain sentence length for recurrent neural network variants, owing to the vanishing gradient
problem. Nowadays, contextual word embedding techniques are prominent in
this setting: BERT and ELMo are contextual word embeddings that are setting new baselines
for SA. However, the performance of these embeddings with deep learning models suffers
on code-mixed datasets [7, 8].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Features and Technique</title>
      <p>We generated 40,000 syntactic features from a combination of word- and character-based
information, leveraged through the n-gram technique. On each word of the
text, unigrams and bigrams were generated without removing any stop words or stemming.
Similarly, n-grams were also generated at the character level, keeping the word boundary
in mind. These n-gram features were encoded with Term Frequency-Inverse Document
Frequency (TF-IDF) to generate the feature space, where each YouTube comment was treated
as a document.</p>
      <p>This feature space was used to train a logistic regression model with an L1 regularizer.
However, the provided datasets are imbalanced, as shown in Figure 1; hence we used an
under-sampling technique, Tomek links [19]. A Tomek link is found by computing distances among
class-wise samples with a nearest-neighbour search: a pair of nearest neighbours carrying
different class labels forms a link, and the member belonging to the majority class is removed.</p>
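      <p>Tomek-link removal can be sketched directly with scikit-learn's nearest-neighbour search. This is a hand-rolled illustration of the idea, assuming distinct sample points; in practice a library such as imbalanced-learn provides a ready-made TomekLinks resampler, and the toy data below is invented.</p>

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_undersample(X, y, majority_label):
    """Drop majority-class members of Tomek links (mutual nearest
    neighbours that carry different class labels)."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)   # idx[:, 0] is each point itself
    nearest = idx[:, 1]         # nearest *other* sample
    keep = np.ones(len(y), dtype=bool)
    for i, j in enumerate(nearest):
        # mutual nearest neighbours with different labels form a Tomek link
        if nearest[j] == i and y[i] != y[j] and y[i] == majority_label:
            keep[i] = False     # remove only the majority-class member
    return X[keep], y[keep]
```

For example, a majority-class point sitting right next to a minority-class point is a borderline pair, and only its majority-class member is discarded.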
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The datasets used in this experiment were collected from YouTube comments in
Malayalam-English and Tamil-English code-mixed text in the Roman script, containing 6,739 and
15,744 texts, respectively. Both code-mixed datasets exhibit tag switching as well as intra- and
inter-sentential switching [7, 8]. The division of the datasets into training, validation and testing is
summarized in Table 2. The datasets are annotated with five categories, namely Positive,
Negative, Mixed-feeling, Unknown-state, and Not Malayalam or Not Tamil. The
distribution of categories in the combination of the provided training and validation data is imbalanced,
as shown in Figure 1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Settings</title>
        <p>Traditional machine learning approaches are a prominent way of providing robust solutions
on scarce datasets. Hence we experimented with combinations of features and classification
techniques. Before generating the features, we cleaned the dataset by removing emojis and
smileys with tweet-preprocessor (https://pypi.org/project/tweet-preprocessor/); in our experiments,
this cleaning degraded the model performance. The generated features are word length, character
and word n-grams, word repetitions, word count and presence of punctuation, used with different
classification techniques, namely Decision Tree, Support Vector Machine, CatBoost, XGBoost and Logistic Regression.</p>
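        <p>The emoji and smiley cleaning step can be illustrated with a small regex-based stripper. This is a hypothetical stand-in for the tweet-preprocessor call, covering only common cases; the patterns and example strings below are our own.</p>

```python
import re

# rough Unicode emoji ranges plus simple ASCII emoticons (illustrative only)
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]+")
SMILEY = re.compile(r"[:;=]-?[()DPp]")

def clean(text):
    """Remove emojis and smileys, then squeeze leftover whitespace."""
    text = EMOJI.sub("", text)
    text = SMILEY.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```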
        <p>Logistic Regression with the under-sampling technique provided the best results on the
validation dataset, using the default parameter values of the sklearn library and the word-
and character-based n-gram features; for Tamil-English, however, under-sampling showed no effect on
model performance. We considered bigrams for word-level features and bigrams to six-grams
for character-level features. Of the yielded features, 1,000 word-level and 30,000
character-level features were used.</p>
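        <p>Under the settings just described, the model can be sketched as a single scikit-learn pipeline. This is a sketch under the stated hyperparameters (bigram words capped at 1,000 features, character 2- to 6-grams capped at 30,000, L1-regularized logistic regression); the toy training texts and labels are invented.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

model = Pipeline([
    ("features", FeatureUnion([
        # word-level bigrams, capped at 1,000 features
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                                 max_features=1000)),
        # character 2- to 6-grams within word boundaries, capped at 30,000
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6),
                                 max_features=30000)),
    ])),
    # L1 regularizer; the liblinear solver supports the L1 penalty
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])
```

Calling model.fit(train_texts, train_labels) and then model.predict(new_texts) runs the whole chain of vectorization and classification.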
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <p>After applying logistic regression to the TF-IDF-encoded word and character n-gram
features, we obtained F1-scores of 0.69 and 0.71 on the validation and blind test
sets of Malayalam-English, and 0.64 and 0.62 on those of Tamil-English, respectively. The evaluations in
Table 2 consider three different metrics, namely Precision, Recall, and F1-score. From empirical
observation of the obtained results, we found that our model correctly classified most of the
relevant categories. The category-wise scores on the validation datasets are given in Table 3.
From these, we observe that the model has difficulty learning the “Mixed_feeling”
category, hence its score is lower than those of the other categories. In both datasets,
our model is most confused between the “Mixed_feelings” and “unknown_state” categories; most
samples of these categories are predicted as “Positive” in the validation datasets, as shown in the
confusion matrices (Figure 2) in the Appendix.</p>
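      <p>The reported metrics and confusion matrices can be computed with scikit-learn. The sketch below assumes weighted averaging over the five categories; the labels and predictions are invented for illustration.</p>

```python
from sklearn.metrics import confusion_matrix, f1_score

labels = ["Positive", "Negative", "Mixed_feelings", "unknown_state"]
y_true = ["Positive", "Negative", "Mixed_feelings", "Positive", "unknown_state"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Positive"]

# weighted F1 averages the per-class F1 scores by class support
f1 = f1_score(y_true, y_pred, average="weighted")
# rows are true classes, columns predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
```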
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper shows that logistic regression with an under-sampling technique achieves comparable
metric scores on the code-mixed sentiment analysis datasets of Malayalam-English and
Tamil-English. The technique relies on word- and character-level features and yields
F1-scores of 0.71 and 0.62 on the blind test sets of the respective datasets. From empirical
observation of the obtained results on the validation set, we found that this technique correctly
classifies most of the relevant categories.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We are very thankful for the Google Colaboratory open-access server used to perform these experiments.</p>
      <p>[7] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
[8] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[9] O. Habimana, Y. Li, R. Li, X. Gu, G. Yu, Sentiment analysis using deep learning approaches: an overview, Science China Information Sciences 63 (2020) 1–36.
[10] S. Sharma, P. Srinivas, R. C. Balabantaray, Text normalization of code mix and sentiment analysis, in: 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2015, pp. 1468–1473.
[11] A. Pravalika, V. Oza, N. Meghana, S. S. Kamath, Domain-specific sentiment analysis approaches for code-mixed social network data, in: 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), IEEE, 2017, pp. 1–6.
[12] S. Baccianella, A. Esuli, F. Sebastiani, SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining, in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta, 2010. URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/769_Paper.pdf.
[13] K. Sarkar, Ju_ks@sail_codemixed-2017: Sentiment analysis for Indian code mixed social media texts, arXiv preprint arXiv:1802.05737 (2018).
[14] G. Remmiya Devi, P. Veena, M. Anand Kumar, K. Soman, Amrita-cen@fire 2016: Code-mix entity extraction for Hindi-English and Tamil-English tweets, in: CEUR Workshop Proceedings, volume 1737, 2016, pp. 304–308.
[15] E. Kouloumpis, T. Wilson, J. Moore, Twitter sentiment analysis: The good the bad and the omg!, in: Fifth International AAAI Conference on Weblogs and Social Media, Citeseer, 2011.
[16] P. Mishra, P. Danda, P. Dhakras, Code-mixed sentiment analysis using machine learning and neural network approaches, arXiv preprint arXiv:1808.03299 (2018).
[17] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 2482–2491. URL: https://www.aclweb.org/anthology/C16-1234.
[18] S. Mukherjee, Deep learning technique for sentiment analysis of Hindi-English code-mixed text using late fusion of character and word features, in: 2019 IEEE 16th India Council International Conference (INDICON), IEEE, 2019, pp. 1–4.
[19] I. Tomek, Two modifications of CNN, IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976) 769–772.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Confusion Matrix</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Myers-Scotton</surname>
          </string-name>
          ,
          <article-title>Common and uncommon ground: Social and structural factors in codeswitching</article-title>
          ,
          <source>Language in Society</source>
          (
          <year>1993</year>
          )
          <fpage>475</fpage>
          -
          <lpage>503</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          ,
          <source>FIRE '20</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020)</article-title>
          . CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rudra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rijhwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Begum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter?</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Austin, Texas,
          <year>2016</year>
          , pp.
          <fpage>1131</fpage>
          -
          <lpage>1141</lpage>
          . URL: https://www.aclweb.org/anthology/D16-1121. doi:10.18653/v1/D16-1121.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motlani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mamidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Shallow parsing pipeline - Hindi-English code-mixed social media text, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , San Diego, California,
          <year>2016</year>
          , pp.
          <fpage>1340</fpage>
          -
          <lpage>1345</lpage>
          . URL: https://www.aclweb.org/anthology/N16-1159. doi:10.18653/v1/N16-1159.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>