<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">TADS@Dravidian-CodeMix-FIRE2020: Sentiment Analysis on CodeMix Dravidian Language</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Deepesh</forename><surname>Sharma</surname></persName>
							<email>deepeshsharma2017@iiitkottayam.ac.in</email>
							<affiliation key="aff0">
								<orgName type="institution">IIIT Kottayam</orgName>
								<address>
									<settlement>Kerala</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">TADS@Dravidian-CodeMix-FIRE2020: Sentiment Analysis on CodeMix Dravidian Language</title>
					</analytic>
					<monogr>
						<meeting>Forum for Information Retrieval Evaluation<address><settlement>Hyderabad</settlement><country key="IN">India</country></address>, <date from="2020-12-16" to="2020-12-20">December 16-20, 2020</date></meeting>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BB8F19B63EE6DEB6A70703423742FC65</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Sentiment Analysis</term>
					<term>Dravidian language</term>
					<term>Text Classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Sentiment analysis of social media has received much attention in recent research. Social media is expected to be among the biggest sources of big data in the coming years, so analysing the sentiment of social media content is important for regulating it. The FIRE 2020 organizers provided participants with annotated data-sets containing comments on YouTube videos in Malayalam and Tamil (including code-mixing). We approached the problem using classic machine learning algorithms for classification, i.e., SVM, Perceptron, and a logistic classifier.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We explore the field of natural language processing, the broad study of how computers can understand human-to-human communication and how machines analyze text using contextual information.</p><p>Code-mixing is a phenomenon where speakers switch between multiple languages in a single utterance <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Code-mixing is common in multilingual countries such as India <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. There is an increasing demand for sentiment analysis of social media texts, which are largely code-mixed <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. Sentiment analysis is the interpretation and classification of emotions (positive, negative, and neutral) within text data using text analysis techniques. Machine learning models trained on monolingual data tend to fail on code-mixed data. As internet usage grows, the amount of code-mixed multilingual data is increasing. The mixing of scripts in code-mixed text makes it even harder to use models trained on monolingual corpora <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. In this paper, we train classic machine learning algorithms on code-mixed multilingual data.</p><p>We present a model that can be used to find the sentiment of a given text. We want to classify the text as 'Positive', 'Negative', 'Mixed feelings', 'unknown state', or 'not-Tamil'.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data-set <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref></head><p>Malayalam is one of the Dravidian languages spoken in the southern region of India, with nearly 38 million speakers in India and other countries. Tamil is a Dravidian language natively spoken by the Tamil people of India and Sri Lanka. It is an official language of the South Indian state of Tamil Nadu and of two sovereign states, Sri Lanka and Singapore. For this shared task, the organizers provided a new gold-standard corpus for sentiment analysis of code-mixed text in Dravidian languages (Malayalam-English and Tamil-English). The data-set consists of YouTube comments, each labelled as one of the following: 'Positive', 'Negative', 'Mixed feelings', 'unknown state', 'not-Tamil'.</p><p>The distribution of each data-set is shown in Figure <ref type="figure" target="#fig_0">1</ref> (Malayalam) and Figure <ref type="figure" target="#fig_1">2</ref> (Tamil).</p><p>As the figures show, the data is very skewed, so we expect a simpler machine learning approach to generalize better.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Task Description</head><p>This is a message-level polarity classification task <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>. Given a YouTube comment, systems have to classify it into positive, negative, neutral, mixed emotions, or not in the intended languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment Setup</head><p>We experimented with three kinds of classic systems: an SVM classifier, a logistic classifier, and a Perceptron. We used the scikit-learn implementations of SVM, Logistic Regression, and Perceptron. Support Vector Machines are among the most successful classic machine learning models for text classification tasks. We used logistic regression with the multi-class option set to 'ovr' (one-vs-rest) for multi-class classification. A Perceptron is a single-layer neural network; stacking several such layers gives a multi-layer perceptron, a basic neural network. We used a grid search to find the best parameters for the SVM.</p><p>For text-to-vector conversion, we used the sklearn CountVectorizer. The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary. We trained models separately for the Tamil data-set and the Malayalam data-set.</p></div>
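<div xmlns="http://www.tei-c.org/ns/1.0"><p>The tokenize, build-vocabulary, and encode steps described above can be sketched in plain Python. This is a minimal illustration of the bag-of-words idea, not the scikit-learn CountVectorizer implementation and not the code used for the submission; the function names, the tokenization rule, and the example comments are assumptions made for this sketch.</p>
<p><code lang="python">
```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    # (an assumed rule chosen for this sketch).
    return re.findall(r"\w\w+", text.lower())


def build_vocabulary(documents):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(document, vocabulary):
    # Count known tokens; tokens outside the vocabulary are ignored.
    counts = Counter(tok for tok in tokenize(document) if tok in vocabulary)
    return [counts[tok] for tok in sorted(vocabulary, key=vocabulary.get)]


# Made-up comments, not taken from the shared-task data.
train_comments = ["super movie super song", "movie waste"]
vocab = build_vocabulary(train_comments)
vector = encode("super super trailer", vocab)
```
</code></p>
<p>In this sketch the vocabulary is fixed by the training comments, so a word seen only at test time (here "trailer") contributes nothing to the encoded vector. This is one reason a count-based model can struggle with the spelling variation typical of code-mixed text.</p></div>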
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results Analysis</head><p>This section presents the results of the evaluation of the three architectures. We compare the performance of the above machine learning architectures to select submissions for each language. Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.</p></div>
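<div xmlns="http://www.tei-c.org/ns/1.0"><p>The accuracy definition above can be written out directly. The sketch below is illustrative only: the labels are made up, and the majority-class baseline is an added assumption, included because on a skewed data-set a model that always predicts the most frequent label can already score well on accuracy.</p>
<p><code lang="python">
```python
from collections import Counter


def accuracy(y_true, y_pred):
    # Ratio of correct predictions to the total number of samples.
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_true):
    # Accuracy obtained by always predicting the most frequent label.
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for illustration only.
gold = ["Positive", "Negative", "Positive", "Mixed feelings"]
pred = ["Positive", "Positive", "Positive", "Mixed feelings"]
score = accuracy(gold, pred)
baseline = majority_baseline(gold)
```
</code></p>
<p>Comparing a reported accuracy against such a baseline gives a rough sense of how much a classifier has learned beyond the class distribution.</p></div>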
<div xmlns="http://www.tei-c.org/ns/1.0"><figure type="table"><table><row role="label"><cell>Model</cell><cell>SVM</cell><cell>Logistic Reg</cell><cell>Perceptron</cell></row><row><cell>Accuracy Score</cell><cell>0.63</cell><cell>0.677</cell><cell>0.614</cell></row></table></figure><p>For further analysis, we used the confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It is useful for computing recall, precision, specificity, and accuracy; the ROC curve is built from the same counts evaluated at varying decision thresholds.</p><p>In the confusion matrices in Figure <ref type="figure" target="#fig_2">3</ref> and Figure <ref type="figure" target="#fig_3">4</ref>, we can see that, because the data-set is unbalanced, many test cases were classified as negative 1.</p></div>
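<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of how per-class metrics are read off a confusion matrix, the sketch below builds one in plain Python and derives precision and recall for a single class. The labels and predictions are made up, the helper names are hypothetical, and this is not the code that produced the confusion-matrix figures.</p>
<p><code lang="python">
```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    # Rows are true labels, columns are predicted labels.
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


def precision_recall(matrix, index):
    # Per-class precision and recall from one column and one row.
    true_pos = matrix[index][index]
    predicted = sum(row[index] for row in matrix)
    actual = sum(matrix[index])
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Made-up, deliberately unbalanced example.
labels = ["Positive", "Negative"]
gold = ["Positive", "Positive", "Positive", "Negative"]
pred = ["Positive", "Positive", "Negative", "Positive"]
matrix = confusion_matrix(gold, pred, labels)
neg_precision, neg_recall = precision_recall(matrix, labels.index("Negative"))
```
</code></p>
<p>In this toy example the overall accuracy is 0.5, yet the minority class has zero precision and recall, which is the kind of effect an unbalanced data-set can hide behind a single accuracy figure.</p></div>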
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we have described how we trained classic machine learning algorithms for sentiment classification. These simple algorithms were fast to train and set a baseline for further research. In future work, we can train more complex deep learning models, but doing so will need a more balanced data-set.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Malayalam data.</figDesc><graphic coords="2,89.29,85.88,166.68,127.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Tamil data.</figDesc><graphic coords="2,339.31,84.19,166.68,128.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Malayalam test confusion matrix.</figDesc><graphic coords="3,89.29,84.19,166.68,126.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Tamil test confusion matrix.</figDesc><graphic coords="3,339.31,84.19,166.68,126.60" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey of current datasets for code-switching research</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suryawanshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sherly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="136" to="141" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Named entity recognition for code-mixed Indian corpus using meta embedding</title>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vegupatti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="68" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stearns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jayapal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arcan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zarrouk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W19-6809" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation</title>
				<meeting>the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="56" to="63" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">WordNet gloss translation for under-resourced languages using multilingual neural machine translation</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arcan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W19-7101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, European Association for Machine Translation</title>
				<meeting>the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, European Association for Machine Translation<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A sentiment analysis dataset for code-mixed Malayalam-English</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suryawanshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sherly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.sltu-1.25" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association</title>
				<meeting>the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="177" to="184" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Corpus creation for sentiment analysis in code-mixed Tamil-English text</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Muralidaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.sltu-1.28" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association</title>
				<meeting>the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="202" to="210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Comparison of different orthographies for machine translation of under-resourced Dravidian languages</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arcan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2nd Conference on Language, Data and Knowledge (LDK 2019)</title>
				<imprint>
			<publisher>Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Leveraging orthographic information to improve machine translation of under-resourced languages</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>NUI Galway</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arcan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2008.01391</idno>
		<title level="m">A survey of orthographic information in machine translation</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Muralidaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suryawanshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sherly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of the Forum for Information Retrieval Evaluation</title>
				<meeting><address><addrLine>FIRE; Hyderabad, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>CEUR-WS.org</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Muralidaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suryawanshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sherly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE &apos;20</title>
				<meeting>the 12th Forum for Information Retrieval Evaluation, FIRE &apos;20</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
