<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Hate Speech Detection on Twitter Notebook for PAN at CLEF 2021</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Carolina</forename><surname>Martín-Del-Campo-Rodríguez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Instituto Politécnico Nacional (IPN)</orgName>
								<orgName type="department" key="dep2">Centro de Investigación en Computación (CIC)</orgName>
								<address>
									<addrLine>Juan de Dios Bátiz Avenue</addrLine>
									<postCode>07738</postCode>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Grigori</forename><surname>Sidorov</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Instituto Politécnico Nacional (IPN)</orgName>
								<orgName type="department" key="dep2">Centro de Investigación en Computación (CIC)</orgName>
								<address>
									<addrLine>Juan de Dios Bátiz Avenue</addrLine>
									<postCode>07738</postCode>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ildar</forename><surname>Batyrshin</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Instituto Politécnico Nacional (IPN)</orgName>
								<orgName type="department" key="dep2">Centro de Investigación en Computación (CIC)</orgName>
								<address>
									<addrLine>Juan de Dios Bátiz Avenue</addrLine>
									<postCode>07738</postCode>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Hate Speech Detection on Twitter Notebook for PAN at CLEF 2021</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">F3A697EAA9C065AFDC14756E01686117</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:44+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Hate speech</term>
					<term>Twitter</term>
					<term>SVM</term>
					<term>Deep Neural Network</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the use of social networks, the automatic detection of hate speech has become of great importance to prevent people, being protected by anonymity, from feeling free to discriminate against different groups. This document describes two approaches taken to detect hate speech by author: the first based on the individual processing of tweets by the author, which establishes a threshold of hate tweets to identify hate speech; the second based in the concatenation of tweets by author for processing.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The use of social networks has been increasing in recent years and the presence of hate speech has increased in the same way; so the study of this topic has gained attention. In <ref type="bibr" target="#b0">[1]</ref> different types of hate speech are pointed out. But, even with these categories, the task of hate speech detection is not easy, even for human, because this is intrinsically associated to relationships between groups, and also rely on language nuances <ref type="bibr" target="#b1">[2]</ref>. Hence, this task becomes even more difficult for computers, even though various approaches have been taken to try to solve this problem.</p><p>In <ref type="bibr" target="#b2">[3]</ref> authors focus in detection of Misogynistic Language on Twitter. They created five sub-categories to detect the phenomena: Discredit, Stereotype and Objectification, Sexual Harassment and Threats of Violence, Dominance, and Derailing. For the final model they used a token n-grams representation and SVM for classification.</p><p>In this paper we describe our approach to detect hate speech authors on Twitter applying Support Vector Machine and Deep Neural Networks, participating in the PAN 2021 <ref type="bibr" target="#b3">[4]</ref> task "Profiling Hate Speech Spreaders on Twitter" <ref type="bibr" target="#b4">[5]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Corpus Description</head><p>The corpus for the development phase was divided into two languages (English and Spanish), each language was made up of 200 authors, with 200 tweets each. The labels were at the author CLEF 2021 -Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania cm.del.cr@gmail.com (C. Martín-del-Campo-Rodríguez); sidorov@cic.ipn.mx (G. Sidorov); batyr1@gmail.com (I. Batyrshin) level, that is, there were no tags for each tweet. For the test phase, for each language there were 100 authors with 200 tweets each.</p><p>Links, user mentions, and hashtags were substitute, for all tweets, with #URL#, #USER# and #HASHTAG#, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data Preprocessing</head><p>The pre-processing steps for both languages were the same. Links, user mentions, hashtags, and retweets were removed. All non-English and non-Spanish characters, respectively, were removed. The occurrence of more than two consecutive characters (letters) was replaced by only two characters. All punctuation was remove and all numbers were substituted with 0. Emojis (emoticons) were decoded into their text equivalents. Then all the text was lowercase.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>Different approaches were followed for each language. For English, a manual labeling of the tweets was carried out, so each tweet was considered individually for the training. For Spanish, the original author labeling was used, concatenating all the tweets by author separated by a special token. Each approach is detailed in the next subsection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">English</head><p>A manual labeling of tweets from 45 authors was made by us, based in the original labeling: 23 authors labeled as hate speech and 22 authors as non-hate speech. From these tweets, all tweets labeled as hate speech were taken, 405 tweets; from the rest of the tweets, the same number of tweets was randomly selected in order to have a balanced training corpus. GloVe was used for the embeddings.</p><p>A deep neural network was used for the classification 1 , the architecture is defined in figure <ref type="figure" target="#fig_0">1</ref>. The activation function used for the inner dense layers was relu, for the last dense layer sigmoid was used. To avoid overfitting, two Dropouts 0.8 and 0.5 were used, as can be seen in Figure <ref type="figure" target="#fig_0">1</ref>. For the inner dense layers, a kernel regularizer l2 with a value of 0.0001 was set. For the last Dense layer a l2 activity regularizer was set with the same value. The model was configured with binary crossentropy for the loss, an Adam optimizer and using accuracy as metric.</p><p>A 10-Fold Cross-Validation was used in the training. Once the training was finished, the rest of the tweets in the corpus were classified. Next, the sum of Hate Speech Tweets (HST) by author was performed. Taking into account the original labeling (by author) and the HST number, a threshold was defined i.e., if an author had more than a certain HST number, this author was classified as a hate speech spreaders. The threshold was set to 30. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Spanish</head><p>All tweet per author were concatenated, using the token &lt;EOT&gt; as separation. For classification a SVM was used 2 . A linear kernel was used and a max number of iterations of 2000 was set (all the other parameters were left as default). For the training, 10-Fold Cross-Validation was used. Tokenization was perform with CountVectorizer, using the default values.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Accuracy was used as evaluation metric according to the specifications of the organizer of PAN 2021 and TIRA <ref type="bibr" target="#b5">[6]</ref>. For the training dataset the accuracy for English was 80.11% ± 3.41, for Spanish 79.12 ± 1.23, having around 79% of accuracy. For the testing phase, the result obtained for English was 65% and for Spanish was 77%, getting an average of 71%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>Using the manual tweets labeling approach not only flawed the criteria for detecting hate speech but also makes it impossible to recognize the full context by author thus, hate spreaders whose individual tweets do not represent hate speech, but in a general way constantly attack a specific group, cannot be identified with this approach. On the other hand, concatenate all the tweets allows to analyze the complete context by author, but makes processing difficult (200 tweets per author). The results obtained shows that the concatenate approach is better generalizing.</p><p>As future work, a more in-depth analysis using deep neural network with tweets concatenation and SVM with labeling of tweets is proposed. In the same way, using the concatenation approach analyzes how the use of mini batches of tweets by author is carried out, maintaining the author's labels, so that part of the context is preserved and processing is facilitated.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Specifications of the Neural Network model for English</figDesc><graphic coords="3,188.59,84.19,218.10,320.70" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The work was done with support of the Government of Mexico via CONACYT, SNI and the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Resources and benchmark corpora for hate speech detection: a systematic review</title>
		<author>
			<persName><forename type="first">F</forename><surname>Poletto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sanguinetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC 2020</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey on automatic detection of hate speech in text</title>
		<author>
			<persName><forename type="first">P</forename><surname>Fortuna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nunes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Automatic identification and classification of misogynistic language on twitter</title>
		<author>
			<persName><forename type="first">M</forename><surname>Anzovino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fersini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Natural Language Processing and Information Systems</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Silberztein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Atigui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Kornyshova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Métais</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Meziane</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="57" to="64" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Overview of PAN 2021: Authorship Verification,Profiling Hate Speech Spreaders on Twitter,and Style Change Detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bevendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">L D L P</forename><surname>Sarracén</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Manjavacas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayerl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wolska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zangerle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">12th International Conference of the CLEF Association (CLEF 2021)</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Profiling Hate Speech Spreaders on Twitter Task at PAN 2021</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">L D L P</forename><surname>Sarracén</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fersini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF 2021 Labs and Workshops</title>
		<title level="s">Notebook Papers</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">TIRA Integrated Research Architecture</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-22948-1_5</idno>
	</analytic>
	<monogr>
		<title level="m">Information Retrieval Evaluation in a Changing World, The Information Retrieval Series</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Peters</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
