<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Detection of Aggressive Tweets in Mexican Spanish Using Multiple Features with Parameter Optimization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Germán</forename><surname>Ortiz</surname></persName>
							<email>jortizb@iingen.unam.mx</email>
							<affiliation key="aff0">
								<orgName type="department">Instituto de Ingeniería</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Helena</forename><surname>Gómez-Adorno</surname></persName>
							<email>helena.gomez@iimas.unam.mx</email>
							<affiliation key="aff1">
								<orgName type="department">Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jorge</forename><surname>Reyes-Magaña</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Instituto de Ingeniería</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<country key="MX">México</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Universidad Autónoma de Yucatán</orgName>
								<address>
									<settlement>Mérida</settlement>
									<region>Yucatán</region>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gemma</forename><surname>Bel-Enguix</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Instituto de Ingeniería</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gerardo</forename><surname>Sierra</surname></persName>
							<email>gsierram@iingen.unam.mx</email>
							<affiliation key="aff0">
								<orgName type="department">Instituto de Ingeniería</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México</orgName>
								<address>
									<country key="MX">México</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Detection of Aggressive Tweets in Mexican Spanish Using Multiple Features with Parameter Optimization</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8367C9DAF928F574F582032C5ABF3178</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Aggressiveness detection</term>
					<term>Support Vector Machine</term>
					<term>Machine learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes our approach to Aggressiveness Identification in the MEX-A3T shared task, whose aim is the detection of aggressive tweets. The task is a binary classification of every tweet as aggressive or non-aggressive. We approached the problem using linguistically motivated features and several types of n-grams (words, characters, function words, punctuation symbols, among others). We trained a Support Vector Machine using a combinatorial framework that optimizes the results of the classifier. Our best run achieved an F1-score of 0.4549, which is the 5th best among the twenty-six runs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Aggressiveness is an emotional state that consists of feelings of hate and the desire to hurt a person or group of people physically or psychologically. Nowadays, communication through social networks plays a crucial role in social life. Social networking services open up a whole world of possibilities, but they also represent a significant threat, since users are exposed to many risks and attacks, among them aggressive comments, which can cause short-term and long-term harm to victims.</p><p>For the second year in a row, the MEX-A3T 2019 workshop <ref type="bibr" target="#b1">[2]</ref> launched the aggressiveness detection track for Mexican Spanish tweets, with the aim of promoting research on the analysis of social network content in this language. For this task, the organizers define an aggressive tweet as one that contains messages that despise or humiliate a person or group of people by means of nicknames, jokes, or derogatory adjectives. We take a machine learning approach in which the problem is framed as a binary classification between aggressive and non-aggressive tweets. We use a Support Vector Machine (SVM) as the classifier. For feature extraction, we use different types of n-grams (word n-grams, character n-grams, and skipgrams, among others).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>In recent years, the automatic detection of aggressive behavior in social media has been gaining a lot of attention.</p><p>Our approach is based on previous work on the detection of aggressive tweets in Mexican Spanish <ref type="bibr" target="#b5">[6]</ref> and on hate speech detection in Twitter <ref type="bibr" target="#b2">[3]</ref>, which were presented at the MEX-A3T 2018 Workshop <ref type="bibr" target="#b0">[1]</ref> and the SemEval-2019 Workshop, respectively. The former follows a classical machine learning approach, in which a logistic regression algorithm is trained on linguistically motivated features and various types of n-grams. The latter uses a Support Vector Machine as the classifier with a combinatorial framework for parameter optimization.</p><p>Regarding related work on aggressiveness detection, <ref type="bibr" target="#b7">[8]</ref> classifies Facebook comments using three deep learning architectures (Convolutional Neural Networks, Long Short-Term Memory networks, and Bi-directional Long Short-Term Memory networks) and a majority-voting ensemble method to combine them.</p><p>Djuric et al. <ref type="bibr" target="#b4">[5]</ref> used the generated list to annotate a publicly available corpus of more than 16k tweets. They analyzed the impact of various extra-linguistic features along with character n-grams for the detection of hate speech. They also built a dictionary based on the most indicative words in their data.</p><p>Chatzakou et al. <ref type="bibr" target="#b3">[4]</ref> studied the properties of bullies and aggressors, and found that stalkers post less frequently, participate in fewer online communities, and are less popular than users with standard patterns of behaviour. Their research shows that machine learning classification algorithms can detect users who exhibit bullying and aggressive behavior with more than 90% accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Corpus</head><p>The corpus was collected between August and November 2017. The training dataset has 7700 tweets, of which 35% are aggressive and 65% are non-aggressive; the texts and labels are kept in separate files.</p><p>Aggressive tweets contain at least one word considered vulgar or insulting according to a dictionary of Mexicanisms. The dataset was manually labeled by two annotators following the premise that an aggressive message intends to humiliate a person or group of people with jokes or derogatory adjectives.</p><p>In the corpus, all user handles were replaced by @USUARIO and all URLs were replaced by &lt;URL&gt;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Methodology</head><p>This section details the processing applied to the corpus before the classification task. This stage is important both for maximizing classifier performance and for simplifying data manipulation. The text representation features are also described.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Pre-processing</head><p>- Diacritics: These were removed to avoid composed symbols, which are a source of errors in informal texts.</p><p>- Text normalization: Tweets were converted to lowercase to avoid multiple variants of the same word across the corpus.</p><p>- Abbreviation replacement: Abbreviations, contractions, and slang were replaced by their full forms using a social-network-based dictionary <ref type="bibr" target="#b6">[7]</ref>.</p><p>- Emojis: These were removed.</p></div>
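<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pre-processing steps of Section 4.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the entries in ABBREVIATIONS, the code point ranges in EMOJI_RE, and all function names are hypothetical. Emojis are removed first here so that a token with an attached emoji can still be looked up in the dictionary; the paper does not state an order. Note that stripping combining marks also maps ñ to n.</p>
<p><code lang="python"><![CDATA[
```python
import re
import unicodedata

# Hypothetical mini-dictionary for illustration only; the paper relies on
# a social-network-based dictionary [7] whose contents are not shown here.
ABBREVIATIONS = {"q": "que", "xq": "porque", "tb": "tambien"}

# Rough emoji ranges (an assumption, not an exhaustive emoji definition).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF"
    "\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)


def strip_diacritics(text):
    """Decompose characters and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(tweet):
    """Remove emojis and diacritics, lowercase, and expand abbreviations."""
    text = EMOJI_RE.sub("", tweet)
    text = strip_diacritics(text).lower()
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```
]]></code></p>
<p>For example, under these assumptions preprocess("Xq NO vienes tb") yields "porque no vienes tambien". Keys in the abbreviation dictionary are written without diacritics and in lowercase because the lookup happens after those two steps.</p></div>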
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Classifier</head><p>We used the combinatorial framework µTC developed by <ref type="bibr" target="#b8">[9]</ref>. The framework approaches any text classification task as a combinatorial optimization problem: the search space contains all possible combinations of text pre-processing methods, text features, and weighting schemes with their respective parameters, and a local search-based metaheuristic explores this space for a configuration that produces a highly effective text classifier. On top of the combinations already implemented in µTC, we optimized the features described in Section 4.3. Once the best configuration was selected, we trained an SVM with a linear kernel.</p><p>In previous work <ref type="bibr" target="#b2">[3]</ref>, the features added to µTC were static: the feature sets not considered in the µTC framework were selected based on their individual classification performance and, once the best configuration was found, all n-gram types with all values of n were added to the final vector for each text. In our approach, by contrast, all features were included and optimized within the µTC framework, which selects only those feature sets that are likely to offer the best classifier performance.</p></div>
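<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of treating feature selection as a search over configurations can be illustrated with a small hill-climbing sketch. It does not use the µTC API and is not necessarily the metaheuristic that µTC implements; it is a generic local search in which a configuration is the set of enabled feature sets and a neighbor differs in exactly one of them. The feature-set names, the functions, and especially the weights in toy_score are hypothetical: in a real setting the score would be a cross-validated F1 obtained by training a classifier on each configuration.</p>
<p><code lang="python"><![CDATA[
```python
FEATURE_SETS = [
    "char_ngrams", "word_ngrams", "aggressive_ngrams",
    "skipgrams", "stopword_ngrams", "punct_ngrams",
]


def neighbors(config):
    """Yield configurations that toggle exactly one feature set."""
    for name in FEATURE_SETS:
        yield config ^ {name}  # symmetric difference keeps a frozenset


def hill_climb(score, start, max_steps=100):
    """Move to the best-scoring neighbor until no neighbor improves."""
    current, current_score = start, score(start)
    for _ in range(max_steps):
        best, best_score = current, current_score
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best == current:
            break  # local optimum reached
        current, current_score = best, best_score
    return current, current_score


def toy_score(config):
    """Stand-in for cross-validated F1.

    The integer points below are invented for the example and are NOT
    results from the paper.
    """
    points = {
        "char_ngrams": 30, "word_ngrams": 20, "punct_ngrams": 5,
        "skipgrams": -2, "stopword_ngrams": -1, "aggressive_ngrams": -3,
    }
    return sum(points[name] for name in config)
```
]]></code></p>
<p>Calling hill_climb(toy_score, frozenset()) starts from the empty configuration and enables one feature set per step while the toy score keeps improving. Replacing toy_score with a function that trains and evaluates a classifier gives the general shape of the optimization described above, at a much higher computational cost per step.</p></div>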
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Features</head><p>Besides the features already considered in the µTC framework, we took into account the additional features listed below:</p><p>- Character n-grams: These are powerful language-independent features for many natural language processing tasks in which many words are likely to be poorly written. In our approach, n varies from 3 to 5.</p><p>- Word n-grams: These features capture a word together with its possible neighbors. We vary n from 3 to 5.</p><p>- Aggressive-word n-grams: We manually collected a lexicon of aggressive words from the web, together with some words extracted from the training corpus. We vary n from 2 to 3.</p><p>- Skipgrams: We capture groups of 2 words with skips of 2 to 4 words.</p><p>- Stopword n-grams: We use the stopword list from the NLTK library to build them, with n varying from 2 to 4. Stopword frequencies are among the best features for detecting aggressive messages.</p><p>- Punctuation-symbol n-grams: These n-grams help to detect patterns in aggressiveness analysis. We vary n from 2 to 5 to build them.</p></div>
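<div xmlns="http://www.tei-c.org/ns/1.0"><p>The feature types of Section 4.3 can be illustrated with a few plain Python helpers. This is a sketch under stated assumptions rather than the extraction code used in the paper: the function names are hypothetical, a skip of k is read as exactly k tokens between the two words, and stopword, aggressive-word, and punctuation n-grams are read as n-grams over the sequence that remains after keeping only tokens from the corresponding vocabulary.</p>
<p><code lang="python"><![CDATA[
```python
def word_ngrams(tokens, n):
    """Contiguous sequences of n tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Contiguous sequences of n characters (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def skipgrams(tokens, skip):
    """Word pairs separated by exactly `skip` intervening tokens."""
    return [
        (tokens[i], tokens[i + skip + 1])
        for i in range(len(tokens) - skip - 1)
    ]


def filtered_ngrams(tokens, vocabulary, n):
    """N-grams over the tokens that belong to `vocabulary`.

    With a stopword list this gives stopword n-grams; with a lexicon of
    aggressive words or a set of punctuation symbols it gives the other
    two feature types (under the reading assumed in the lead-in).
    """
    kept = [tok for tok in tokens if tok in vocabulary]
    return word_ngrams(kept, n)
```
]]></code></p>
<p>For the tokens of "no me digas que no", word_ngrams with n = 3 returns "no me digas", "me digas que", and "digas que no". All helpers return an empty list when the input is shorter than the requested n-gram.</p></div>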
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>System performance in the aggressiveness detection track was measured using the F1-score on the aggressive class. Table <ref type="table">1</ref> shows the results of the best run on the training corpus with 10-fold cross-validation using static and optimized features (as described in Section 4.2), along with the official results of the evaluation phase on the test corpus. In the final configuration, besides the features already considered in µTC, only punctuation-symbol n-grams with n = 5 were used from our additional feature sets, while the other features were ignored.</p><p>The results we obtained in the 2019 edition were better than those from 2018: our F1-score improved from 0.4285 to 0.4549. The main difference was the use of µTC <ref type="bibr" target="#b8">[9]</ref>. This suggests that the combinatorial framework is a good complement that helps to optimize the feature set for the classification process.</p></div>
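<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the F1-score on a single class is the harmonic mean of precision and recall computed for that class only. The following sketch shows the computation; the function name and the label encoding (1 for aggressive, 0 for non-aggressive) are assumptions made for the example, not the official evaluation script of the task.</p>
<p><code lang="python"><![CDATA[
```python
def f1_positive(y_true, y_pred, positive=1):
    """F1-score for the `positive` class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive prediction; avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```
]]></code></p>
<p>Because only the aggressive class counts, correctly classified non-aggressive tweets do not raise the score, which makes this metric stricter than accuracy on a corpus where 65% of the tweets are non-aggressive.</p></div>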
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>We presented an approach for aggressiveness detection in Mexican Spanish tweets.</p><p>We trained a Support Vector Machine using the combinatorial framework µTC, to which we added different types of n-grams, such as punctuation-symbol n-grams, stopword n-grams, and aggressive-word n-grams, to be optimized. The results we obtained are better than those obtained last year, with an improvement from 0.4285 to 0.4549 in the F1-score on the aggressive class.</p><p>In addition, the results in this task were improved by optimizing the extra features added in previous work <ref type="bibr" target="#b2">[3]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Results of the aggressiveness detection task: training-phase results (Training) and official evaluation-phase results (Eval)</figDesc><table><row><cell>Position</cell><cell>Team</cell><cell>Training</cell><cell>Eval</cell></row><row><cell>1</cell><cell>INGEOTEC</cell><cell>-</cell><cell>0.4796</cell></row><row><cell>2</cell><cell>Casavantes</cell><cell>-</cell><cell>0.4790</cell></row><row><cell>3</cell><cell>GLP</cell><cell>-</cell><cell>0.4749</cell></row><row><cell>4</cell><cell>mineriaUNAM (optimized)</cell><cell>0.7438</cell><cell>0.4549</cell></row><row><cell>6</cell><cell>mineriaUNAM (static)</cell><cell>0.7433</cell><cell>0.4516</cell></row><row><cell>7</cell><cell>LyR</cell><cell>-</cell><cell>0.4288</cell></row><row><cell>8</cell><cell>Victor</cell><cell>-</cell><cell>0.4081</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Á</forename><surname>Álvarez-Carmona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Guzmán-Falcón</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-y-Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Escalante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Villaseñor-Pineda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Reyes-Meza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rico-Sulayes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Notebook Papers of 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL)</title>
				<meeting><address><addrLine>Seville, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-09">September 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Aragón</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Á</forename><surname>Álvarez-Carmona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-y-Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Escalante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Villaseñor-Pineda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moctezuma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</title>
				<meeting><address><addrLine>Bilbao, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-09">September 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">MineriaUNAM at SemEval-2019 Task 5: Detecting Hate Speech in Twitter using Multiple Features in a Combinatorial Framework</title>
		<author>
			<persName><forename type="first">L</forename><surname>Argota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reyes-Magaña</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bel-Enguix</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019)</title>
				<meeting>the 13th International Workshop on Semantic Evaluation (SemEval-2019)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Mean birds: Detecting aggression and bullying on Twitter</title>
		<author>
			<persName><forename type="first">D</forename><surname>Chatzakou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kourtellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Blackburn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>De Cristofaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Stringhini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vakali</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM on web science conference</title>
				<meeting>the 2017 ACM on web science conference</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="13" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Hate speech detection with comment embeddings</title>
		<author>
			<persName><forename type="first">N</forename><surname>Djuric</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grbovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Radosavljevic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bhamidipati</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th International Conference on World Wide Web</title>
				<meeting>the 24th International Conference on World Wide Web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="29" to="30" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A machine learning approach for detecting aggressive tweets in Spanish</title>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bel-Enguix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sierra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Quezada</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018)</title>
		<title level="s">CEUR WS Proceedings</title>
		<meeting>the 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Improving feature representation based on a neural network for author profiling in social media texts</title>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Posadas-Durán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Sanchez-Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chanona-Hernandez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational intelligence and neuroscience</title>
		<imprint>
			<biblScope unit="page">2</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Aggression detection in social media using deep neural networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Madisetty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sankar-Desarkar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying</title>
				<meeting>the First Workshop on Trolling, Aggression and Cyberbullying</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An automated text categorization framework based on hyperparameter optimization</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Tellez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moctezuma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Miranda-Jiménez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Graff</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.knosys.2018.03.003</idno>
		<ptr target="https://doi.org/10.1016/j.knosys.2018.03.003" />
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">149</biblScope>
			<biblScope unit="page" from="110" to="123" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
