<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling Hate Spreaders using word N-grams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Alcañiz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Andrés</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
<p>With the rise of social media over the last decade, the amount of content published every day on the internet has become enormous. Unfortunately, as the amount of published content grows, so does the amount of hate speech that can be found on social media. This fact motivates the creation of systems that can automatically detect these undesired behaviors in order to report them to the competent authorities. With this purpose, we have developed a system that detects users who could be considered hate spreaders, employing a TF-IDF vectorizer in combination with an SVM and achieving an accuracy of 81% over the Spanish dataset and 69% over the English dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Hate speech is commonly defined as the dissemination of ideas based on the superiority of a
group of people because of their race, color or ethnic origin. This problem is not novel, as
it has been present in our society for centuries, but due to the rise of social media it has
reached unprecedented levels. Given the huge amount of content generated by users and
the impossibility of manually checking all of it, the automatic detection of hate speech has
become a relevant task.</p>
<p>With this purpose, the aim of this competition is to automatically detect hate speech, but
from an author-profiling perspective rather than at the level of individual tweets. Therefore, we are
interested in detecting which users could be considered hate spreaders.</p>
      <p>
        This paper presents our participation in the Author Profiling task at PAN [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for detecting hate
spreaders [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our method follows the ideas presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
], which focused on employing
character and word n-grams as features and an SVM as classifier. Moreover, we have tried
different classifiers and compared their accuracies. The rest of this
paper is structured as follows: Section 2 describes the dataset used for this shared task, Section
3 presents the preprocessing applied for each language, Section 4 presents our
approach to the problem, the results obtained per model and a discussion of them, and finally
Section 5 summarizes the paper and proposes future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Corpus</title>
<p>The dataset of this competition is composed of 200 authors for each language, where each
author is represented by 200 tweets. Of the 200 authors, 100 were hate spreaders and the
other 100 were not. Moreover, we would like to remark that all URLs, hashtags and user
mentions in the corpus were masked with a unique token for each type.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Preprocessing</title>
<p>The objective of this step is to reduce the vocabulary size by merging different token occurrences
that refer to the same concept. For the preprocessing of the dataset we have followed the
steps below.</p>
<p>First, as we are interested in detecting hate speech at the author level, we have concatenated
all the tweets of each author into one single string. Then, we have converted the text to lowercase
and replaced all emojis and emoticons with the token “emoji” employing a regular
expression. After that, we have applied a different linguistic preprocessing for each language: in
the English dataset, we have replaced some contractions with their expanded form (for example,
the token “you’d” has been replaced by “you would”, the token “it’s” has been replaced by “it
is”, etc.), while in the Spanish dataset we have replaced some words with their
colloquial equivalents, so that both forms are merged into a single token (for instance, the token “por” has been replaced by “x” and the token “que” has
been replaced by “k”). Then, we have removed all punctuation marks such as periods, commas,
exclamation marks, etc. Moreover, we have reduced the different forms of expressing laughter, such as
“hahahah”, “ahahha”, “jajajaja”, “lol” and “lmao”, to the token “haha”. Finally, we have removed the
stopwords and performed stemming on both datasets.</p>
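<p>The steps above can be sketched in Python. This is a minimal illustration, not the authors’ code: the regular expressions, the contraction list and the colloquial map below are assumptions chosen to mirror the examples in the text.</p>

```python
import re

# Illustrative replacement tables (assumptions based on the examples above).
CONTRACTIONS = {"you'd": "you would", "it's": "it is"}   # English
COLLOQUIAL = {"por": "x", "que": "k"}                    # Spanish
# Very rough laughter and emoji patterns, for illustration only.
LAUGH = re.compile(r"\b(?:[haj]{4,}|lol|lmao)\b")
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(tweets, lang="en"):
    # 1. Concatenate all of an author's tweets into a single lowercase string.
    text = " ".join(tweets).lower()
    # 2. Replace emojis and emoticons with a single token.
    text = EMOJI.sub(" emoji ", text)
    # 3. Language-specific replacements, merging variant forms into one token.
    table = CONTRACTIONS if lang == "en" else COLLOQUIAL
    for src, dst in table.items():
        text = text.replace(src, dst)
    # 4. Remove punctuation, then normalize laughter variants.
    text = re.sub(r"[^\w\s]", " ", text)
    text = LAUGH.sub("haha", text)
    # 5. (Stopword removal and stemming would follow, e.g. with NLTK.)
    return text
```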
    </sec>
    <sec id="sec-4">
      <title>4. Our Approach</title>
<p>In this task we have considered the feature extraction process and the machine learning model
estimation as a combined optimization process. Therefore, we have performed an extensive
grid search to choose the best combination of hyper-parameters for the TF-IDF vectorizer and the
different machine learning models employed. To assess the performance of our classifiers, we
have performed a 10-fold cross-validation over the training dataset.</p>
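<p>This combined optimization can be sketched with scikit-learn’s Pipeline and GridSearchCV. The parameter grid below is illustrative (the values are drawn from the ranges discussed in this section), not the authors’ exact search space.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Vectorizer and classifier are tuned jointly in a single pipeline.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC()),
])

# Illustrative grid over the hyper-parameters discussed in this section.
param_grid = {
    "tfidf__analyzer": ["word", "char"],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [4, 8, 11],
    "clf__C": [0.1, 1, 100],
    "clf__kernel": ["linear", "rbf"],
}

# Accuracy is estimated with 10-fold cross-validation over the training set.
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy", n_jobs=-1)
# search.fit(author_documents, labels)  # one concatenated string per author
```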
      <p>
        For feature extraction, we have employed a TF-IDF [
        <xref ref-type="bibr" rid="ref4">4</xref>
] vectorizer, which allows us to quantify
the importance of every sequence of terms present in the corpus by multiplying the term frequency
in the text by the inverse document frequency of the term in the corpus. The hyper-parameters
of the vectorizer are the following: “analyzer”, which denotes the level at which the feature
extraction is performed (either word level or character level); “ngram_range”, which denotes
the order of the employed language model; and “min_df”, which removes those n-grams whose
document frequency is lower than a given threshold.
      </p>
      <p>
Before starting to discuss the selected classifiers, we would like to remark that we have decided
to avoid employing deep learning for this task. This is due to the fact that we only have 200
authors per language, and given that deep learning is usually data-hungry, it could easily
overfit. Therefore, among all the possible machine learning models, we have chosen the
following: logistic regression (LR) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Naive Bayes (NB)[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Support Vector Machine (SVM)
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Random Forest (RF) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], multiple linear models trained with Stochastic Gradient Descent
(SGD) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and K-nearest neighbor (KNN) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:
• Logistic regression: A basic algorithm for binary classification, equivalent to linear
regression but passing the output through a logit function. The hyper-parameters taken for this model are
an L2 “penalty”, the “liblinear” solver and the regularization coefficient “C”.
• Naive Bayes: A well-known technique which has been employed to tackle many
information retrieval problems. For this model, the hyper-parameters used are the smoothing
parameter “alpha” and “fit_prior”, which denotes whether the model learns the prior of
every class.
• Support Vector Machine: A linear classification model that employs the maximal-margin
hyperplane as decision boundary. This fact is relevant to our task given that, due to the
moderate size of the dataset and the large number of extracted features, many possible
decision boundaries exist. Moreover, this model also allows solving non-linear problems by
applying an appropriate kernel function. The tuned hyper-parameters are the regularization
coefficient “C” and the employed kernel.
• Random Forest: An ensemble model of multiple decision trees. The tuned
hyper-parameters are “criterion”, which measures the quality of every split, and “min_samples_leaf”,
which denotes the minimum number of samples required to be at a leaf node.
• Stochastic Gradient Descent classifier: An optimization technique that allows us to fit
linear classifiers employing gradient descent. The hyper-parameters selected for this model
are the “loss” criterion, where each loss corresponds to a linear classifier
(for example, choosing the “log” loss results in a logistic regression model with SGD
training), an L2 “penalty” and the regularization term “alpha”.
• K-Nearest Neighbor: A well-known non-parametric classifier where the class of each test
sample is computed from a simple majority vote of its K nearest neighbors.
The tuned hyper-parameters for this model are the “weights”, which can be uniform (each
point is weighted equally) or proportional to the distance from the neighbors; the distance
metric, for which we have tried the Euclidean, Manhattan and Minkowski
distances; the number of neighbors
“n_neighbors” to consider during the voting; and the “leaf_size” of each branch.
      </p>
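<p>As a sketch, the six candidates can be instantiated with scikit-learn as follows, shown here with the Spanish-dataset settings from Tab. 2 as an illustration. We use MultinomialNB as the Naive Bayes variant; this is an assumption on our part, chosen because its “alpha” and “fit_prior” arguments match the hyper-parameters named above.</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# One entry per candidate; each was tuned separately via the grid search.
models = {
    "LR": LogisticRegression(penalty="l2", solver="liblinear", C=100),
    "NB": MultinomialNB(alpha=0.25, fit_prior=True),
    "SVM": SVC(C=1, kernel="linear"),
    "RF": RandomForestClassifier(criterion="gini", min_samples_leaf=8),
    "SGD": SGDClassifier(loss="hinge", penalty="l2", alpha=0.001),
    "KNN": KNeighborsClassifier(n_neighbors=10, weights="distance", leaf_size=20),
}
```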
      <p>
        Finally, we would like to remark that we have used scikit-learn [
        <xref ref-type="bibr" rid="ref11">11</xref>
] as the toolkit for the
machine learning models employed.
      </p>
<p>If we take a look at Tab. 2, it can be seen that linear models such as the SVM, logistic regression
and the SGD classifier have performed particularly well on this task. This is due to the fact that the
number of features is much larger than the number of samples, making it feasible to separate the
two classes linearly. Among these models, the best performing one has been the SVM, achieving
an accuracy of 80.90% for the Spanish dataset and 70.50% for the English dataset.</p>
<table-wrap id="tab2">
  <label>Table 2</label>
  <caption>
    <p>Best hyper-parameters found per model and language, together with the estimated 10-fold cross-validation accuracy.</p>
  </caption>
  <table>
    <thead>
      <tr><th>Model</th><th>Lang.</th><th>Model hyper-params</th><th>Tf-idfVect hyper-params</th><th>Acc. (%)</th></tr>
    </thead>
    <tbody>
      <tr><td>LR</td><td>EN</td><td>C: 100</td><td>word, unigrams and bigrams, min_df: 11</td><td>68.50</td></tr>
      <tr><td>LR</td><td>ES</td><td>C: 100</td><td>word, unigrams and bigrams, min_df: 11</td><td>80.50</td></tr>
      <tr><td>NB</td><td>EN</td><td>alpha: 0.25, prior: False</td><td>word, unigrams and bigrams, min_df: 8</td><td>65.55</td></tr>
      <tr><td>NB</td><td>ES</td><td>alpha: 0.25, prior: True</td><td>word, unigrams and bigrams, min_df: 8</td><td>79.00</td></tr>
      <tr><td>RF</td><td>EN</td><td>criterion: “gini”, depth: 4, min_samples: 10</td><td>word, unigrams and bigrams, min_df: 8</td><td>67.00</td></tr>
      <tr><td>RF</td><td>ES</td><td>criterion: “gini”, depth: 4, min_samples: 8</td><td>word, unigrams and bigrams, min_df: 8</td><td>80.50</td></tr>
      <tr><td>KNN</td><td>EN</td><td>weights: “distance”, metric: “euclidean”, neighbors: 5, leaf_size: 20</td><td>word, unigrams and bigrams, min_df: 8</td><td>65.00</td></tr>
      <tr><td>KNN</td><td>ES</td><td>weights: “distance”, metric: “euclidean”, neighbors: 10, leaf_size: 20</td><td>word, unigrams and bigrams, min_df: 8</td><td>77.00</td></tr>
      <tr><td>SGD</td><td>EN</td><td>alpha: 0.01, loss: “perceptron”</td><td>word, bigrams, min_df: 9</td><td>70.00</td></tr>
      <tr><td>SGD</td><td>ES</td><td>alpha: 0.001, loss: “hinge”</td><td>word, unigrams and bigrams, min_df: 9</td><td>80.00</td></tr>
      <tr><td>SVM</td><td>EN</td><td>C: 0.1, kernel: “linear”</td><td>word, unigrams, min_df: 4</td><td>70.50</td></tr>
      <tr><td>SVM</td><td>ES</td><td>C: 1, kernel: “linear”</td><td>word, unigrams, min_df: 10</td><td>80.90</td></tr>
    </tbody>
  </table>
</table-wrap>
      <p>Again, this is motivated by the fact that the SVM chooses the separating hyperplane with the
maximal margin among all the possible separating hyperplanes. Other techniques such as
Random Forest, the Naïve Bayes classifier and K-NN also performed well, but they did not reach
the results obtained by the linear models.</p>
      <p>As final model, we have chosen an SVM for both datasets, given that it is the classifier which
has achieved the highest estimated accuracy in both languages. Finally, we have trained an SVM
for each language, employing the full training dataset and the hyper-parameters described in Tab. 2.
With these models, we have obtained an accuracy of 69% over the English dataset and 81% over
the Spanish dataset in the competition.</p>
      <p>
        It can be seen that our estimation of the performance of the system has worked reasonably
well, with the test accuracies reaching levels very similar to the ones predicted during training. The test
accuracy results have been provided by the TIRA evaluation platform [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
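<p>The two submitted models can be sketched as pipelines with the SVM hyper-parameters from Tab. 2; the EN/ES assignment below follows our reading of the table and is not confirmed elsewhere.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# One pipeline per language, fit on the full training data before submission.
final_models = {
    "en": Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 1), min_df=4)),
        ("clf", SVC(C=0.1, kernel="linear")),
    ]),
    "es": Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 1), min_df=10)),
        ("clf", SVC(C=1, kernel="linear")),
    ]),
}
```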
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
<p>To sum up, we have described the methods employed for this task. We have detailed how our
whole system works, from the preprocessing step to the estimation of the best hyper-parameters
for the feature extractor and the machine learning models. We have also seen that our estimation
of the accuracy employing a 10-fold CV is consistent with the test results. As future work, we
would like to test an ensemble of different classifiers to see if it can beat the performance
achieved by our SVM.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. D. L. P.</given-names>
            <surname>Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          , I. Markov,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolska</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2021:
          <article-title>Authorship Verification,Profiling Hate Speech Spreaders on Twitter,and Style Change Detection</article-title>
          ,
          <source>in: 12th International Conference of the CLEF Association (CLEF</source>
          <year>2021</year>
          ), Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. D. L. P.</given-names>
            <surname>Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          , E. Fersini,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <source>Profiling Hate Speech Spreaders on Twitter Task at PAN</source>
          <year>2021</year>
          ,
          <article-title>in: CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS</article-title>
          .org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pizarro</surname>
          </string-name>
          ,
          <article-title>Using n-grams to detect fake news spreaders on twitter</article-title>
          ,
          <source>in: CLEF</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          ,
          <source>Journal of documentation</source>
          (
          <year>1972</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pearl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <article-title>On the rate of growth of the population of the united states since 1790 and its mathematical representation</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          <volume>6</volume>
          (
          <year>1920</year>
          )
          <fpage>275</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Maron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Kuhns</surname>
          </string-name>
          ,
          <article-title>On relevance, probabilistic indexing and information retrieval</article-title>
          ,
          <source>Journal of the ACM (JACM) 7</source>
          (
          <year>1960</year>
          )
          <fpage>216</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Support-vector networks</article-title>
          ,
          <source>Machine learning 20</source>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Random forests,
          <source>Machine learning 45</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <article-title>Online learning and stochastic approximations, On-line learning in neural networks 17 (</article-title>
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Silverman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Jones</surname>
          </string-name>
          , E. Fix and J. L. Hodges (
          <year>1951</year>
          )
          <article-title>: An important contribution to nonparametric discriminant analysis and density estimation: Commentary on fix and hodges (</article-title>
          <year>1951</year>
          ), International Statistical Review/Revue Internationale de Statistique (
          <year>1989</year>
          )
          <fpage>233</fpage>
          -
          <lpage>238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <source>the Journal of machine Learning research 12</source>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , TIRA Integrated Research Architecture, in: N.
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Peters (Eds.),
          <source>Information Retrieval Evaluation in a Changing World, The Information Retrieval Series</source>
          , Springer, Berlin Heidelberg New York,
          <year>2019</year>
          . doi:10.1007/978-3-030-22948-1_5.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>