<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling Hate Speech Spreaders on Twitter using stylistic features and word embeddings: Notebook for PAN at CLEF 2021</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucía Gómez-Zaragozá</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Hinojosa Pinto</string-name>
          <email>shinojosa@multiscan.eu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València</institution>
          ,
          <addr-line>Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multiscan Technologies S.L., Universitat Politècnica de València</institution>
          ,
          <addr-line>Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the solutions proposed for the Profiling Hate Speech Spreaders on Twitter task at PAN 2021, which consists of classifying each author as hater or non-hater from a set of tweets, in Spanish and English. A different approach is used for each language. For Spanish, an ensemble of an LSTM and a Logistic Regression model trained with stylistic features is used. For English, an ensemble of SVC and Random Forest models, also trained with stylistic features, is proposed. Our solutions achieved an accuracy of 83% in Spanish and 58% in English, resulting in an overall accuracy of 70.5% in the task ranking.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate speech</kwd>
        <kwd>author profiling</kwd>
        <kwd>natural language processing</kwd>
        <kwd>NLP</kwd>
        <kwd>embeddings</kwd>
        <kwd>LSTM</kwd>
        <kwd>Twitter</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic hate speech detection on social media has become a topic of growing interest in the
artificial intelligence community and, in particular, in the area of Natural Language Processing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Although different definitions can be found in the literature, hate speech is commonly described as
language that attacks or disparages a person or a group based on specific characteristics that include,
among others, physical appearance, nationality, religion or sexual orientation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Given the huge
amount of user-generated content and the rapid dissemination of information these days, being able to
identify not isolated hate speech comments but hate speech spreaders is a key first step in trying to
prevent hate speech from spreading in online communications.
      </p>
      <p>
        This paper describes the proposed models for the PAN 2021 Profiling Hate Speech Spreaders on
Twitter [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is one of the three proposed tasks at CLEF 2021 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], deployed on the TIRA platform [5].
The dataset provided in the shared task consisted of a balanced set of users, labeled as haters if they
had shared some hate speech tweets and as non-haters otherwise. It was provided in two languages, namely
Spanish and English. For each of them, the dataset included 200 different users and 200 tweets per
user. As recommended by the shared task, we present a different solution for each language. For the
Spanish dataset, an ensemble of an LSTM and a logistic regression model trained with stylistic features is
proposed, which achieved 83% accuracy on the provided test set. For the English dataset, an ensemble
of Support Vector Classification and Random Forest models, both based on stylistic features, is presented, which
achieved 58% accuracy on the provided test set.
      </p>
      <p>In Section 2 we present some related work on profiling hate speech spreaders. In Section 3 we
describe the two approaches proposed, including the description of the features used and the
implemented machine learning models. In Section 4 we present the experimental results achieved for
both languages independently. Finally, in Section 5, we present the conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Generic text mining features are commonly used for hate speech detection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These include several
types of features, such as those obtained from dictionaries, bag-of-words (BOW), N-grams,
TF-IDF, part-of-speech (POS) tags or word embeddings. There are also features specific to hate speech
detection, but in some cases they require additional user information (such as gender, age or geographic
location), or they focus on specific stereotypes. Regarding the algorithms used for hate speech
detection, which is typically treated as a binary classification task (hate vs. not-hate), the most common
are Support Vector Machines, followed by Random Forest, Decision Trees and Logistic Regression [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
More recent approaches use deep learning techniques, such as attention-based neural networks [6] or
an ensemble of neural networks [7], obtaining good performance results.
      </p>
      <p>In addition, the aim of this shared task is not only to detect hateful content, but also to profile hate speech
spreaders. In this sense, common features used in the field of author profiling are stylistic features (such
as frequency of punctuation marks, capital letters, word frequency), content features (such as BOW,
TF-IDF or N-grams), POS tags, readability features or emotional features (emotion words and
emoticons) [8,9]. Other message features such as retweets, hashtags, URLs and mentions are also
considered in this area and, recently, word and character embeddings have also been applied. Regarding
the algorithms used for author profiling, traditional machine learning models are widely used, but
in the last few years, deep learning approaches such as Recurrent Neural Networks (RNN) and
Convolutional Neural Networks (CNN) have gained attention [10].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The proposed models aim to discriminate hate speech spreaders from those who have never shared
hate speech content on Twitter. They were built as an ensemble of classifiers, using two different
approaches. On the one hand, stylistic features were extracted for each tweet and statistics per author
were obtained from them in order to apply classic machine learning algorithms. On the other hand, a
neural network with word embeddings was trained with the groups of tweets. Since no development set
was provided in the shared task, it was decided to randomly split the training set into two partitions:
90% of the users for the development set and 10% for the test set, containing 180 and 20 users,
respectively. These data partitions were used to evaluate the models using the official metric for this
task, accuracy, and to compare their performance on the same unseen data. The best models were
then applied to the test data provided in the shared task, whose results were used to rank the performance
of our system. The following sub-sections describe the two different approaches.</p>
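      <p>The user-level 90%/10% split described above can be sketched as follows. This is only an illustrative sketch: the function name, the fixed random seed and the use of integer user identifiers are assumptions, not details taken from the original experiments.</p>
      <preformat>
```python
import random


def split_users(user_ids, test_fraction=0.10, seed=0):
    """Randomly split user ids into a development and a test partition."""
    ids = list(user_ids)
    # A seeded generator keeps the partition reproducible (the seed is an assumption).
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# 200 users per language, as in the shared task data.
dev_users, test_users = split_users(range(200))
print(len(dev_users), len(test_users))
```
      </preformat>
      <p>Splitting by user rather than by tweet keeps all 200 tweets of an author in the same partition, so the test users remain unseen during development.</p>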
    </sec>
    <sec id="sec-4">
      <title>3.1. Word embeddings and LSTM</title>
      <p>In this approach, the tweets of each user were first concatenated in order to obtain
one text per subject. A preprocessing step was also applied in order to remove accents, capital letters,
double spaces and stop-words. Then, the development set was divided into two partitions: 60% of the
users for the training set and 40% of the users for the validation set, with 108 and 72 users respectively.</p>
      <p>First, a tokenizer with a selected maximum number of words was fitted on the training set, so that
only the most frequent words remained in the vocabulary, and the less used words were eliminated. The next step
was to convert texts into sequences, meaning that each word of the text was translated into the index
of that word in the vocabulary. The last step was to pad the sequences, so that they all had the same
length regardless of the number of words they originally had. Specifically, a maximum sequence length
of 1000 words was set, with the sequences being the collection of tweets from one subject, so that none
of them would be trimmed. The word embeddings, with the configured dimension, were trained
jointly with the rest of the neural network parameters.</p>
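      <p>The three preparation steps above (fitting a vocabulary limited to the most frequent words, converting texts into index sequences, and padding) can be illustrated with a small library-free sketch. The whitespace tokenization, the reservation of index 0 for padding, the dropping of out-of-vocabulary words and the left-padding are assumptions made for this illustration; the original tokenizer may behave differently.</p>
      <preformat>
```python
from collections import Counter


def fit_vocabulary(texts, max_words):
    """Map the max_words most frequent words to indices 1..max_words (0 is padding)."""
    counts = Counter(word for text in texts for word in text.split())
    top = counts.most_common(max_words)
    return {word: index for index, (word, _) in enumerate(top, start=1)}


def texts_to_sequences(texts, vocab):
    """Replace each known word by its index; words outside the vocabulary are dropped."""
    return [[vocab[w] for w in text.split() if w in vocab] for text in texts]


def pad_sequences(sequences, max_len):
    """Left-pad with zeros, keeping at most the last max_len indices of each sequence."""
    padded = []
    for seq in sequences:
        kept = seq[-max_len:]
        padded.append([0] * (max_len - len(kept)) + kept)
    return padded


train_texts = ["odio el lunes", "me gusta el lunes", "el gato"]
vocab = fit_vocabulary(train_texts, max_words=2)
sequences = texts_to_sequences(train_texts, vocab)
padded = pad_sequences(sequences, max_len=4)
print(vocab)
print(padded)
```
      </preformat>
      <p>In this toy example only the two most frequent words survive in the vocabulary, and every user text ends up with exactly four indices.</p>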
      <p>Once the above steps have been completed, the training of the neural network can be performed,
setting the maximum number of words to be considered in the tokenizer and the word embedding
dimensions. The neural network architecture is based on an LSTM, as shown in Figure 1, and it was
trained using the categorical cross-entropy as the loss function.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Stylistic features and classical machine learning algorithms</title>
      <p>To obtain the model that discriminates between a hater and a non-hater, a set of stylistic features was
calculated for each tweet independently. These features have been divided into three groups:
pattern-related, word-related and emoji-related features. The first group includes, among others, the
number of occurrences of certain patterns in the texts (such as hashtags, URLs or retweets) and the
number of certain characters (such as symbols or letters). Word-related features include counts of
particular word categories, such as nouns, verbs or adjectives. Both feature sets were calculated using regular
expressions with the RegEx Python module [11], together with the spaCy Python library [12] for
lemmatization and identification of word categories, using the model “en_core_web_sm” for the
English dataset and “es_core_news_sm” for the Spanish dataset. Regarding the emojis, they were analyzed and
grouped following different categories from the advertools Python library [13]. The ratio of
unique emojis to total emojis in the tweet was also included. The total set of 37 features,
referred to here as handcrafted features, is shown in Table 1.</p>
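      <p>As an illustration of the pattern-related features, the following sketch counts a few patterns in a single tweet with regular expressions. The specific patterns, the feature names and the "RT " retweet marker are hypothetical choices for this example and do not reproduce the 37 features of Table 1.</p>
      <preformat>
```python
import re

# Hypothetical patterns; the original feature definitions may differ.
PATTERNS = {
    "hashtags": re.compile(r"#\w+"),
    "mentions": re.compile(r"@\w+"),
    "urls": re.compile(r"https?://\S+"),
    "uppercase": re.compile(r"[A-Z]"),
    "exclamations": re.compile(r"!"),
}


def tweet_features(tweet):
    """Count pattern occurrences in one tweet and add two simple extra features."""
    features = {name: len(pattern.findall(tweet)) for name, pattern in PATTERNS.items()}
    features["is_retweet"] = int(tweet.startswith("RT "))
    features["total_words"] = len(tweet.split())
    return features


example = tweet_features("RT @user: Check THIS out! https://example.com #news #now")
print(example)
```
      </preformat>
      <p>Each tweet yields one such dictionary; the per-author statistics are then computed over the dictionaries of all tweets written by the same user.</p>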
      <p>Once the stylistic features were calculated for each tweet, four statistics (mean, standard deviation,
minimum and maximum) were computed over all the tweets of the same user. As a result, a vector of 148
stylistic features (37 features × 4 statistics) was obtained for each author. Features were then standardized by subtracting the mean
and dividing by the standard deviation computed on the development set, and these values were then applied to
the test set.</p>
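      <p>A minimal sketch of the aggregation and standardization steps is given below. It assumes that each tweet is represented by a list of numeric features, uses the population standard deviation, and replaces a zero standard deviation by 1 to avoid division by zero; these three choices are assumptions of the sketch rather than documented details of the original pipeline.</p>
      <preformat>
```python
from statistics import mean, pstdev


def author_vector(tweet_feature_rows):
    """Collapse per-tweet feature rows into mean, std, min and max for each feature."""
    vector = []
    for column in zip(*tweet_feature_rows):
        vector.extend([mean(column), pstdev(column), min(column), max(column)])
    return vector


def fit_standardizer(dev_vectors):
    """Compute per-feature mean and standard deviation on the development set only."""
    columns = list(zip(*dev_vectors))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) or 1.0 for column in columns]  # guard for constant features
    return means, stds


def standardize(vectors, means, stds):
    """Apply the development-set statistics to any set of author vectors."""
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]


# 200 tweets with 37 features each give 37 * 4 = 148 values per author.
rows = [[i + j for j in range(37)] for i in range(200)]
print(len(author_vector(rows)))

means, stds = fit_standardizer([[1.0, 10.0], [3.0, 10.0]])
print(standardize([[5.0, 10.0]], means, stds))
```
      </preformat>
      <p>Fitting the standardizer on the development set and reusing its statistics on the test set avoids leaking information from the test users into the model.</p>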
      <p>As there were only 200 different users in the dataset, a feature reduction method was applied to
reduce the number of features. First, Pearson's correlation matrix was calculated, and
highly correlated features (correlation coefficient &gt; 0.95) were eliminated. Then, a filter method was implemented to avoid
overfitting. It consisted of calculating the area under the ROC curve for each feature and
removing those with values close to 0.5, which means that they were not relevant for the classification
task. With this method, 50% of the features were eliminated, keeping the most informative
ones. Finally, sequential backward selection was applied to determine the optimal combination
of N features for classification, with N ranging from 10 to a threshold (T) of the maximum allowed number of
features, which could be 15, 20 or 30. This selection method iteratively
computes a criterion function for a given machine learning classification algorithm using a
cross-validation strategy. In each iteration, one feature is removed at a time to create subsets of n-1 features.
For each of them, a machine learning model is trained, and the criterion function for cross-validation is
recalculated. Based on these results, the feature whose removal yields the best performing model is discarded,
since it is the one that helps the least in the classification.
This process, called feature ablation, is repeated until 10 features are left. In this work, we used accuracy
as the criterion function and stratified k-fold cross-validation with five folds as the cross-validation
strategy.</p>
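      <p>The AUC-based filter and the sequential backward selection can be sketched as follows. The code is a simplified illustration: the AUC is computed with the rank-sum (Mann-Whitney) formulation, the filter keeps the half of the features whose AUC is farthest from 0.5, and the criterion is passed in as a generic function, whereas in the paper it is the cross-validated accuracy of a classifier. The function names and the toy criterion are hypothetical.</p>
      <preformat>
```python
def auc(values, labels):
    """Area under the ROC curve of a single feature (rank-sum formulation)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_filter(columns, labels, keep_fraction=0.5):
    """Keep the features whose AUC is farthest from 0.5 (the chance level)."""
    ranked = sorted(
        columns,
        key=lambda name: abs(auc(columns[name], labels) - 0.5),
        reverse=True,
    )
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


def backward_selection(features, criterion, n_min):
    """Drop one feature per iteration, always the one whose removal scores best."""
    selected = list(features)
    history = [(tuple(selected), criterion(selected))]
    while len(selected) > n_min:
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = max(candidates, key=criterion)
        history.append((tuple(selected), criterion(selected)))
    # Return the best subset seen across all sizes, with its criterion value.
    return max(history, key=lambda item: item[1])


labels = [0, 0, 1, 1]
columns = {"good": [1, 2, 3, 4], "noise": [1, 2, 1, 2]}
print(auc_filter(columns, labels))

# Toy criterion standing in for cross-validated accuracy (an assumption).
useful = {"a", "b"}
toy_criterion = lambda subset: len(useful.intersection(subset)) - 0.1 * len(subset)
best_subset, best_score = backward_selection(["a", "b", "c", "d"], toy_criterion, n_min=1)
print(best_subset)
```
      </preformat>
      <p>Keeping the whole history makes it possible to pick the best subset size between the lower bound and the initial number of features, which mirrors the search for the optimal N described above.</p>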
      <p>Regarding the machine learning classification algorithms, the following were chosen: Support
Vector Classification (SVC), K-Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest
(RF) and Decision Tree (DT). Each of these algorithms was used first in the sequential
backward selection, with default hyperparameters, and then in a hyperparameter tuning step. In the latter step, the same
cross-validation strategy was used as in the feature selection method, for the different hyperparameter
combinations shown in Table 2. Finally, the test set was transformed by keeping only the selected
features and applying the standardization with the training set statistics. The machine learning model
was then applied with the chosen hyperparameters and the predictions were obtained.</p>
    </sec>
    <sec id="sec-6">
      <title>4. Experimental results</title>
      <p>The following sections summarize the results obtained with the different datasets, in Spanish and
English, and detail the final models chosen for each of them.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1. Spanish dataset</title>
      <p>As mentioned above, two approaches were evaluated for the Spanish dataset. Firstly, the word embedding model
described in Section 3.1 was trained using different combinations of parameters to obtain the best
configuration. Table 3 shows the accuracy results obtained in the test set when varying the maximum
number of vocabulary words considered by the tokenizer among 1000, 2000, 3000 and 4000, keeping the
embedding dimensions constant.</p>
      <p>The experimentation conducted showed that the best performing network configuration consisted of
a maximum of 3000 words considered in the tokenizer and a 10-dimensional embedding, achieving
80% accuracy.</p>
      <p>Despite the good results, the methodology described in Section 3.2 was used to obtain a new hater
versus non-hater classifier based on stylistic features. The results are shown in Table 5, where the
machine learning model and the number of features used by each model (N-features) are indicated. It
also includes the following evaluation metrics: the cross-validation accuracy (CV-acc), the test accuracy
(Test-acc), the true positive rate and the true negative rate in the test set (Test-TPR and Test-TNR,
respectively). Only the models with the feature selection and hyperparameters that provided the best
results have been included, rather than all combinations tested.</p>
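      <p>For reference, the three test metrics reported in the table can be computed from the true and predicted labels as in the following sketch. The encoding of the hater class as 1 and the non-hater class as 0 is an assumption of this example.</p>
      <preformat>
```python
def classification_rates(y_true, y_pred):
    """Return accuracy, true positive rate and true negative rate (1 = hater)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(pairs) - n_pos
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / n_pos,
        "tnr": tn / n_neg,
    }


rates = classification_rates([1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 1, 1, 0, 0, 1, 0])
print(rates)
```
      </preformat>
      <p>Reporting the true positive and true negative rates alongside accuracy shows whether a model favors one of the two classes, which accuracy alone cannot reveal.</p>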
      <p>The highest accuracy was 80%, matching that of the word embedding approach. This score was achieved with the logistic
regression, both in the development set and in the test set, using the features listed in Table 6.</p>
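      <p>
        As a last step, since both approaches achieved high accuracies, an ensemble of the two best models
was built. The logistic regression and the word embedding scores were combined using the sum rule
with an alpha weight associated with the score of each approach, as shown in equation (1), where
<inline-formula><tex-math>S</tex-math></inline-formula> is the combined score, <inline-formula><tex-math>S_{LR}</tex-math></inline-formula> is the score from the logistic regression, <inline-formula><tex-math>S_{WE}</tex-math></inline-formula> is the score from the word
embedding and <inline-formula><tex-math>\alpha</tex-math></inline-formula> is the weight in the range [0, 1].
      </p>
      <p>
        <disp-formula id="eq-1"><label>(1)</label><tex-math>S = \alpha \cdot S_{LR} + (1 - \alpha) \cdot S_{WE}</tex-math></disp-formula>
      </p>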
      <p>To find the best alpha estimate, values between 0 and 1 were tested in increments of 0.05 for the
development set. The alpha value that achieved the highest accuracy was 0.85, which reached 86%
accuracy on the development set.</p>
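      <p>The search for the alpha weight can be sketched as a simple grid sweep. The sketch assumes that both models output a score in [0, 1] for the hater class, that a combined score of at least 0.5 is classified as hater, and that ties between alpha values are resolved by keeping the smallest one; none of these details are stated in the original description, and the toy scores are invented for the example.</p>
      <preformat>
```python
def combine(score_a, score_b, alpha):
    """Sum rule: weighted combination of the scores of two models."""
    return alpha * score_a + (1 - alpha) * score_b


def accuracy(scores, labels, threshold=0.5):
    """Fraction of users whose thresholded score matches the label."""
    predictions = [int(score >= threshold) for score in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_alpha(scores_a, scores_b, labels, step=0.05):
    """Evaluate alpha = 0, step, ..., 1 and return (best accuracy, alpha)."""
    n_steps = int(round(1 / step))
    results = []
    for i in range(n_steps + 1):
        alpha = round(i * step, 2)
        combined = [combine(a, b, alpha) for a, b in zip(scores_a, scores_b)]
        results.append((accuracy(combined, labels), alpha))
    return max(results, key=lambda result: result[0])


# Toy development-set scores (hypothetical values).
scores_a = [0.9, 0.2, 0.6, 0.4]
scores_b = [0.4, 0.6, 0.9, 0.1]
labels = [1, 0, 1, 0]
best_accuracy, alpha = best_alpha(scores_a, scores_b, labels)
print(best_accuracy, alpha)
```
      </preformat>
      <p>With a step of 0.05 the sweep evaluates 21 candidate weights, which is cheap enough to run on the development set for every pair of models.</p>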
      <p>The ensemble of the regression model and word embedding was finally applied to the test set
provided in the task, achieving 83% accuracy.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2. English dataset</title>
      <p>As with the Spanish tweets, word embeddings were first tested to solve the classification task for the
English dataset. The neural network was adapted to the dataset by varying the maximum number
of words to be considered in the tokenizer among 1000, 2000, 3000 and 4000, keeping the embedding
dimensions constant. The results are shown in Table 7.</p>
      <p>Although the results in Table 7 were not as good as expected, additional experiments were carried out by
varying the embedding dimensions for the best configuration found. The results are shown in
Table 8.</p>
      <p>Varying the embedding dimensions did not provide better results either. Therefore, it was
decided not to continue in this direction and to focus on the second approach based on classical machine
learning classifiers.</p>
      <p>Following the pipeline described in Section 3.2, the results shown in Table 9 were achieved. The
table shows the machine learning model and the number of features used by each model (N-features).
It also includes the following evaluation metrics: the cross-validation accuracy (CV-acc), the test
accuracy (Test-acc), the true positive rate and the true negative rate in the test set (Test-TPR and
Test-TNR, respectively).</p>
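      <p>Pattern-related features:</p>
      <list list-type="bullet">
        <list-item><p>Mean retweet</p></list-item>
      </list>
      <p>Word-related features:</p>
      <list list-type="bullet">
        <list-item><p>Std repeated words</p></list-item>
        <list-item><p>Std total words</p></list-item>
      </list>
      <p>Emoji-related features:</p>
      <list list-type="bullet">
        <list-item><p>Std unique emojis</p></list-item>
        <list-item><p>Mean emoji face-affection</p></list-item>
        <list-item><p>Max emoji face-affection</p></list-item>
        <list-item><p>Mean emoji face-hand</p></list-item>
      </list>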
      <p>
        Since the classical machine learning models based on stylistic features obtained better results, it was
decided to create an ensemble of the two best models obtained. The final prediction was obtained by
combining the SVC and RF scores using the sum rule with an alpha weight associated with the score
of each model, as previously performed in the Spanish ensemble. The combination is shown in
equation (2), where <inline-formula><tex-math>S</tex-math></inline-formula> is the combined score, <inline-formula><tex-math>S_{SVC}</tex-math></inline-formula> is the score from the SVC, <inline-formula><tex-math>S_{RF}</tex-math></inline-formula> is the score from
the RF and <inline-formula><tex-math>\alpha</tex-math></inline-formula> is the weight in the range [0, 1].
      </p>
      <p>
        <disp-formula id="eq-2"><label>(2)</label><tex-math>S = \alpha \cdot S_{SVC} + (1 - \alpha) \cdot S_{RF}</tex-math></disp-formula>
      </p>
    </sec>
    <sec id="sec-9">
      <title>5. Conclusions and future work</title>
      <p>This paper presented the proposed ensemble models for the PAN 2021 Profiling Hate Speech Spreaders
on Twitter shared task at CLEF 2021. The problem was addressed in two languages, namely Spanish
and English, and two approaches were evaluated for each of them; the resulting evaluations in the task ranking
are summarized in Table 12. For the Spanish dataset, an ensemble was created from a neural network
with word embeddings and a logistic regression. The first one was trained with all the tweets grouped
by subject, whereas the second was based on statistics obtained from stylistic features computed for each
of the user’s tweets. This approach achieved 83% accuracy on the provided test set. Regarding the English dataset,
an ensemble of a support vector classifier and a random forest, both based on statistics of stylistic
features, achieved 58% accuracy on the provided test set.</p>
      <p>Overall, the results showed that stylistic characteristics are important features to consider when
identifying hate speech spreaders, as they helped to improve the results of the word embeddings in
Spanish, and they obtained better results than word embeddings for the English dataset. However, the
task of detecting hate speech spreaders turned out to be very difficult for the English dataset. The best
accuracy was only 70% on our test partition, and it dropped to 58% on the test set provided in the
shared task. Word embeddings were investigated for this language, but they were not included because
they did not yield accurate results, contrary to Spanish. The difference in accuracy between English and
Spanish may indicate that users have different hate-spreading behaviors in different cultures. Future
work will include adding more features, such as TF-IDF based n-grams for both words and characters.</p>
    </sec>
    <sec id="sec-10">
      <title>6. References</title>
      <p>[5] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N.
Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information
Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.</p>
      <p>[6] H. J. Jarquín-Vásquez, M. Montes-y Gómez, L. Villaseñor-Pineda, Not all swear words are
used equal: Attention over word n-grams for abusive language identification, in: Mexican
Conference on Pattern Recognition, Springer, 2020, pp. 282–292.</p>
      <p>[7] S. Zimmerman, U. Kruschwitz, C. Fox, Improving hate speech detection with deep learning
ensembles, in: Proceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018), 2018.</p>
      <p>[8] F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, G. Inches, Overview of the author profiling
task at PAN 2013, in: CLEF Conference on Multilingual and Multimodal Information Access
Evaluation, CELCT, 2013, pp. 352–365.</p>
      <p>[9] F. Rangel, P. Rosso, M. Potthast, M. Trenkmann, B. Stein, B. Verhoeven, W. Daelemans, et
al., Overview of the 2nd author profiling task at PAN 2014, in: CEUR Workshop Proceedings, volume
1180, CEUR Workshop Proceedings, 2014, pp. 898–927.</p>
      <p>[10] F. Rangel, P. Rosso, M. Potthast, B. Stein, Overview of the 5th author profiling task at PAN 2017:
Gender and language variety identification in Twitter, Working notes papers of the CLEF (2017)
1613–0073.</p>
      <p>[11] G. Van Rossum, The Python library reference, release 3.8.2, Python Software Foundation
(2020) 36.</p>
      <p>[12] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings,
convolutional neural networks and incremental parsing, To appear 7 (2017) 411–420.</p>
      <p>[13] Elias Dabbas, advertools: productivity and analysis tools to scale your online marketing,
2021. URL: https://advertools.readthedocs.io/en/master/readme.html.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>Resources and benchmark corpora for hate speech detection: a systematic review</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          51 (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. D. L. P.</given-names>
            <surname>Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          , E. Fersini,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Profiling Hate Speech Spreaders on Twitter Task at PAN 2021</article-title>
          , in:
          <source>CLEF 2021 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. D. L. P.</given-names>
            <surname>Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          , I. Markov,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolska</surname>
          </string-name>
          , E. Zangerle,
          <article-title>Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection</article-title>
          , in:
          <source>12th International Conference of the CLEF Association (CLEF 2021)</source>
          , Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>