Using N-grams and Statistical Features to Identify Hate Speech
Spreaders on Twitter
Notebook for PAN at CLEF 2021

Eszter Katona1, Jakab Buda 2 and Flora Bolonyai 3
1 katona.eszter.rita@gmail.com
2 jakab.buda@gmail.com
3 f.bolonyai@gmail.com


                Abstract
                In this notebook, we summarize the process of developing our software for the PAN 2021
                Profiling Hate Speech Spreaders on Twitter task. Our final software was a stacking ensemble
                classifier of different machine learning models: a mixture of models using word n-grams as
                features and models based on statistical features extracted from the Twitter feeds. The software
                uploaded to the TIRA platform achieved an accuracy of 70% on the English and 79% on the
                Spanish test set.

                Keywords
                Hate Speech Recognition, Text Classification

1. Introduction

    The aim of the PAN 2021 Profiling Hate Speech Spreaders on Twitter task [18] was to investigate
whether the author of a given Twitter feed is likely to spread hate speech. The training and test sets of
the task consisted of English and Spanish Twitter feeds [19].
    As online communication becomes increasingly important, online hate speech also spreads.
Although the question of how to tackle this problem is hard [6], and even the definition of hate speech
is debated [8], its harm is evident [10]. Therefore, its detection is an essential task.
    We used an ensemble of different machine learning models to provide a prediction for each user.
All of our sub-models handle the Twitter feed of a user as a single unit and estimate, for each user,
the probability of being a hate speech spreader. For the final predictions, these sub-models are
combined using a logistic regression.
    In Section 2 we present some related works on classifying hate speech in social media and profiling
hate speech spreaders. In Section 3 we describe our approach in detail together with the extracted
features and models. In Section 4 we present our results. In Section 5 we discuss some potential future
work and in Section 6 we conclude our notebook.

2. Related Works
   There are many different approaches to identifying hate speech. [21] uses a deep convolutional
neural network (DCNN) that could be considered state of the art, as it outperforms other classifiers.
[1] compares eight machine learning models; based on their comparison, SVM and random forest models
show the highest and KNN the lowest performance. Since we could not find an overall best practice,
we decided to try several different models. We found that SVMs [5, 9, 14, 20, 22], XGBoost [28],
logistic regression [26] and random forest [3] models are also commonly used for author profiling
and text classification purposes.


   In the specific domain of classifying hate speech in social media, [4] describes several models
using both neural networks and statistical predictive models such as SVMs. [2], concerned with
identifying misogynistic language in social media, found that SVMs using token n-grams as predictive
features achieved the highest accuracy out of a number of different machine learning models.
   Using word n-grams as features for author profiling has been shown to be effective [5, 9, 14, 20, 22,
25], especially with TF-IDF weighting [27]. Statistical features, such as the number of punctuation
marks [22, 26], medium-specific symbols (for example hashtags and at signs in tweets, links in digital
texts) [11, 13, 20, 22, 24, 26], emoticons [11, 13, 20, 23, 26] or stylistic features [13] are also commonly
used for text classification purposes. Stacking methods are also commonly used for text classification
[12].
   In [8] the authors highlight “Othering Language”, which turned out to be an important input for our
research, as some of our models focus on the wording of the texts. Othering is an “us vs. them” framing
that plays an important role in dehumanization and in prejudices against groups or people. [15] collected
peer-reviewed publications for their systematic review of hate speech detection. They also collected
lexica of hate speech and mention 8 published resources; however, they emphasize that most authors
develop ad-hoc lexica.
   Defining hate speech is not a trivial task; [8] provides a thorough overview of the different
approaches. We did not have to deal with this question ourselves, since the labelled dataset was
provided for the task [19].

3. Our Approach

    Our approach can be considered an extension and improvement of the software we developed for
the PAN 2020 Profiling Fake News Spreaders on Twitter task [17]. Compared to the software described
in [7], we experimented with adding models that use topic probabilities from topic models as predictive
features, with additional descriptive statistics among the features of the user-wise statistical model,
and with a dictionary-based model relying on the n-grams that are most distinctive of the tweets within
each label.

   3.1 The corpus and the environment setup

        3.1.1 The corpus

    The corpus for the PAN 2021 Profiling Hate Speech Spreaders on Twitter task [19] consists of one
English and one Spanish corpus, each containing 200 XML files. Each of these files contains 200 tweets
from an author. Because of the moderate size of the corpus, we wanted to avoid splitting the corpus into
a training and a development set. Therefore, we used cross-validation techniques to prevent overfitting.
Similarly to the PAN Author Profiling task in 2020, the dataset this year came pre-cleaned: all URLs,
hashtags and user mentions in the tweets were changed to standardized tokens.
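
   To make the data handling concrete, the sketch below shows one possible way to read such a corpus.
It assumes the usual PAN author-profiling layout (one XML file per author whose <document> elements
hold the tweets, and a truth.txt file with ":::"-separated author IDs and labels); these details are
assumptions and may differ from the actual release.

```python
import os
import xml.etree.ElementTree as ET

def load_corpus(data_dir):
    """Read one PAN-style corpus directory into {author_id: [tweets]} and {author_id: label}."""
    # truth.txt is assumed to contain lines of the form "author_id:::label"
    labels = {}
    with open(os.path.join(data_dir, "truth.txt"), encoding="utf-8") as f:
        for line in f:
            author_id, label = line.strip().split(":::")
            labels[author_id] = int(label)

    feeds = {}
    for file_name in os.listdir(data_dir):
        if not file_name.endswith(".xml"):
            continue
        author_id = file_name[:-4]
        root = ET.parse(os.path.join(data_dir, file_name)).getroot()
        # each <document> element is assumed to hold one tweet
        feeds[author_id] = [doc.text or "" for doc in root.iter("document")]
    return feeds, labels
```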


        3.1.2 Environment setup
   We developed our software using the Python language (version 3.7). To build our models we mainly
used the following packages: scikit-learn2, xgboost3, spacy4, emoji5, lexical-diversity6, pandas7 and
numpy8. Our code is available on GitHub9.


    3.2 Our models

           3.2.1 N-gram models

   We experimented with a number of machine learning models based on word n-grams extracted from
the text. Specifically, we investigated the performance of regularized logistic regressions (LR), random
forests (RF), XGBoost classifiers (XGB) and linear support vector machines (SVM). For all four
models, we ran an extensive grid search combined with five-fold cross-validation to find the optimal
text pre-processing, vectorization technique and modeling parameters. We tested the same parameters
for the English and Spanish data. We investigated two types of text cleaning methods for all models.
The first method (M1) removed all non-alphanumeric characters (except #) from the text, while the
second method (M2) removed most non-alphanumeric characters (except #) but kept emoticons and
emojis. Both methods transformed the text to lower case. Regarding the vectorization of the corpus, we
experimented with a number of parameters. We tested different word n-gram ranges (unigrams,
bigrams, unigrams and bigrams) and also looked at different scenarios regarding the minimum overall
document frequency of the word n-grams (3, 4, 5, 6, 7, 8, 9, 10) included as features. Additionally, we
included a wide range of hyperparameter values for each model in our grid search. For a more detailed
description of the tested hyperparameters, see [7].
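
   As an illustration of this setup, the following sketch wires one of the n-gram models (the linear
SVM) into a scikit-learn pipeline and grid search. The weighting scheme shown here (TF-IDF) and the
exact grids are assumptions on our part; the full hyperparameter grids are documented in [7].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# One Twitter feed is treated as a single document: the 200 tweets of an author joined together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("svm", LinearSVC()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (2, 2), (1, 2)],   # unigrams, bigrams, uni- and bigrams
    "tfidf__min_df": [3, 4, 5, 6, 7, 8, 9, 10],       # minimum global document frequency
    "svm__C": [0.1, 1, 10, 100],                      # illustrative range
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# X_train: list of cleaned, concatenated feeds; y_train: 0/1 hate speech spreader labels
# search.fit(X_train, y_train)
```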
Table 1
The best performing text cleaning methods, vectorization parameters and model hyperparameters for
the n-gram based machine learning models

 Language   Model   Text cleaning   N-grams            Min. global occurrence   Model hyperparameters 10
 EN         LR      M2              bigrams            9                        C=0.1
 EN         RF      M2              unigrams           10                       B=300, min_samples_leaf=7
 EN         SVM     M2              unigrams           3                        C=1
 EN         XGB     M2              uni- and bigrams   8                        eta=0.3, max_depth=4, colsample_bytree=0.5, subsample=0.7, n_estimators=200
 ES         LR      M1              unigrams           9                        C=100
 ES         RF      M1              bigrams            8                        B=400, min_samples_leaf=5
 ES         SVM     M1              unigrams           10                       C=100
 ES         XGB     M1              uni- and bigrams   7                        eta=0.01, max_depth=5, colsample_bytree=0.5, subsample=0.7, n_estimators=200


2 https://scikit-learn.org/
3 https://xgboost.readthedocs.io/
4 https://spacy.io/
5 https://pypi.org/project/emoji/
6 https://pypi.org/project/lexical-diversity/
7 https://pandas.pydata.org/
8 https://numpy.org/
9 https://github.com/pan-webis-de/katona21
10 Parameter names in the relevant Python package/function. Detailed description in [7].



        3.2.2 User-wise statistical model

    Apart from the n-gram based models, we constructed a model based on statistical variables
describing all tweets of each author, thus giving one more prediction per author. The variables used in
this model are as follows:
        • the mean length of the 200 tweets of the authors both in words and in characters;
        • the minimum length of the 200 tweets of the authors both in words and in characters;
        • the maximum length of the 200 tweets of the authors both in words and in characters;
        • the standard deviations of the length of the 200 tweets of the authors both in words and in
            characters;
        • the range of the length of the 200 tweets of the authors both in words and in characters;
        • the number of retweets in the dataset by each author;
        • the number of URL links in the dataset by each author;
        • the number of hashtags in the dataset by each author;
        • the number of mentions in the dataset by each author;
        • the number of emojis in the dataset by each author;
        • the number of ellipses used at the end of the tweets in the 200 tweets of the authors;
        • the number of words in all capitals (except for the masked URLs, RTs and user mentions)
            in the 200 tweets of the authors;
        • a stylistic feature, the type-token ratio (TTR), to measure the lexical diversity of the
            authors (in the dataset each author has 200 tweets, so the number of tokens per author does
            not differ enough to cause great variance in the TTRs).
    This gives a total of 18 statistical variables. Since we used an XGBoost classifier, we did not
normalize the variables and the linear correlation between the variables posed no problem.
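
   For illustration, the sketch below computes a subset of these variables for one author. The names
of the masked tokens (e.g. "#HASHTAG#", "#URL#") and the exact counting rules are assumptions, not the
definitive implementation; the submitted software also used the emoji and lexical-diversity packages.

```python
import statistics

import emoji  # emoji_count is available in recent versions of the emoji package

def user_features(tweets):
    """Compute a few of the per-author statistical features (illustrative subset)."""
    char_lengths = [len(t) for t in tweets]
    word_lengths = [len(t.split()) for t in tweets]
    tokens = " ".join(tweets).lower().split()
    return {
        "mean_chars": statistics.mean(char_lengths),
        "std_chars": statistics.pstdev(char_lengths),
        "range_words": max(word_lengths) - min(word_lengths),
        "n_retweets": sum(t.startswith("RT") for t in tweets),
        "n_hashtags": sum(t.count("#HASHTAG#") for t in tweets),  # masked token name assumed
        "n_urls": sum(t.count("#URL#") for t in tweets),          # masked token name assumed
        "n_emojis": sum(emoji.emoji_count(t) for t in tweets),
        "n_ellipses": sum(t.rstrip().endswith(("...", "…")) for t in tweets),
        # simple type-token ratio; the software used the lexical-diversity package instead
        "ttr": len(set(tokens)) / len(tokens),
    }
```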
    To find the best hyperparameter set, we used a five-fold cross-validated grid search and finally
refitted the best model on the whole data. The cross-validated accuracies achieved this way are 70%
and 74% for the English and Spanish data respectively. Table 2 contains the best hyperparameters
found.

Table 2
The best model hyperparameters for the XGBoost model using statistical features
        Parameter name                                Parameter values
                                              EN                                 ES
     Column sample by node                   0.8                                0.8
     Column sample by tree                   0.9                                0.8
             gamma                             1                                 4
          Learning rate                      0.1                                0.1
           Max. depth                          2                                 2
        Min. child weight                      2                                 4
      Number of estimators                   150                                200
              alpha                          0.7                                0.7
           Subsample                           1                                0.6

   For our early-bird submission on TIRA [16], we used a stacking ensemble, as described in Section
3.2.4, of the n-gram based models and the XGBoost model using descriptive statistics. In order to achieve higher
accuracy with our final software, we experimented with the addition of new models. We tested models
using n-grams from dictionaries containing the most typical tokens for each label and models that rely
on features extracted from running topic modeling on the training data.

             3.2.3 Dictionary-based n-gram models

    We also built an XGBoost model for each language based on a dictionary of n-gram features
retrieved by fitting separate vectorizers on each target group's subcorpus. To construct this dictionary,
we first selected the most frequent n-grams of each subcorpus (the number of the most frequent features
selected was tuned during the hyperparameter search from 200, 400, 800 and 2000 features). Then we
discarded those that were also amongst the most frequent n-grams of the other group (the number of the
most frequent n-grams considered here is given as a ratio of the feature selection number and was also
tuned during the hyperparameter search; the ratios examined were 1, 1.2 and 1.5). Finally, we combined
the two separate dictionaries and, according to a boolean hyperparameter, decided whether to also
include the most frequent n-grams of the whole corpus or not. For this model we investigated three more
text cleaning methods in addition to the previously mentioned M1 method, all three based on M1: M3
changes the standardized tokens in such a way that the vectorizer can differentiate them from the words
of the tweets; M4 is a lemmatized version of M3; and M5 is a version of M4 where all stopwords are
removed11. To find the optimal hyperparameters, we first defined a large space (373240 grid points) and
randomly tested 10% of the grid points, then narrowed down the hyperparameter space and ran a more
exhaustive grid search.
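
   A minimal sketch of this dictionary-forming step is given below, assuming scikit-learn's
CountVectorizer is used for the frequency counts; function and parameter names are ours and only
mirror the procedure described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

def build_dictionary(docs_pos, docs_neg, n_features=800, ratio=1.2):
    """Keep n-grams frequent in one group but not among the most frequent ones of the other."""
    def top_ngrams(docs, k):
        vec = CountVectorizer(ngram_range=(1, 3), max_features=k)
        vec.fit(docs)
        return set(vec.get_feature_names_out())

    top_pos = top_ngrams(docs_pos, n_features)
    top_neg = top_ngrams(docs_neg, n_features)
    # discard n-grams that are also among the (ratio * n_features) most frequent ones of the other group
    blocked_pos = top_ngrams(docs_neg, int(ratio * n_features))
    blocked_neg = top_ngrams(docs_pos, int(ratio * n_features))
    return sorted((top_pos - blocked_pos) | (top_neg - blocked_neg))

# The resulting dictionary can then be passed to a vectorizer restricted to this vocabulary:
# vec = CountVectorizer(ngram_range=(1, 3), vocabulary=build_dictionary(hateful_docs, other_docs))
```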
    To find the best hyperparameter set, we used a five-fold cross-validated grid search and finally
refitted the best model on the whole data. The cross-validated mean accuracies achieved this way are
65.7% and 80.5% for the English and Spanish data respectively. Table 3 contains the best
hyperparameters found.

Table 3
The best performing text cleaning methods, dictionary forming and vectorization parameters and
model hyperparameters for the dictionary-based n-gram models

 Parameter type and name                                            EN                       ES
 text cleaning                                                      M5                       M1
 dictionary forming and vectorization
     number of features                                             200                      800
     n-grams                                                        uni-, bi- and trigrams   uni-, bi- and trigrams
     min. global occurrence ratio                                   0.005                    0.005
     max. global occurrence ratio                                   0.95                     0.99
     whether to add most frequent features from full corpus         False                    True
 model hyperparameters
     subsample                                                      0.7                      0.8
     column sample by tree                                          0.9                      0.7
     column sample by node                                          1                        1
     max depth                                                      6                        3
     number of estimators                                           50                       50
     alpha                                                          0.1                      0.1

             3.2.4 Stacking ensemble



11 We used spaCy's built-in language models to lemmatize and remove stopwords.
   After identifying the best hyperparameters for the six models mentioned above with cross-validation,
we had to find a reliable ensemble method. To avoid overfitting this ensemble model to the training set,
we did not train it on the predictions of the six final trained models. Instead, we wanted to create a
dataset that represents the predictions our models produce on previously unseen data. To do this, we
refitted the six sub-models with the cross-validated hyperparameters five times, each time on a different
chunk of the original training data (each consisting of tweets from 160 users). The predictions given
by these refitted models to the 40 remaining users were appended to the training data of the ensemble
model. This training set thus consisted of predictions given to all 200 users in the training data, but
for each model type these predictions came from five different refitted instances. The sample created
this way can be interpreted as an approximation of a sample from the distribution of the predictions of
the six final models on the test set. We created a test set with the same method but with a different
split of the training data.
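
   The sketch below reproduces this idea with scikit-learn's cross_val_predict, which generates
out-of-fold predictions in one call instead of the manual five-way refitting described above; variable
names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def stacked_training_matrix(sub_models, feature_sets, y, cv=5):
    """Column-stack out-of-fold probability predictions of every sub-model."""
    columns = []
    for model, X in zip(sub_models, feature_sets):
        # each prediction comes from a model that never saw that user during fitting;
        # for the SVM, decision_function or a calibrated classifier would be used instead
        proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
        columns.append(proba)
    return np.column_stack(columns)

# meta_X = stacked_training_matrix(sub_models, feature_sets, y_train)
# ensemble = LogisticRegression().fit(meta_X, y_train)
# A new user is classified by feeding the refitted sub-models' probabilities to the ensemble.
```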
   We then used these constructed training and test sets to find the best ensemble from the following
three methods: majority voting, linear regression of predicted probabilities (this includes the simple
mean), and a logistic regression model. The best and most reliable results were given by the logistic
model; therefore, we used this model as our final ensemble method. Table 4 summarizes the logistic
regression coefficients for the probabilistic predictions of each model for both languages.

Table 4
Logistic regression coefficients for the predicted probabilities by each sub-model
               Model                                         Coefficient values
                                                   EN                               ES
                  LR                                0                              4.26
                SVM                               2.74                               0
                  RF                                0                                0
                 XGB                                0                                0
           Statistical XGB                          0                              2.59
  Dictionary-based n-gram XGB                     0.73                             -0.52

    The validity of this method is backed by the fact that our results on the training sets (an accuracy of
69% and 81% for the English and Spanish set respectively) were approximately the same as the final
results. Compared to the results last year (when only the coefficients of the RF model were 0 for both
languages), it seems that the different n-gram models are more similar this year. It is also worth pointing
out that in the case of the English ensemble model, the prediction of the model based on descriptive
statistics is not taken into account either, as opposed to the Spanish stacking model, where it is an
important feature. This could mean either that the information that can be captured with descriptive
statistics is mostly also captured by the n-gram models, or that for the English language different
statistical variables should be considered.

             3.2.5 Other experiments

    As hate speech tends to revolve around well-defined themes (it is often directed at women, religious
groups or ethnicities), we decided to experiment with topic models. We wanted to see how much fitting
a topic model as a dimension reduction step could help the discrimination of the categories. We used the
LDA (latent Dirichlet allocation) MALLET model available through the Gensim package. We chose 20 topics
based on two different methods: we examined the coherence scores of different runs and we also used a
Hierarchical Dirichlet Process (HDP) from the Gensim12 package. Both methods indicated that 20 topics
could work best for our data. To see how much topic models are able to contribute to the discrimination
of the groups, we first fitted the model on the whole set of training data. In this scenario, we saw that
machine learning models using the topic probabilities as predictive features perform comparably to our
other models. However, keeping in mind that in the test phase the model would receive unknown texts,

12 https://radimrehurek.com/gensim/
we decided to conduct some further experiments and investigated how the machine learning models
using the topic probabilities as features perform in a 5-fold cross-validation, i.e. on unseen data. In this
scenario the performance of our models proved to be significantly worse compared to the other models,
so we decided not to include our LDA model in our final software.
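
   For reference, the sketch below shows how such topic-probability features can be extracted. It uses
Gensim's built-in LdaModel rather than the MALLET wrapper we experimented with, and, as discussed above,
fitting the topic model on the full training set overstates its usefulness under cross-validation.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def topic_features(tokenized_feeds, num_topics=20):
    """Fit an LDA model and return one topic-probability vector per author feed."""
    dictionary = Dictionary(tokenized_feeds)
    bows = [dictionary.doc2bow(tokens) for tokens in tokenized_feeds]
    lda = LdaModel(corpus=bows, id2word=dictionary, num_topics=num_topics,
                   passes=10, random_state=42)
    features = []
    for bow in bows:
        # full topic distribution, including near-zero probabilities, as a fixed-length vector
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        features.append([dist.get(k, 0.0) for k in range(num_topics)])
    return features
```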

4. Results

    Overall, we tested two versions of our software. For the early bird testing, we used our approach
from 2020 fitted to the 2021 data. In our final submission, we added one more feature to the descriptive
statistical model and included the dictionary-based n-gram model. As Table 5 shows, this resulted in some improvement
in the case of the English corpus but the accuracy on the Spanish test set slightly decreased.

Table 5
Accuracies achieved by the two versions of our software during the cross-validation process and on
the test set
      Language                Early bird software                       Final software
                     CV (training set)        Test set       CV (training set)       Test set
          ES                80%                 80%                81%                 79%
          EN                69%                 68%                69%                 70%



5. Future Work

   An interesting phenomenon that we already faced during the PAN20 Author Profiling task [17] and
which remained typical for our submissions in the PAN21 competition is that our models in general
perform better in Spanish than in English. This is true for all of our individual models regardless of
the features they use, and for the final ensemble model as well. Investigating and understanding this
issue offers a challenging opportunity for further research.

6. Conclusion

    In this notebook, we summarized the process of developing our software for the PAN 2021 Profiling
Hate Speech Spreaders on Twitter task [18]. Our starting point was the software developed for the
PAN 2020 Profiling Fake News Spreaders task, which consisted of a stacking ensemble of different
machine learning models based on n-grams and descriptive statistical features of the text. In the hope
of achieving higher accuracy, we included some new descriptive statistical features and a new model
relying on n-grams from a dictionary containing the most frequent n-grams in the tweets belonging to
each group. To achieve the highest accuracies, we conducted an extensive grid search combined
with cross-validation to find the best hyperparameters for each of the models. Using the best
hyperparameters, the models were refitted on the full training dataset. To get a final prediction for each
user, we trained a logistic regression that used the probabilistic predictions of the sub-models as
features. Using the ensemble model, we were able to achieve nearly the same accuracy on the test set
as during the cross-validation process. Overall, our final software was able to identify hate speech
spreaders with a 70% accuracy among users that tweet in English, and with a 79% accuracy among
users that tweet in Spanish.

7. References
[1] Abro, S., Shaikh, Z. S., Khan, S., Mujtaba, G., Khand, Z. H. (2020). Automatic Hate Speech
    Detection using Machine Learning: A Comparative Study. In: International Journal of
    Advanced Computer Science and Applications 11(8) pp. 484-491. (2020)
[2] Anzovino M., Fersini E., Rosso P.: Automatic Identification and Classification of Misogynistic
    Language on Twitter. In: Proc. 23rd Int. Conf. on Applications of Natural Language to
    Information Systems, NLDB-2018, Springer-Verlag, LNCS(10859), pp. 57-64 (2018)
[3] Aravantinou, C., Simaki, V., Mporas, I., Megalooikonomou, V.: Gender Classification of Web
    Authors Using Feature Selection and Language Models. In: Speech and Computer Lecture
    Notes in Computer Science, pp. 226–33. (2015)
[4] Basile V., Bosco C., Fersini E., Nozza D., Patti V., Rangel F., Rosso P., Sanguinetti M.:
    SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in
    Twitter. Proc. SemEval 2019 (2019)
[5] Boulis, C., Ostendorf, M.: A quantitative analysis of lexical differences between genders in
    telephone conversations. In: Proceedings of the 43rd Annual Meeting on Association for
    Computational Linguistics - ACL '05. Morristown, NJ, USA: Association for Computational
    Linguistics, pp. 435-442 (2005)
[6] Brown, A.: What is so special about online (as compared to offline) hate speech? In: Ethnicities.
    18(3) pp. 297–326. (2018)
[7] Buda, J., Bolonyai, F.: An Ensemble Model Using N-grams and Statistical Features to Identify
    Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.)
    CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
[8] Fortuna P., Nunes S.: A survey on automatic detection of hate speech in text. ACM Computing
    Surveys (CSUR) 51.4 (2018)
[9] Garera, N., Yarowsky, D.: Modeling Latent Biographic Attributes in Conversational Genres.
    In: Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pp
    710-718 (2009)
[10]         Gelber, K., McNamara, L.: Evidencing the harms of hate speech. In: Soc Identities. 22
    pp. 324–341. (2016)
[11]         Gonzalez-Gallardo, C. E., Torres-Moreno, J. M., Rendon, A. M., Sierra, G.: Efficient
    social network multilingual classification using character, POS n-grams and Dynamic
    Normalization. In: IC3K 2016 - Proceedings of the 8th International Joint Conference on
    Knowledge Discovery, Knowledge Engineering and Knowledge Management. SciTePress, pp.
    307-314. (2016)
[12]         Hagen, M., Potthast, M., Büchner, M., and Stein, B. (2015). Webis: An ensemble for
    twitter sentiment detection. In Proceedings of the 9th International Workshop on Semantic
    Evaluation (SemEval 2015), pages 582–589; Steven Zimmerman, Udo Kruschwitz, Chris Fox
    (2018). Improving hate speech detection with deep learning ensembles. In Proc. of the Eleventh
    Int. Conf. on Language Resources and Evaluation (LREC 2018)
[13]         Marquardt, J., Farnadi, G., Vasudevan, G., Moens, M., Davalos, S., Teredesai, A., De
    Cock, M.: Age and gender identification in social media. In: CLEF 2014 working notes, pp.
    1129-1136. (2014)
[14]         Peersman, C., Daelemans, W., Van Vaerenbergh, L.: Predicting Age and Gender in
    Online Social Networks. In: Proceedings of the 3rd International Workshop on Search and
    Mining User-Generated Contents. New York, NY, USA: Association for Computing
    Machinery, pp. 37-44. (2011)
[15]         Poletto, F., Basile V., Sanguinetti M., Bosco C., Patti V.: Resources and benchmark
    corpora for hate speech detection: a systematic review. Language Resources & Evaluation.
    https://doi.org/10.1007/s10579-020-09502-8 (2020)
[16]         Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research
    Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing
    World - Lessons Learned from 20 Years of CLEF. Springer. (2019)
[17]         Rangel F., Giachanou A., Ghanem B., Rosso P.: Overview of the 8th Author Profiling
    Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: L. Cappellato, C. Eickhoff,
    N. Ferro, and A. Névéol (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR
    Workshop Proceedings.CEUR-WS.org (2020)
[18]        Rangel F., De la Peña Sarracén G. L., Chulvi B., Fersini E., Rosso P.: Profiling Hate
    Speech Spreaders on Twitter Task at PAN 2021. In: Faggioli G., Ferro N., Joly A., Maistro M.,
    Piroi F. (eds.) CLEF 2021 Labs and Workshops, Notebook Papers. CEUR Workshop
    Proceedings.CEUR-WS.org (2021)
[19]        Rangel F., Chulvi B., De la Peña G., Fersini E., Rosso P.: Profiling Hate Speech
    Spreaders on Twitter [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3692319 (2021)
[20]        Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes in
    Twitter. In: SMUC '10: Proceedings of the 2nd international workshop on Search and mining
    user-generated contents. Pp. 37-44. (2010)
[21]        Roy, P. K., Tripathy, A. K., Das, T. K., Gao, X. Z.: A Framework for Hate Speech
    Detection Using Deep Convolutional Neural Network. In: IEEE Access 8, pp. 204951-204962.
    (2020)
[22]        Santosh, K., Bansal, R., Shekhar, M., Varma, V.: Author Profiling: Predicting Age and
    Gender from Blogs Notebook for PAN at CLEF 2013. in: Working Notes for CLEF 2013
    Conference. (2013)
[23]        Sboev, A., Litvinova, T., Voronina, I., Gudovskikh, D., Rybka, R.: Deep Learning
    Network Models to Categorize Texts According to Author's Gender and to Identify Text
    Sentiment. In: Proceedings - 2016 International Conference on Computational Science and
    Computational Intelligence, CSCI 2016. Institute of Electrical and Electronics Engineers Inc.,
    pp. 1101-1106. (2017)
[24]        Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of Age and Gender on
    Blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.
    American Association for Artificial Intelligence (AAAI), pp. 199- 205. (2006)
[25]        Stout, L., Musters, R., Pool, C.: Author Profiling based on Text and Images Notebook
    for PAN at CLEF 2018. In: Working Notes of CLEF 2018 - Conference and Labs of the
    Evaluation Forum. (2018)
[26]        Volkova, S., Bachrach, Y.: On Predicting Sociodemographic Traits and Emotions from
    Communications in Social Networks and Their Implications to Online Self Disclosure. In:
    Cyberpsychology, Behavior, and Social Networking (Mary Ann Liebert Inc.) 2015/12, pp. 726-
    736. (2015)
[27]        Yildiz, T.: A comparative study of author gender identification. In: Turkish Journal of
    Electrical Engineering and Computer Science 27, pp. 1052-1064. (2019)
[28]        Zhang, X., Yu, Q.: Hotel reviews sentiment analysis based on word vector clustering.
    In: 2nd IEEE International Conference on Computational Intelligence and Applications
    (ICCIA), Beijing, pp. 260-264. (2017)