Author Profiling: Bot and Gender Prediction using a Multi-Aspect Ensemble Approach
Notebook for PAN at CLEF 2019

Hamed Babaei Giglou1, Mostafa Rahgouy1∗, Taher Rahgooy2, Mohammad Karami Sheykhlan1, and Erfan Mohammadzadeh1
1 University of Mohaghegh Ardabili, Computer Science Department, Ardabil, Iran
∗ mostafarahgouy@student.uma.ac.ir
{hamedbabaeigiglou, mohammadkaramisheykhlan, er.mohammadzadeh}@gmail.com
2 Tulane University, Computer Science Department, New Orleans, LA, USA
trahgooy@tulane.edu

Abstract. Author profiling is one of the most important tasks in authorship analysis. In the PAN 2019 shared task, the main focus is the identification of the author's gender. Compared to previous years, the author profiling task is expanded to include documents written by bots. To tackle this new challenge, we propose a two-phase approach. In the first phase, we exploit TF-IDF features of the documents to train a model that learns to detect documents generated by bots. Next, we train three models on character-level and word-level representations of the documents and aggregate their results by majority voting. Finally, we empirically show the effectiveness of the proposed approach on the PAN 2019 development dataset for author profiling.

Keywords: Author Profiling, User Modeling, Natural Language Processing, Supervised Machine Learning, Stacking Ensemble.

1 Introduction

As computational power grows and artificial intelligence techniques evolve, new challenges that were out of reach of machine learning only a few years ago keep emerging. One such problem is author profiling. Different from traditional authorship identification, in which a closed set of possible authors is known, author profiling aims to determine the characteristics of the authors: their age, gender, and native language, among others [13].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

In past years, multiple shared tasks have been organized on the topic of author profiling [11,8]. In this paper, we describe our approach for the Author Profiling shared task at PAN 2019 [9]. This year's Author Profiling task is the seventh iteration of the task and differs from previous years in that the data now includes bots.

The rest of the paper is organized as follows. Section 2 provides background and presents related work on author profiling. Section 3 introduces our approach. Results are covered in Section 4. In Section 5 we draw conclusions.

2 Related Work

The author profiling (AP) task was presented for the first time at PAN 2013 [10] in a single-domain setting and has been part of PAN ever since. The AP task has been expanded and modified into more complicated and challenging tasks by considering multilingual settings and, most recently, a bot setting [9]. It has been shown that TF-IDF-weighted n-gram features, both at the character and at the word level, are very successful at capturing gender distinctions in particular [11].

In work similar to ours, the authors of [3] undertook gender classification of Twitter users and achieved 76% accuracy when training their model on tweets alone, using word unigrams and bigrams and character 1- to 5-grams as features. In another related work, the authors of [4] applied LSA to the TF-IDF matrix with a linear SVM classifier. SVMs have long shown strong performance; however, they argued that logistic regression is worth considering in similar experiments. Finally, the authors of [1] combined traditional character n-gram analysis with a linear SVM classifier. They found the optimal values for their model with a dynamic, ad-hoc grid-search approach and achieved reasonable results at PAN 2017.

3 Proposed Approach

In this section, we first describe how we pre-process the data, then we present the features we extract from the text.
Finally, we describe the supervised models we use for classification.

3.1 Preprocessing

The first step in the proposed algorithm is to clean the input documents. In this step, we concatenate each user's tweets and then remove emojis, URLs, mentions, hashtags, smileys, numbers, and reserved words (RT, FAV) using the tweet-preprocessor1 Python library. Next, using the NLTK 3.0 toolkit [2], we remove stop-words from the tweets and lemmatize the words using WordNetLemmatizer (from NLTK).

1 https://pypi.org/project/tweet-preprocessor/

3.2 Type Modeling

Term Frequency-Inverse Document Frequency (TF-IDF) is a common feature transformation technique for detecting an author's style. First, we build a vocabulary from the preprocessed train set for each language, keeping only terms with a frequency of at least 5. Next, using the scikit-learn [7] TfidfVectorizer, we convert the collection of raw tweets into a matrix of TF-IDF features. Finally, we use the TF-IDF features in combination with a linear SVM. The authors of [1] used the grid-search tool shipped with the scikit-learn machine learning library to obtain the optimized values listed in Table 1, and we adopted these values for our type-detection system. Since we did not have time to try different settings of these parameters with different classifiers, we only use them with a linear SVM classifier. All runs were performed with 5-fold cross-validation, optimizing for accuracy. We also use sub-linear term-frequency scaling, which uses 1 + log(TF) instead of TF.

Module          Parameter       Possible values
Feature         decode_error    replace
extraction      norm            l1, l2
                strip_accents   True, False
                sublinear_tf    True, False
Transformation  scaling         MaxAbsScaler
Classifier      C               0.01, 0.1, 1, 5
                kernel          linear

Table 1. Parameters obtained with grid search. Parameters with results in bold are optimal values.
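As an illustration of the cleaning step of Section 3.1, the sketch below approximates with plain regular expressions what the tweet-preprocessor library does for URLs, mentions, hashtags, numbers, and reserved words. Emoji and smiley handling, as well as the NLTK stop-word removal and lemmatization, are omitted for brevity; the patterns and the function name are our own illustration, not the library's API.

```python
import re

# Minimal regex approximation of the tweet cleaning described in Section 3.1.
# (The actual system uses the tweet-preprocessor library; emoji/smiley
# removal and NLTK stop-word removal / lemmatization are omitted here.)
PATTERNS = [
    r"https?://\S+",    # URLs
    r"@\w+",            # mentions
    r"#\w+",            # hashtags
    r"\b(?:RT|FAV)\b",  # reserved words
    r"\d+",             # numbers
]

def clean_tweet(text: str) -> str:
    for pat in PATTERNS:
        text = re.sub(pat, "", text)
    return " ".join(text.split())  # collapse leftover whitespace

print(clean_tweet("RT @user Check https://t.co/x #nlp 2019!"))  # → "Check !"
```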
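A minimal sketch of the type-detection pipeline of Section 3.2 in scikit-learn: TfidfVectorizer with sub-linear TF scaling, MaxAbsScaler, and LinearSVC with the grid-searched C. The toy documents, labels, and the lowered min_df are illustrative assumptions only; the actual system trains per language on the full preprocessed train set with a term-frequency threshold of 5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

# Toy stand-in for the per-language training data: one concatenated tweet
# stream per author, labelled with the account type.
docs = [
    "win free prize click link win free prize click link",
    "win free prize click here win free now click link",
    "had a lovely walk today and coffee with friends",
    "watched a great movie tonight with my friends",
]
labels = ["bot", "bot", "human", "human"]

# Pipeline mirroring Section 3.2: sub-linear TF-IDF, MaxAbsScaler, linear SVM.
# min_df is lowered to 1 for this toy corpus (the paper keeps terms with
# frequency >= 5).
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, norm="l2", min_df=1),
    MaxAbsScaler(),
    LinearSVC(C=1),
)
model.fit(docs, labels)
print(model.predict(["free prize click the link now"])[0])
```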
3.3 Gender Modeling

The first part of the task is to identify the type of user; when the user is human, we also have to identify their gender. Since errors from the first phase propagate (an author wrongly predicted as a bot can never receive a gender label), we have to build an accurate gender system to compensate for wrong predictions in the first phase. Because ensemble methods usually produce more accurate solutions than a single model, we use an ensemble for gender profiling, making the final prediction for the test instances by majority voting. Since there are two classes (male and female), we use three different models in the ensemble so that the vote can never tie: TF-IDF [14], N-gram, and Doc2Vec [5].

TF-IDF. For the TF-IDF model, we use the same setup as for type modeling in the previous section, with a linear SVM classifier.

N-gram. To capture character-level features, we use character 4-gram frequencies in combination with a linear SVM. The parameters used for the 4-gram model are listed in Table 2; they were obtained with the grid-search tools from the scikit-learn library. All runs were performed with 5-fold cross-validation, optimizing for accuracy.

Module          Parameter                   Possible values
Feature         minimal document frequency  4, 5
extraction      n-gram order                3, 4, 5
                lowercase                   True, False
                strip_accents               True, False
Transformation  scaling                     MaxAbsScaler
Classifier      C                           0.01, 0.1, 1, 5
                kernel                      linear

Table 2. Parameters obtained with grid search. Parameters with results in bold are optimal values.

Doc2Vec. Doc2Vec implements the Paragraph Vector model, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. We build on [6], which showed that under certain settings these neural network-based features outperform traditional features when used with a logistic regression classifier.
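Returning to the N-gram model above, the character 4-gram features behind it (n = 4 was selected by grid search, Table 2) can be illustrated with a minimal counter. This is a sketch of the feature extraction only; the actual system feeds such counts, TF-IDF-weighted, into a linear SVM.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams of a document; with n = 4 this
    is the feature space of the second gender model (Table 2)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

feats = char_ngrams("she said")
print(feats["she "])  # → 1 (the 4-gram 'she ' occurs once)
```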
We train a Doc2Vec model for each language on the train set using the Gensim [12] library, with MaxAbsScaler as the scaling transformation, as the third model of the ensemble for gender identification.

3.4 Other Features

In the process of feature engineering, we explored many other ideas, some of which performed poorly and were therefore not included in our final approach. Still, we feel some of them are worth mentioning.

Emoji. We hypothesized that emojis play an important role in identifying gender, so ignoring them might lose valuable information about the author's gender. As a result, we used emojis in various ways: we counted the occurrences of emojis in a given document, and we also used emojis as sentiment-analysis features to distinguish between male and female authors. However, using emojis as features degraded our results, so we decided to drop them.

Contracted Wordforms. This feature is based on discrepancies in spelling for words that allow contracted forms, e.g., I will (I'll), are not (aren't). People typically favor one of the alternatives, so we use apostrophe-contracted forms as discriminative features for detecting the gender of each author.

Quotation Marks. Some authors prefer single quotation marks, others double ones. We use the difference between the numbers of single and double quotes in a given document to see which group (male, female) prefers which style.

Sentence Length. We noticed that some authors prefer to write long sentences, using more conjunctions, whereas others use short sentences. We therefore use sentence length as a feature, and we also count the conjunctions in a given document and treat this count as a feature.

Negations. Another feature we use is based on negation forms, e.g., impossible (not possible), to identify similarities between authors.
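Several of the stylometric counts described above, such as the contracted forms and the quotation-mark preference, are cheap to compute. A minimal sketch, with an illustrative function name of our own (excluding contraction apostrophes from the quote count is one reasonable design choice, not necessarily the one used in the system):

```python
import re

def style_counts(text: str) -> dict:
    """Candidate gender features: apostrophe contractions (e.g. I'll,
    aren't) and the single- vs. double-quote preference."""
    contractions = len(re.findall(r"\b\w+'\w+\b", text))
    # Exclude apostrophes that belong to contractions from the quote count.
    quote_diff = text.count("'") - text.count('"') - contractions
    return {"contractions": contractions, "quote_diff": quote_diff}

print(style_counts('I\'ll say she said "hi" and left'))
```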
We counted the negations in each given document and added them to the other features, but did not obtain any notable improvement.

3.5 Final System

As our final run, we submitted a linear SVM system with TF-IDF features for type identification (human vs. bot); for users predicted as human, an ensemble performs gender identification (male vs. female) using TF-IDF and 4-gram features with linear SVMs and Doc2Vec with a logistic regression classifier. This process is visualized in Figure 1.

Figure 1. The architecture of the proposed approach.

4 Experimental Results

Table 3 presents the performance of the three gender-identification models described in Section 3. The ensemble of all three models is clearly the best on average.

Language  TF-IDF  N-gram  Doc2Vec  Ensemble (TF-IDF + N-gram + Doc2Vec)
EN        0.7906  0.7887  0.7709   0.8032
SP        0.6543  0.6347  0.5695   0.6521

Table 3. Accuracy results on the development dataset for each gender model.

Furthermore, Table 4 shows the detailed author profiling results on the PAN 2019 development dataset. The average score obtained by this approach is 0.8330 for English and 0.7358 for Spanish.

          Type model (TF-IDF)        Gender model (ensemble)
Language  Accuracy  TP    Test size  Accuracy  TP   Test size  Average
EN        0.8629    1070  1240       0.8032    498  620        0.8330
SP        0.8195    750   920        0.6521    300  460        0.7358

Table 4. Detailed accuracy results on the development dataset for each model.

5 Conclusion

In this paper, we proposed an algorithm for the Author Profiling task at PAN 2019. We presented a supervised approach that uses a TF-IDF representation of the documents to distinguish between humans and bots, and, for gender detection, an ensemble approach that uses TF-IDF, N-gram, and Doc2Vec feature extraction and makes predictions with an ensemble of diverse models including LinearSVC and logistic regression.
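The majority vote that combines the three gender models can be sketched in a few lines; the label lists below are hypothetical model outputs, not results from the system.

```python
from collections import Counter

def majority_vote(predictions: list) -> str:
    """Return the label predicted by most of the models. With two classes
    and three voters (an odd number), the vote can never tie."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of the TF-IDF, 4-gram, and Doc2Vec gender models:
print(majority_vote(["female", "male", "female"]))  # → "female"
```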
We used parameter values determined dynamically by an ad-hoc grid-search approach to find the optimal settings. Our evaluation shows that our approach is very capable of distinguishing between humans and bots; identifying gender, however, needs further work. The proposed algorithm is implemented in Python and published on GitHub2.

2 https://github.com/HamedBabaei/PAN2019_bots_gender_profiling

References

1. Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M.: N-GrAM: New Groningen Author-profiling Model—Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11-14 September, Dublin, Ireland. CEUR-WS.org (Sep 2017), http://ceur-ws.org/Vol-1866/
2. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc. (2009)
3. Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1301–1309. EMNLP '11, Association for Computational Linguistics, Stroudsburg, PA, USA (2011), http://dl.acm.org/citation.cfm?id=2145432.2145568
4. Daneshvar, S., Inkpen, D.: Gender Identification in Twitter using N-grams and LSA—Notebook for PAN at CLEF 2018. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers, 10-14 September, Avignon, France. CEUR-WS.org (Sep 2018), http://ceur-ws.org/Vol-2125/
5. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML (2014)
6. Markov, I., Gómez-Adorno, H., Posadas-Durán, J.P., Sidorov, G., Gelbukh, A.F.: Author profiling with doc2vec neural network-based document embeddings. In: MICAI (2016)
7.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011) 8. Potthast, M., Pardo, F.M.R., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of pan’17 - author identification, author profiling, and author obfuscation. In: CLEF (2017) 9. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019) 10. Rangel Pardo, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the Author Profiling Task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, 23-26 September, Valencia, Spain. CEUR-WS.org (Sep 2013) 11. Rangel Pardo, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal. CEUR-WS.org (Sep 2016), http://ceur-ws.org/Vol-1609/16090750.pdf 12. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en 13. Rodrigo Ribeiro Oliveira, R.d.: Using Character n-grams and Style Features for Gender and Language Variety Classification—Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11-14 September, Dublin, Ireland. 
CEUR-WS.org (Sep 2017), http://ceur-ws.org/Vol-1866/ 14. Salton, G., McGill, M.J.: Introduction to modern information retrieval (1986)