=Paper= {{Paper |id=Vol-2380/paper_191 |storemode=property |title=Bots and Gender Prediction Using Language Independent Stylometry-based Approach |pdfUrl=https://ceur-ws.org/Vol-2380/paper_191.pdf |volume=Vol-2380 |authors=Shaina Ashraf,Omer Javed,Muhammad Adeel,Haider Iqbal,Rao Muhmmad Adeel Nawab |dblpUrl=https://dblp.org/rec/conf/clef/AshrafJAIN19 }} ==Bots and Gender Prediction Using Language Independent Stylometry-based Approach== https://ceur-ws.org/Vol-2380/paper_191.pdf
              Bots and Gender Prediction Using Language
              Independent Stylometry-based Approach
                            Notebook for PAN at CLEF 2019

                   Shaina Ashraf, Omer Javed, Muhammad Adeel, Haider Ali
                                Rao Muhammad Adeel Nawab

   Department of Computer Science, COMSATS University Islamabad, Lahore Campus,
                                 Pakistan.
                    shainaashraf@cuilahore.edu.pk, {omerjaved11,
                    mirzaadeel6233, haideriqbalm11}@gmail.com,
                                        adeelnawab@cuilahore.edu.pk



         Abstract This paper describes our participation in the Bots and Gender Profiling
         task at PAN 2019. The aim of this task is first to classify a profile as either
         bot or human. If the profile is written by a human, it should be further classified
         as male or female. Our proposed approach is based on language independent
         stylometry features. A total of 27 language independent stylometry features (18
         character-based and 9 emotion-based) are used to build the system for the Bots and
         Gender Profiling task. On the training dataset, Accuracy scores of 0.97 and 0.80
         are obtained for English on the bot/human and male/female classification tasks
         respectively; for Spanish, the corresponding scores are 0.93 and 0.75. On test
         dataset 1, the scores are 0.92 and 0.76 for English and 0.86 and 0.75 for Spanish.
         On test dataset 2, the scores are 0.92 and 0.76 for English and 0.88 and 0.72 for
         Spanish.



         Keywords: Bot and Gender Profiling, Author Profiling, Stylometry-based Features,
         Emotion-based Features, Emojis




Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019,
Lugano, Switzerland.

1   Introduction
   As the use of social networking platforms such as Facebook, Twitter, Instagram,
blogs and community forums grows, methods of communication are changing. People feel
free to talk, discuss and post reviews and comments on such channels more frequently.
Many people rely on social forums, e.g. Reddit, Yelp, Quora and Amazon message boards,
to get information, feedback and recommendations for different products and services.
However, a large number of users on social networking sites abuse such forums by
creating fake profiles, spam and bots. In recent years, bots have been used to pose as
humans on social networking platforms in order to influence other social media users
for ideological, political or commercial purposes. Bots can exaggerate the popularity
of products by writing positive reviews and giving high ratings. They can also sabotage
the reputation of competing products through negative reviews and ratings. Furthermore,
bots are widely used to spread fake news. Therefore, it is important to develop author
profiling systems which can discriminate bot profiles from human ones.
   This study presents a stylometry-based approach to the problem of Bots and Gender
Profiling. A total of 27 language independent features are used, which can be broadly
categorized into (1) character-based features and (2) emotion-based features. A range
of classifiers, including Logistic Regression, Random Forest, LinearSVC, BernoulliNB,
MultinomialNB and SVC (Support Vector Classifier), has been applied to train and test
our proposed system. The developed system was deployed on TIRA [9] for final evaluation
on the test datasets. A detailed comparison of all the systems presented in the PAN
2019 Bot and Gender Profiling task can be found in [10].
   The rest of this paper is organized as follows: Section 2 describes related work on
author profiling, Section 3 presents our proposed approach, Section 4 describes the
experimental setup, Section 5 presents results and their analysis. Finally, Section 6
concludes the paper with future work directions.

2   Related Work
    In previous studies, researchers have explored different methods (stylometry-based,
content-based, topic-based, emotion-based and deep learning) for identifying
demographic traits of an author on social media. In [1], the authors applied a
stylometry-based approach to cross-genre author profiling. Their feature set included
6 vocabulary richness features, 26 character-based features, 16 syntactic features and
7 lexical features. With this feature set they obtained an Accuracy of 0.576 for gender
classification, 0.371 for age classification and 0.256 for combined classification of
age and gender.
    In [3], the authors distinguished humans from bots by learning tweet patterns and
then further categorized bots into classes, i.e. spam bots, consumption bots and
broadcast bots. They proposed a new profiling framework built on entropy-based features
such as the timing of tweets, hashtags, URLs and follower counts. The authors worked
with data from nearly 159 thousand bots and humans on Twitter. Their experimental
results indicate that the framework is effective for both malicious and benign bots and
reveals interesting behavioral traits. In [14], the authors investigated content-based
features (word and character n-grams) and 64 stylometry-based features (11 lexical
word-based, 47 lexical character-based and 6 vocabulary measures) for the
identification of gender and age traits on multilingual corpora.
   In [18], the authors focused on instance-based, prototype-based and distance-based
classification strategies. They extracted features such as the frequency of negative
and positive emoticons, retweet marks, the number of hashtags and part-of-speech tags
for the gender and language identification tasks.
    In [6], the authors detected bots in Wikidata by extracting comment-based features
of users. These comment-based features help to examine the editing behavior of
registered and non-registered users. The authors used a random forest classifier and a
gradient boosting classifier and applied hyperparameter optimization to both models.
The models were reported to perform well on registered user information.
   In [19], the authors combined image-based and text-based features for gender
identification. They represented text using a bag-of-terms (BoT) model and images using
a CNN model.

3    Proposed Language Independent Stylometry-based Approach
     The writing style of an author helps to identify various attributes of that author,
for example age, gender, personality type, occupation and political interest. It is
expected that the writing style of a human differs significantly from that of a bot.
Therefore, stylometry features [13] are likely to be very helpful in discriminating bot
profiles from human ones. Another major difference between a human profile and a bot
profile is the usage of emotions. A profile generated by a bot is likely to be plain
text, whereas a human profile is likely to be a mixture of text and emotions (expressed
through emojis). Considering these two factors, our proposed approach uses a
combination of character-based stylometry features and emotion-based features to
distinguish humans from bots. Note that our proposed approach uses language independent
stylometry features, i.e. they can be applied to any language for bot and human
profiling.
    In our proposed system, a total of 27 stylometry-based features are used (18
character-based and 9 emotion-based). The set of character-based features includes:
(1) url_count, (2) space_count, (3) capital_count, (4) text_length,
(5) curly_brackets_count, (6) round_brackets_count, (7) underscore_count,
(8) question_mark_count, (9) exclamation_mark_count, (10) dollar_mark_count,
(11) ampersand_mark_count, (12) hash_count, (13) tag_count, (14) slashes_count,
(15) operator_count, (16) punc_count, (17) line_count, (18) word_count. The set of
emotion-based features includes: (1) emoji_count, (2) face_smiling, (3) face_affection,
(4) face_tongue, (5) face_hand, (6) face_neutral_skeptical, (7) face_concerned,
(8) monkey_face, (9) emotions (for details see Table 3.1).
Table 3.1 List of language independent stylometry-based features used in the development of
             our proposed system for PAN 2019 Bot and Gender Profiling task

 No    Feature                       Description

  1    emoji_count                   Count all kinds of emojis
  2    face_smiling                  Count 😀😃😄😁😆😅🤣😂🙂🙃😉😊😇
  3    face_affection                Count 🥰😍🤩😘😗🙂😚😙
  4    face_tongue                   Count 😋😛😜🤪😝🤑
  5    face_hand                     Count 🤗🤭🤫🤔
  6    face_neutral_skeptical        Count 🤐🤨😐😑😶😏😒🙄😬🤥
  7    face_concerned                Count 😕😟🙁☹😮😯😲😳🥺😦😧😨😰😥😢😭😱😖😣😞
  8    monkey_face                   Count 🙈🙉🙊
  9    emotions                      Count 💋💌💘💝💖💗💓💞💕💟❣💔❤🧡💛💚💙💜🖤
 10    url_count                     Count all kinds of links/URLs
 11    space_count                   Count spaces
 12    capital_count                 Count capital letters
 13    text_length                   Total length of message
 14    curly_brackets_count          Count { }
 15    round_brackets_count          Count ( )
 16    underscore_count              Count _
 17    question_mark_count           Count ?
 18    exclamation_mark_count        Count !
 19    dollar_mark_count             Count $
 20    ampersand_mark_count          Count &
 21    hash_count                    Count #
 22    tag_count                     Count @
 23    slashes_count                 Count slashes // / \
 24    operator_count                Count operators +-*/%<>^|
 25    punc_count                    Count punctuation marks '",.:;`
 26    line_count                    Count line breaks \n
 27    word_count                    Count words A-Za-z
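
The counts in Table 3.1 can be computed with plain string operations. The following is a minimal sketch, not the authors' implementation: the function name `extract_features`, the URL pattern, and the choice to show only a subset of the 27 features (with two of the emoji groups) are assumptions made for illustration.

```python
import re

# Two of the emoji groups from Table 3.1, written as code point escapes.
# The remaining groups would be added in the same way (assumption: one
# code point per emoji, which holds for the emojis listed in the table).
EMOJI_GROUPS = {
    "face_hand": "\U0001F917\U0001F92D\U0001F92B\U0001F914",
    "monkey_face": "\U0001F648\U0001F649\U0001F64A",
}

URL_PATTERN = re.compile(r"https?://\S+")  # assumed definition of a URL
WORD_PATTERN = re.compile(r"[A-Za-z]+")    # "Count words A-Za-z" in Table 3.1


def extract_features(text: str) -> dict:
    """Return a subset of the Table 3.1 counts for one profile's text."""
    features = {
        "url_count": len(URL_PATTERN.findall(text)),
        "space_count": text.count(" "),
        "capital_count": sum(1 for ch in text if ch.isupper()),
        "text_length": len(text),
        "question_mark_count": text.count("?"),
        "exclamation_mark_count": text.count("!"),
        "hash_count": text.count("#"),
        "tag_count": text.count("@"),
        "line_count": text.count("\n"),
        "word_count": len(WORD_PATTERN.findall(text)),
    }
    # One count per emoji group: occurrences of any member of the group.
    for name, members in EMOJI_GROUPS.items():
        features[name] = sum(text.count(ch) for ch in members)
    return features
```

Because every feature is a raw count over characters, the same function can be applied unchanged to English and Spanish profiles.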




Table 4.1 Distribution of data in the PAN19-author-profiling-training corpus for Bot and
                                     Gender Profiling task

   Language        Total Profiles         Bot               Male           Female

   English             4120              2060               1030             1030
   Spanish             3000              1500               750              750

4    Experimental Setup
   This section describes the main statistics of the training corpus, the evaluation
methodology and the evaluation measure.
4.1 Training Corpus
    We used the PAN19-author-profiling-training dataset to train our proposed system. We
performed the author profiling task for both languages, i.e. English and Spanish. The
English training corpus contains 4,120 author profiles, each containing 100 tweets in
English, whereas the Spanish training corpus contains 3,000 author profiles, each
consisting of 100 tweets in Spanish (see Table 4.1 for detailed statistics of both
corpora). Note that, in our proposed approach, no pre-processing or cleaning operations
were performed on either the training or the test datasets, because URLs and hashtags
were used as features in the classification task.
4.2 Evaluation Methodology
     The tasks of predicting an author's type (bot or human) and determining gender from
his/her text are treated as supervised document classification tasks. We performed two
binary classification tasks: first distinguishing bots from humans, and then
identifying the gender of the human profiles. A range of classifiers was explored,
including Logistic Regression, Random Forest, LinearSVC, BernoulliNB, MultinomialNB and
SVC, to train and test our proposed system. The numeric values generated by the 27
stylometry features (see Section 3) were used as input to these classifiers.
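
The two binary decisions can be chained so that gender is predicted only for profiles labelled as human. The sketch below illustrates that control flow under stated assumptions: `profile_author` and the two toy classifiers are hypothetical names, and the toy decision rules are stand-ins, not the trained models described in this paper.

```python
from typing import Callable, Optional, Sequence, Tuple

FeatureVector = Sequence[float]
Classifier = Callable[[FeatureVector], str]


def profile_author(
    features: FeatureVector,
    type_classifier: Classifier,
    gender_classifier: Classifier,
) -> Tuple[str, Optional[str]]:
    """Stage 1 decides bot vs. human; stage 2 runs only for humans."""
    author_type = type_classifier(features)
    if author_type == "bot":
        return "bot", None  # no gender is predicted for bots
    return "human", gender_classifier(features)


# Stand-in decision rules for illustration only (NOT the paper's models).
def toy_type_classifier(features: FeatureVector) -> str:
    return "human" if features[0] > 0 else "bot"


def toy_gender_classifier(features: FeatureVector) -> str:
    return "female" if features[1] > 2 else "male"
```

Any trained model can be plugged in by wrapping its prediction call in a function that takes one feature vector and returns a label.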
4.3 Evaluation Measure
   Evaluation is carried out using the Accuracy measure. Accuracy is defined as the
ratio of correctly predicted profiles to the total number of profiles.

$$\text{Accuracy} = \frac{\text{Number of correctly classified profiles}}{\text{Total number of profiles}}$$
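
As a small illustration of this measure, the function below computes Accuracy from two equally long lists of labels. The function name and the example labels are assumptions made for this sketch.

```python
def accuracy(predicted: list, gold: list) -> float:
    """Ratio of correctly classified profiles to the total number of profiles."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one profile is required")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Three of four hypothetical profiles are classified correctly.
example = accuracy(["bot", "human", "human", "bot"],
                   ["bot", "human", "bot", "bot"])
```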




5    Results and Analysis
5.1 Results on Training Dataset
    Table 5.1 presents the Accuracy results of our proposed approach on the
PAN19-author-profiling-training dataset using 6 different machine learning algorithms.
The best results are obtained using the Random Forest classifier for both English
(0.970 Accuracy for bot/human and 0.802 for gender prediction) and Spanish (0.935
Accuracy for bot/human and 0.755 for gender prediction). These results are promising
and suggest that the language independent character-based and emotion-based features
used in our proposed approach are useful for discriminating a bot from a human as well
as for distinguishing a male profile from a female one.
Table 5.1 Results obtained on PAN19-author-profiling-training corpus using our proposed
                   approach for PAN 2019 Bot and Gender Profiling task

                                   English                          Spanish
 Classifier             Bot/Human    Male/Female         Bot/Human    Male/Female

 Logistic Regression       0.906          0.7303            0.871             0.678
 Random Forest             0.970          0.802             0.935             0.755
 LinearSVC                 0.869          0.5209            0.749             0.577
 BernoulliNB               0.904          0.629             0.822             0.603
 MultinomialNB             0.813          0.577             0.796             0.657
 SVC                       0.479          0.490             0.505             0.469



5.2 Results on Test Datasets
    In the PAN 2019 Bot and Gender Profiling task, the final evaluation is carried out
on two test corpora: (1) the PAN19-author-profiling-test-dataset1 corpus and (2) the
PAN19-author-profiling-test-dataset2 corpus. Table 5.2 shows the results obtained using
our proposed language independent stylometry-based approach on both test corpora. On
the PAN19-author-profiling-test-dataset1 corpus, Accuracy scores of 0.9280 and 0.7652
are obtained for English on the bot/human and male/female classification tasks
respectively, whereas for Spanish the scores are 0.8611 and 0.7556. Similarly, on the
PAN19-author-profiling-test-dataset2 corpus, Accuracy scores of 0.9227 and 0.7583 are
obtained for English on the bot/human and male/female classification tasks
respectively, whereas for Spanish the scores are 0.8839 and 0.7261.
Table 5.2 Results obtained on PAN19-author-profiling-test-dataset1 and PAN19-author-
profiling-test-dataset2 corpora using our proposed approach for PAN 2019 Bot and Gender
                                      Profiling task

                                                 English                    Spanish
    Corpus                                  Type:      Gender:         Type:      Gender:
                                            Bot/Human  Male/Female     Bot/Human  Male/Female

    PAN19-author-profiling-test-dataset1    0.9280     0.7652          0.8611     0.7556
    PAN19-author-profiling-test-dataset2    0.9227     0.7583          0.8839     0.7261

   It can be noted that Accuracy results for English tweets are higher than for
Spanish, even though the same language independent features are extracted for both
languages. A possible reason is that Spanish profiles in the training and test datasets
of the PAN 2019 Bot and Gender Profiling task may contain text in more than one
language, since the datasets provided by the PAN organizers contain raw tweets and
re-tweets, i.e. no pre-processing and/or cleaning is performed. Consequently,
performance drops for Spanish. These results also show that Accuracy for the
identification of type (bot/human) is much higher than for gender prediction, which
indicates that our proposed stylistic features are more suitable for discriminating
bots from humans than for gender discrimination. This is probably because bots tend to
generate profiles without emotions, whereas humans generate profiles with a combination
of emotions and text, which makes it easier for the classifiers to distinguish humans
from bots.
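
To make the size of the difference between type and gender prediction concrete, the short sketch below subtracts the two scores for each row of Table 5.2. The numbers are copied from that table; the dictionary layout and function name are assumptions made for illustration.

```python
# Accuracy scores copied from Table 5.2, keyed by (test corpus, language).
TEST_ACCURACY = {
    ("dataset1", "English"): {"type": 0.9280, "gender": 0.7652},
    ("dataset1", "Spanish"): {"type": 0.8611, "gender": 0.7556},
    ("dataset2", "English"): {"type": 0.9227, "gender": 0.7583},
    ("dataset2", "Spanish"): {"type": 0.8839, "gender": 0.7261},
}


def type_gender_gap(scores: dict) -> float:
    """Difference between bot/human and male/female Accuracy."""
    return round(scores["type"] - scores["gender"], 4)


gaps = {key: type_gender_gap(scores) for key, scores in TEST_ACCURACY.items()}
```

In every row the gap is positive, ranging from 0.1055 (Spanish, test dataset 1) to 0.1644 (English, test dataset 2).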

6    Conclusion
    This paper presents a language independent stylometry-based approach for the PAN
2019 Bot and Gender Profiling task. A total of 27 stylistic features (18 character-based
and 9 emotion-based) were used to build the proposed system. A range of classifiers was
applied, including Logistic Regression, Random Forest, LinearSVC, BernoulliNB,
MultinomialNB and SVC. Promising results were obtained on both test datasets in the
final evaluation.
   In the future, we plan to apply deep learning methods to the PAN 2019 Bot and Gender
Profiling task.
References:
   1.   Ashraf, S., Iqbal, H. R., & Nawab, R. M. A. (2016, September). Cross-Genre Author
        Profile Prediction Using Stylometry-Based Approach. In CLEF (Working Notes) (pp.
        992-999).

   2.   Ferrara, E., Varol, O., Menczer, F., & Flammini, A. (2016, March). Detection of
        promoted social media campaigns. In tenth international AAAI conference on web and
        social media.

   3.   Oentaryo, R. J., Murdopo, A., Prasetyo, P. K., & Lim, E. P. (2016, November). On
        profiling bots in social media. In International Conference on Social Informatics
        (pp. 92-109). Springer, Cham.

   4.   Shu, K., Wang, S., & Liu, H. (2018, April). Understanding user profiles on social
        media for fake news detection. In 2018 IEEE Conference on Multimedia Information
        Processing and Retrieval (MIPR) (pp. 430-435). IEEE.

   5.   Rangel, F., Rosso, P., Potthast, M., & Stein, B. (2017). Overview of the 5th author
        profiling task at pan 2017: Gender and language variety identification in twitter.
        Working Notes Papers of the CLEF.

   6.   Hall, A., Terveen, L., & Halfaker, A. (2018). Bot Detection in Wikidata Using
        Behavioral and Other Informal Cues. Proceedings of the ACM on Human-Computer
        Interaction, 2(CSCW), 64.

   7.   Rangel, Francisco, Paolo Rosso, Manuel Montes-y-Gómez, Martin Potthast, and
        Benno Stein. "Overview of the 6th author profiling task at pan 2018: multimodal
        gender identification in Twitter." Working Notes Papers of the CLEF (2018).

   8.   Daelemans, W., Kestemont, M., Manjavacas, E., Potthast, M., Rangel, F., Rosso, P.,
        Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.:
        Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain
        Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M.,
        Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro,
        N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association
        (CLEF 2019). Springer (Sep 2019)

   9.   Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research
        Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in
        a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)

   10.  Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots
        and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.)
        CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)

   11.  Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.:
        Evaluations Concerning Cross-genre Author Profiling. In: Working Notes Papers of
        the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org
        (2016)

   12.  Soler, J., and Wanner, L. 2016. A semi-supervised approach for gender
        identification. In Proceedings of the 10th International Conference on Language
        Resources and Evaluation (LREC-2016), Portorož, Slovenia, European Language
        Resources Association (ELRA).

   13.  Flekova, L., Ungar, L., and Preotiuc-Pietro, D. 2016. Exploring stylistic variation
        with age and income on Twitter. In Proceedings of the 54th Annual Meeting of the
        Association for Computational Linguistics (ACL 2016), Berlin, Germany.

   14.  Fatima, M., Hasan, K., Anwar, S., and Nawab, R. M. A. 2017. Multilingual author
        profiling on Facebook. Information Processing & Management 53(4): 886-904.

   15.  Przybyla, P., and Teisseyre, P. 2015. What do your look-alikes say about you?
        Exploiting strong and weak similarities for author profiling - Notebook for PAN at
        CLEF 2015. In Evaluation Labs and Workshop - Working Notes Papers (CLEF-2015),
        Toulouse, France. CEUR-WS.org.

   16.  Rangel, F., Rosso, P., Franco, M. A Low Dimensionality Representation for Language
        Variety Identification. In: Proceedings of the 17th International Conference on
        Intelligent Text Processing and Computational Linguistics (CICLing'16),
        Springer-Verlag, LNCS(9624), pp. 156-169, 2018

   17.  Shrestha, P., Rey-Villamizar, N., Sadeque, F., Pedersen, T., Bethard, S., and
        Solorio, T. 2016. Age and gender prediction on health forum data. In Proceedings of
        the 10th International Conference on Language Resources and Evaluation (LREC-2016).
        European Language Resources Association (ELRA).

   18.  Adame-Arcia, Y., Castro-Castro, D., Ortega-Bueno, R., Muñoz, R.: Author Profiling,
        instance-based Similarity Classification. Notebook for PAN at CLEF 2017 (2017)

   19.  Taniguchi, T., Sakaki, S., Shigenaka, R., Tsuboshita, Y., Ohkuma, T.: A Weighted
        Combination of Text and Image Classifiers for User Gender Inference, pages 87-93.
        Association for Computational Linguistics (2015)