=Paper= {{Paper |id=Vol-2380/paper_174 |storemode=property |title=Author Profiling on Social Media: An Ensemble Learning Approach using Various Features |pdfUrl=https://ceur-ws.org/Vol-2380/paper_174.pdf |volume=Vol-2380 |authors=Youngjun Joo,Inchon Hwang |dblpUrl=https://dblp.org/rec/conf/clef/JooH19 }} ==Author Profiling on Social Media: An Ensemble Learning Approach using Various Features== https://ceur-ws.org/Vol-2380/paper_174.pdf
        Author Profiling on Social Media: An Ensemble
           Learning Model using Various Features
                         Notebook for PAN at CLEF 2019

                              Youngjun Joo and Inchon Hwang

                             Department of Computer Engineering
                               Yonsei University, Seoul, Korea
                              {chrisjoo12, ich0103}@gmail.com



        Abstract We describe our participation in the PAN 2019 shared task on author
        profiling (AP), which is to determine whether a tweet’s author is a bot or a human
        and, if human, to identify the author’s gender, for English and Spanish datasets.
        In this paper, we investigate the complementarity of stylometry methods and
        content-based methods, putting forward various techniques for building flexible
        features. As a complement to these methods, we investigate an ensemble learning
        method as a way to improve performance on AP tasks. Experimental results
        demonstrate that an ensemble combining stylometry methods and content-based
        methods captures author profiles more accurately than traditional methods. Our
        proposed model obtained accuracies of 0.9333 and 0.8352 in the bot and gender
        identification tasks, respectively, on the English test dataset.


1     Introduction

Author profiling (AP) deals with the classification of shared content in order to
predict general or demographic attributes of authors such as gender, age, personality,
native language, or political orientation, among others [17]. Being able to infer an
author’s profile has wide applicability and has proved advantageous in areas such as
marketing, forensics, and security.
    Broadly speaking, approaches to AP view the task as a multi-class or single-label
classification problem in which the set of class labels is known a priori [20].
Thus, AP is modeled as a classification task, in which automatic detection methods
have to assign labels (e.g., male, female) to objects (texts). Consequently, most work
has been devoted to determining a suitable set of features that capture the writing
profile of authors. In the 2019 shared AP task at PAN [2], the goal is to infer whether
the author of a Twitter feed is a bot or a human and, if human, to profile the author’s
gender [16]. Both training and test data are provided in two languages: English and
Spanish.

    Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons
    License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano,
    Switzerland.
    In order to predict bot and gender labels, we exploit the complementarity of
stylometry methods and content-based methods, putting forward various techniques for
building flexible features (basic count features, psycholinguistic features, TF-IDF, Doc2vec).
As a complement to these features, we also investigate an ensemble learning method that
combines classification methods based on these features with a BERT model, as a way to
improve performance on AP tasks.



2      Related Works


Approaches to AP can be broadly categorized into two types of methods:
(1) stylometry methods, which aim to capture an author’s writing style using different
statistical features (e.g., functional words, POS tags, punctuation marks, and emoticons),
and (2) content-based methods, which intend to identify an author’s profile based on the
content of the text (e.g., bag of words, word n-grams, term vectors, TF-IDF n-grams, slang
words, emotional words) and the topics discussed in the text (e.g., topic models such as
LDA and PLSA) [5]. According to the PAN¹ competitions, most successful works on AP
in social media have used combinations of these two kinds of features.
    An author’s writing style can be used to identify the author’s attributes. In previous
studies, style-based features were used to predict author attributes such as age and
gender [3,6,13,18,22]. In these methods, lexical word-based features represent text as a
sequence of tokens forming sentences, paragraphs, and documents. A token can be a
number, an alphabetic word, or a punctuation mark. These tokens are used to compute
statistics such as average sentence length, average word length, total number of words,
and total number of unique words. Character-based features, by contrast, consider the
text as a sequence of characters.
    Content-based methods employ specific words or special content that are used
more frequently in one domain than in other domains [23]. These words can be chosen
by correlating the meaning of words with the domain [8, 23] or by selecting them from a
corpus by frequency or by other feature selection methods [1]. An analysis of information
gain presented in [19] showed that the most relevant features for gender identification are
those related to content words (e.g., linux and office for identifying males, and
love and shopping for identifying females).
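
    As an illustration of frequency-based content weighting, the following is a minimal
sketch of TF-IDF (one of the content features named in the introduction). It uses the
plain, unsmoothed form tf * log(N / df) with whitespace tokenization; both choices are
assumptions made for illustration, and the function name tfidf is hypothetical rather
than taken from any of the cited systems.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: whitespace tokenization and the unsmoothed IDF variant.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

    With the two toy documents "linux office linux" and "love shopping office", the term
office occurs in both documents and therefore receives weight zero, while linux, which
occurs in only one, receives a positive weight.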
    Recently, some works have used deep learning models and representation learning
methods for AP [7, 9, 11, 21]. [7] used an approach based on subword character
n-gram embeddings and deep averaging networks (DAN). [9] used a model consisting
of a bi-RNN implemented with Gated Recurrent Units (GRU) combined with an attention
mechanism. [11] proposed two multi-layer neural network models for gender
identification and language variety identification in four languages.

 1
     https://pan.webis.de/
                  Figure 1. Illustration of text preprocessing and postprocessing.



3      Proposed Approach

3.1     Text Preprocessing and Postprocessing

The preprocessing of the text data is an essential step, as it makes the raw text ready
for machine learning algorithms. The objective of this step is to remove noise
that is less relevant to detecting author profiles in the texts.
    We first aggregate the tweets published by an individual user into one document
before training to alleviate the shortcomings of short texts. In order to utilize most
of the information in the text, we perform several transformations of the short texts
(i.e., XML parsing, contraction expansion, tokenization, stemming, lemmatization, and
stopword removal). In addition, to support word-level representation of most of
the text information, we perform spell correction for informal words using the SymSpell
library¹, word segmentation for splitting hashtags using the WordSegment library², and
annotation (surrounding or replacing tokens with special tags) as text postprocessing
tasks (see Figure 1).
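
    A few of these steps can be sketched with the Python standard library alone. This is
a minimal illustration under stated assumptions, not the pipeline used in our system: the
tag strings [url] and [user] are hypothetical placeholders, the contraction table is a
tiny illustrative sample, and the CamelCase hashtag splitter is a crude stand-in for
dictionary-based word segmentation. Spell correction, stemming, and lemmatization are
omitted.

```python
import re

# Hypothetical placeholder tags (assumption: not the tag names used in the paper).
URL_TAG, USER_TAG = "[url]", "[user]"

# Tiny illustrative contraction table (assumption: far from complete).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
CAMEL_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")  # boundary between lower and upper case


def split_hashtag(match):
    """Drop the '#' and split a CamelCase hashtag body into words."""
    return CAMEL_RE.sub(" ", match.group(1))


def normalize(tweet):
    """Annotate URLs and mentions, split hashtags, lowercase, expand contractions."""
    text = URL_RE.sub(URL_TAG, tweet)
    text = MENTION_RE.sub(USER_TAG, text)
    text = HASHTAG_RE.sub(split_hashtag, text)
    tokens = [CONTRACTIONS.get(tok, tok) for tok in text.lower().split()]
    return " ".join(tokens)


def aggregate(tweets_by_user):
    """Join each user's normalized tweets into a single document per user."""
    return {
        user: " ".join(normalize(t) for t in tweets)
        for user, tweets in tweets_by_user.items()
    }
```

    For example, aggregate({"u1": ["Hello @amy", "It's fun"]}) yields one document for
user u1 in which the mention is replaced by the placeholder tag and the contraction is
expanded.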


3.2     Basic Count Features

Previous works on AP tasks explore lexical, syntactic, and structural features. Lexical
features measure an author’s habits in using characters and words in the text.
Commonly used features of this kind include the number of characters, the number of
words, and the frequency of each type of character. Syntactic features include the use
of punctuation, part-of-speech (POS) tags, and functional words. Structural features
represent how authors organize their documents or use other special structures such as
greetings or signatures.
    As shown in Table 1, we construct a basic count feature set including punctuation
characters (e.g., question marks and exclamation marks) and other features (e.g., average
syllables per word, functional word count, special character count, capital ratio, etc.).

 1
     https://github.com/wolfgarbe/SymSpell
 2
     https://github.com/grantjenks/python-wordsegment
                 Table 1. List of basic count features extracted from Twitter.

          Average sentence length by character   Average sentence length by word
          Average syllables per word             Average word length
          Capital ratio                          Number of functional words
          Number of special characters           Character count
          Word count                             Hashtag count
          Direct tweet count                     Repeat punctuation count
          Punctuation count                      URL count
          Positive emoji count                   Negative emoji count
          Emoji count                            Slang words count
          Stopword count                          tag count
           tag count                       tag count
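
    A handful of the features in Table 1 can be computed as in the following minimal
sketch. The exact definitions are assumptions made for illustration (for instance,
capital ratio is taken here as uppercase letters divided by all letters, and words are
split on whitespace); the function name basic_counts and the dictionary keys are
hypothetical.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
REPEAT_PUNCT_RE = re.compile(r"([!?.])\1+")  # runs such as "!!!" or "..."


def basic_counts(text):
    """Compute a few count features for one document (assumed definitions)."""
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "capital_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "repeat_punctuation_count": len(REPEAT_PUNCT_RE.findall(text)),
        "hashtag_count": len(HASHTAG_RE.findall(text)),
        "url_count": len(URL_RE.findall(text)),
    }
```

    The resulting dictionary can be turned into a fixed-order numeric vector and
concatenated with the other feature groups before classification.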