=Paper=
{{Paper
|id=Vol-2380/paper_174
|storemode=property
|title=Author Profiling on Social Media: An Ensemble Learning Approach using Various Features
|pdfUrl=https://ceur-ws.org/Vol-2380/paper_174.pdf
|volume=Vol-2380
|authors=Youngjun Joo,Inchon Hwang
|dblpUrl=https://dblp.org/rec/conf/clef/JooH19
}}
==Author Profiling on Social Media: An Ensemble Learning Approach using Various Features==
https://ceur-ws.org/Vol-2380/paper_174.pdf
Author Profiling on Social Media: An Ensemble
Learning Model using Various Features
Notebook for PAN at CLEF 2019
Youngjun Joo and Inchon Hwang
Department of Computer Engineering
Yonsei University, Seoul, Korea
{chrisjoo12, ich0103}@gmail.com
Abstract We describe our participation in the PAN 2019 shared task on author
profiling, which requires determining whether the author of a Twitter feed is a bot
or a human and, in the case of a human, identifying the author’s gender for the
English and Spanish datasets. In this paper, we investigate the complementarity of
stylometry methods and content-based methods, putting forward various techniques
for building flexible features. As a complement to these methods, we investigate an
ensemble learning method that paves the way to improving the performance of AP
tasks. Experimental results demonstrate that an ensemble combining stylometry
methods and content-based methods can capture author profiles more accurately
than traditional methods. Our proposed model obtained accuracies of 0.9333 and
0.8352 in the bot and gender identification tasks on the English test dataset,
respectively.
1 Introduction
Author profiling (AP) deals with the classification of shared content in order to
predict general or demographic attributes of authors, such as gender, age, personality,
native language, or political orientation [17]. Being able to infer an author’s
profile has wide applicability and has proved advantageous in many areas such as
marketing, forensics, and security.
Broadly speaking, approaches that tackle AP view the task as a multi-class,
single-label classification problem, where the set of class labels is known a priori [20].
Thus, AP is modeled as a classification task in which automatic detection methods
assign labels (e.g., male, female) to objects (texts). Consequently, most work
has been devoted to determining a suitable set of features that capture the
writing profile of authors. In the 2019 shared AP task on the PAN dataset [2], the goal is to
infer whether the author of a Twitter feed is a bot or a human and, in the case of a
human, to profile the author’s gender [16]. Both training and test data are provided
in two languages: English and Spanish.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019,
Lugano, Switzerland.
In order to predict bot and gender, we exploit the complementarity of stylometry
methods and content-based methods, putting forward various techniques for building
flexible features (basic count features, psycholinguistic features, TF-IDF, Doc2vec).
As a complement to these features, we also investigate an ensemble learning method
that combines classifiers based on these various features with a BERT model, paving
the way to improving the performance of AP tasks.
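The ensemble idea can be sketched with a toy soft-voting combination of a content view (word n-gram TF-IDF) and a style-like view (character n-grams). The data, the two-view split, and the pipelines below are illustrative assumptions, not the exact system described in this paper:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: one aggregated Twitter feed per author; 1 = bot, 0 = human.
docs = [
    "win free money now click here click here",
    "had a lovely walk with my dog this morning",
    "free prize free prize follow and click here",
    "reading a good book and drinking coffee today",
]
labels = np.array([1, 0, 1, 0])

# Content view: word n-gram TF-IDF; style-like view: character n-grams.
word_clf = make_pipeline(TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000)).fit(docs, labels)
char_clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
                         LogisticRegression(max_iter=1000)).fit(docs, labels)

def soft_vote(texts):
    # Average the predicted class probabilities of the two views,
    # then pick the class with the highest averaged probability.
    avg = (word_clf.predict_proba(texts) + char_clf.predict_proba(texts)) / 2
    return avg.argmax(axis=1)

preds = soft_vote(["click here for a free prize", "my dog loves coffee"])
```

Averaging probabilities (soft voting) rather than hard label votes lets a confident view outweigh an uncertain one, which is one common way to combine heterogeneous feature sets.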
2 Related Work
Approaches for predicting an AP can be broadly categorized into two types of methods:
(1) stylometry methods, which aim to capture an author’s writing style using different
statistical features (e.g., functional words, POS, punctuation marks, and emoticons), and (2)
content-based methods, which intend to identify an author’s profile based on the content
of the text (e.g., bag of words, word n-grams, term vectors, TF-IDF n-grams, slang
words, emotional words) and the topics discussed in the text (e.g., topic models such as
LDA and PLSA) [5]. Across the PAN1 competitions, the most successful approaches to AP
in social media have combined these two kinds of features.
An author’s writing style can be used to identify his or her attributes. Previous
studies used style-based features to predict an author’s attributes such as age and
gender [3,6,13,18,22]. In these methods, lexical word-based features represent text as a
sequence of tokens forming sentences, paragraphs, and documents. A token can be a
number, an alphabetic word, or a punctuation mark. These tokens are then used to compute
statistics such as average sentence length, average word length, total number of words,
and total number of unique words. Character-based features instead treat the
text as a sequence of characters.
Content-based methods employ specific words or special content used more frequently
in a given domain than in others [23]. These words can be chosen
by correlating the meaning of words with the domain [8, 23] or selected from a corpus
by frequency or by other feature selection methods [1]. An analysis of information gain
presented in [19] showed that the most relevant features for gender identification are
those related to content words (e.g., linux and office for identifying males, whereas
love and shopping identify females).
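The information-gain analysis mentioned above can be reproduced on toy data. The sketch below computes IG(t) = H(C) − H(C | t) for a term-presence indicator t; the corpus and labels are hypothetical, loosely following the example from [19], and are not the actual data used there:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label sequence, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Information gain of a term-presence indicator w.r.t. the class labels."""
    present = [term in doc.split() for doc in docs]
    conditional = 0.0
    for value in (True, False):
        subset = [l for l, p in zip(labels, present) if p == value]
        if subset:
            conditional += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical toy corpus with gender labels.
docs = ["i use linux daily", "love shopping today",
        "linux kernel update", "love my shopping haul"]
labels = ["male", "female", "male", "female"]

ig_linux = information_gain(docs, labels, "linux")  # perfectly predictive here
ig_daily = information_gain(docs, labels, "daily")  # only weakly predictive
```

In this toy setting "linux" separates the classes perfectly (IG = 1 bit), while "daily" occurs in only one document and carries much less information about the label.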
Recently, some works have applied deep learning models and representation learning
to AP [7, 9, 11, 21]. [7] used an approach based on subword character
n-gram embeddings and deep averaging networks (DANs). [9] used a model consisting
of a bidirectional RNN implemented with Gated Recurrent Units (GRUs) combined with an
attention mechanism. [11] proposed two neural models, each with multiple layers, for
gender identification and language variety identification across four languages.
1 https://pan.webis.de/
Figure 1. Illustration of text preprocessing and postprocessing.
3 Proposed Approach
3.1 Text Preprocessing and Postprocessing
Preprocessing the text data is an essential step, as it makes the raw text ready for
applying machine learning algorithms. The objective of this step is to remove noise
that is less relevant for detecting the AP in the texts.
We first aggregate the tweets published by an individual user into one document
before training, to alleviate the shortcomings of short texts. In order to utilize most
of the information in the text, we perform several transformation tasks on the short texts (i.e.,
XML parsing, contraction unfolding, text tokenizing, stemming, lemmatization, and
stopword removal). Also, for word-level representation of most of
the text information, we perform spell correction of informal words using the SymSpell
library1, word segmentation to split hashtags using the WordSegment library2, and
annotation (surrounding or replacing certain tokens with special tags)
as a text postprocessing task (see Figure 1).
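A minimal standard-library sketch of this pipeline is shown below. The contraction table and stopword list are illustrative stand-ins for the full resources; the paper itself relies on SymSpell and WordSegment for spell correction and hashtag segmentation, which are omitted here:

```python
import re

# Illustrative stand-ins for the full resources used in the paper.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
STOPWORDS = {"a", "an", "the", "is", "of", "to"}

def preprocess_feed(tweets):
    """Aggregate one user's tweets into a single document and normalize it."""
    text = " ".join(tweets).lower()               # feed aggregation
    for short, full in CONTRACTIONS.items():      # contraction unfolding
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z']+|[#@]\w+", text)  # simple tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal
```

For example, `preprocess_feed(["I'm happy #GoodDay", "The cat is here"])` yields `["i", "am", "happy", "#goodday", "cat", "here"]`.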
3.2 Basic Count Features
Previous works on AP tasks explore lexical, syntactic, and structural features. Lexical
features measure the habits of using characters and words in the text.
Commonly used features of this kind include the number of characters and words, the
frequency of each type of character, etc. Syntactic features include the use of punctuation,
part-of-speech (POS) tags, and functional words. Structural features represent
how authors organize their documents and other special structures such as greetings
or signatures.
As shown in Table 1, we construct a basic count feature set including punctuation
characters (e.g., question marks and exclamation marks) and other features (e.g., average
syllables per word, functional word count, special character count, capital ratio, etc.).
1 https://github.com/wolfgarbe/SymSpell
2 https://github.com/grantjenks/python-wordsegment
Table 1. List of basic count features extracted from the Twitter feeds.

Average sentence length by character   Average sentence length by word
Average syllable per word              Average word length
Capital ratio                          Number of functional words
Number of special characters           Character count
Word count                             Hashtag count
Direct tweet count                     Repeat punctuation count
Punctuation count                      URL count
Positive emoji count                   Negative emoji count
Emoji count                            Slang word count
Stopword count                         Annotation tag counts
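A handful of the Table 1 counts can be sketched as follows. This is an illustrative subset with simplified regular expressions, not the exact extraction code used in the paper:

```python
import re
import string

def basic_count_features(text):
    """Compute a small subset of the Table 1 count features for one text."""
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "capital_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "punctuation_count": sum(c in string.punctuation for c in text),
        "hashtag_count": len(re.findall(r"#\w+", text)),
        "url_count": len(re.findall(r"https?://\S+", text)),
    }

feats = basic_count_features("Check THIS out! #wow https://t.co/x")
```

Each feed is thus mapped to a fixed-length numeric vector that can be concatenated with the other feature views before classification.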