<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-Genre Author Profile Prediction Using Stylometry-Based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shaina Ashraf</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hafiz Rizwan Iqbal</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rao Muhammad Adeel Nawab</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, COMSATS Institute of Information Technology</institution>
          ,
          <addr-line>Lahore</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>4</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>The author profiling task aims to identify different traits of an author by analyzing his/her written text. This study presents a stylometry-based approach for detecting author traits (gender and age) in cross-genre author profiles. The proposed approach uses different types of stylistic features, including 7 lexical features, 16 syntactic features, 26 character-based features and 6 vocabulary richness measures (56 stylistic features in total). On the training corpus, the proposed approach obtained promising results, with an accuracy of 0.787 for gender, 0.983 for age and 0.780 for both (jointly detecting age and gender). On the test corpus, the proposed system achieved an accuracy of 0.576 for gender, 0.371 for age and 0.256 for both.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The main idea behind author profiling is to determine the traits of a writer from
his/her written text. By analyzing the text, we can predict different characteristics of
an author, for example age, gender, native language, qualification and
personality traits [1]. Writing style reflects the profile of an author and
provides valuable information about his/her demographics. Identifying these author
traits can be very helpful in different applications, e.g. forensic analysis, security,
intelligent marketing decisions, sentiment analysis and classification [2].</p>
      <p>In this paper, we present an approach based on different types of stylistic features. In
total, we applied 56 stylistic features, divided into four categories:
lexical, syntactic, character-based and vocabulary richness measures. The
reason for selecting this methodology is that the training and test datasets come from
different genres, i.e. training was done on Twitter data and evaluation was
performed on genres other than tweets. We expect that features capturing an
author’s writing style will yield good results even when the training and test data are
of different types.</p>
      <p>The problem of gender and age identification is treated as a supervised document
classification task. Different machine learning algorithms, including J48, Random
Forest and LADTree, were explored for the classification task. Various feature selection
methods, including Best-First and Ranker, were also investigated to identify the
best subset of the 56 features. The best results on the training data
were obtained with the LADTree machine learning algorithm when all 56
features were used for the gender and age identification task. The trained system
was deployed on TIRA [11] for final evaluation on the test dataset(s). A comparison of our
system with those of other participants is given in [12].</p>
      <p>The rest of this paper is organized as follows: Section 2 describes related work. Section 3
describes the proposed approach. Section 4 presents the experimental setup. Section
5 discusses the results and their analysis. Finally, Section 6 concludes the paper and
discusses directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Previous studies have commonly used stylometry-based features to identify an
author’s traits from his/her writing style. For example, one of the pioneering studies in author
profiling [3] explored linguistic patterns in authors’ writing styles that can
help identify different author traits such as personality attributes, gender and
age group. The authors analyzed Part-Of-Speech tags to derive different stylistic
features (i.e. function words, prepositions, pronouns, auxiliary verbs) and
reported accuracies of 72% and 66% for gender and
age identification respectively. Argamon et al. [4] identified the demographics of an
author by combining different features (function words with POS tags) and
obtained an accuracy of 80% for gender identification. In [5, 6], the authors presented a
set of features such as word unigrams, function words, non-dictionary words and hyperlinks
for detecting the age and gender of an author. The results showed 80% accuracy for
gender identification and 75% accuracy for age identification. In previous PAN
Author Profiling competitions, many submitted systems used stylistic features for
predicting age, gender and personality types [7-9].</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Approach</title>
      <p>Our proposed approach is based on stylometric features (stylometry is the study of linguistic style),
which help us capture a set of elements of writing. Since the writing style of one
author is likely to differ from that of others, these stylometric features can
be useful for discriminating between authors’ traits. The other reason for selecting
various types of stylistic features is that the training data is in one genre and the test
data is in another. Stylistic features were therefore expected to identify author traits
accurately even when the system is trained and tested on different types of data.</p>
      <p>Our proposed approach combines different types of stylistic features, including
lexical, syntactic, vocabulary richness and character-based features. The following sections
describe these feature types in more detail.</p>
      <sec id="sec-3-1">
        <title>Lexical Features</title>
        <p>Lexical features represent text as a sequence of tokens forming sentences, paragraphs
and documents. A token can be a number, an alphabetic word or a punctuation
mark. These tokens are used to compute statistics such as average sentence length and average
word length [5]. These features can be extracted from text in any
language without special requirements. In our proposed system, we have implemented
7 lexical features: (1) average sentence length in characters, (2) average sentence
length in words, (3) average word length, (4) percentage of question sentences, (5)
total number of words, (6) total number of unique words and (7) ratio of words of length 3.</p>
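        <p>The seven lexical features listed above can be sketched as follows. This is only an illustrative sketch and not the authors’ implementation: the function name lexical_features, the regular expressions used for sentence and word splitting, and the reading of feature (7) as the share of words that are exactly three characters long are all assumptions.</p>
        <preformat><![CDATA[
```python
import re


def lexical_features(text):
    """Return seven lexical features for the plain text of one author profile."""
    # Assumption: a sentence is a run of text closed by '.', '!' or '?'.
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]
    # Assumption: a word is a maximal run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    n_sent = max(len(sentences), 1)   # avoid division by zero on empty input
    n_words = max(len(words), 1)
    return {
        "avg_sentence_len_chars": sum(len(s) for s in sentences) / n_sent,
        "avg_sentence_len_words": len(words) / n_sent,
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "pct_question_sentences": 100.0 * sum(s.endswith("?") for s in sentences) / n_sent,
        "total_words": len(words),
        "unique_words": len({w.lower() for w in words}),
        "ratio_words_len_3": sum(len(w) == 3 for w in words) / n_words,
    }
```
]]></preformat>
        <p>Calling lexical_features on the concatenated tweets of one author would yield one row of lexical values for that author profile.</p>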
      </sec>
      <sec id="sec-3-2">
        <title>Syntactic Features</title>
        <p>Syntactic features consist of function words and part-of-speech tags. Syntactic
patterns vary significantly from one author to another. These features are extracted
using accurate and robust text analysis tools, i.e. part-of-speech taggers,
chunkers and lemmatizers. In our proposed system, we have used the Stanford Log-linear
Part-Of-Speech Tagger<sup>1</sup> to extract syntactic features. The proposed
approach contains 16 syntactic features: (1) number of adjectives, (2) number of
nouns, (3) number of adverbs, (4) number of verbs, (5) number of cardinal numbers,
(6) number of prepositions, (7) number of particles, (8) number of symbols, (9) number
of conjunctions, (10) number of determiners, (11) number of interrogatives, (12)
number of foreign words, (13) number of pronouns, (14) POS unigram density (see
Equation 1), (15) POS bigram density (see Equation 2), (16) POS trigram density (see
Equation 3).</p>
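        <p>As a rough illustration of features (1) to (13), the sketch below counts tag categories in a text that has already been tagged with Penn Treebank tags. The name pos_count_features and the mapping from each feature to tag prefixes are assumptions made for this example, not the authors’ code. The three POS n-gram densities are left out because their defining equations are not reproduced here.</p>
        <preformat><![CDATA[
```python
from collections import Counter

# Assumed mapping from the 13 count features to Penn Treebank tag prefixes.
POS_GROUPS = {
    "adjectives": ("JJ",),                   # JJ, JJR, JJS
    "nouns": ("NN",),                        # NN, NNS, NNP, NNPS
    "adverbs": ("RB",),                      # RB, RBR, RBS
    "verbs": ("VB",),                        # VB, VBD, VBG, VBN, VBP, VBZ
    "cardinal_numbers": ("CD",),
    "prepositions": ("IN",),
    "particles": ("RP",),
    "symbols": ("SYM",),
    "conjunctions": ("CC",),
    "determiners": ("DT",),
    "interrogatives": ("WDT", "WP", "WRB"),  # WP also matches WP$
    "foreign_words": ("FW",),
    "pronouns": ("PRP",),                    # PRP, PRP$
}


def pos_count_features(tags):
    """Count how many tags of a tagged text fall into each category."""
    counts = Counter(tags)
    return {
        name: sum(n for tag, n in counts.items() if tag.startswith(prefixes))
        for name, prefixes in POS_GROUPS.items()
    }
```
]]></preformat>
        <p>The input is a flat list of tags, so any tagger that emits Penn Treebank tags could feed this function.</p>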
      </sec>
      <sec id="sec-3-3">
        <title>Vocabulary Richness</title>
        <p>Every piece of text is composed of a set of unique words called its vocabulary.
Vocabulary richness functions try to measure the diversity of the vocabulary in a given
text, i.e. how rich the vocabulary is [10]. The simplest and most common examples of
vocabulary richness measures are hapax legomena (the number of words occurring exactly once) and the
type-token ratio V/N, where V is the number of unique words in the text and N is the
total number of words in the same text. The size of a text directly affects its
vocabulary size, i.e. smaller documents have fewer unique words while
larger ones have more. To account for the influence of
text size on vocabulary richness measures, a number of formulas have been used. In
our proposed system, we have implemented 6 vocabulary richness measures.</p>
        <p>[...] apostrophe count, (9) ratio of upper-case letters, (10) brackets count, (11) ratio of
white-spaces to N (total number of characters in an author profile), (12) colon count, (13)
ratio of tabs to N, (14) comma count, (15) ratio of special characters to N, (16) dash
count, (17) number of upper-case characters, (18) ellipsis count, (19) digit count, (20)
exclamation count, (21) number of white-spaces, (22) full-stop count, (23) number of
tabs, (24) question-mark count, (25) semicolon count, (26) slash count.</p>
        <p><sup>1</sup> http://nlp.stanford.edu/software/tagger.shtml (last visited: 25-05-2016)</p>
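        <p>The two vocabulary richness examples named in this section (type-token ratio and hapax legomena) and a handful of the character-based counts can be sketched as below. The function names, the lower-casing of tokens and the decision to count only the space character as white-space are assumptions of this sketch. The remaining vocabulary richness measures are not named in the text and are therefore not shown.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def vocabulary_richness(tokens):
    """Type-token ratio V/N and hapax legomena count for a list of word tokens."""
    counts = Counter(t.lower() for t in tokens)  # assumption: case-insensitive types
    n = len(tokens)                              # N: total number of words
    v = len(counts)                              # V: number of unique words
    hapax = sum(1 for c in counts.values() if c == 1)
    return {"type_token_ratio": v / n if n else 0.0, "hapax_legomena": hapax}


def character_features(text):
    """A subset of the listed character-based features; N is the character total."""
    n = len(text) or 1                           # avoid division by zero
    return {
        "apostrophe_count": text.count("'"),
        "ratio_upper_case": sum(ch.isupper() for ch in text) / n,
        "ratio_whitespace": text.count(" ") / n,
        "colon_count": text.count(":"),
        "ratio_tabs": text.count("\t") / n,
        "comma_count": text.count(","),
        "digit_count": sum(ch.isdigit() for ch in text),
        "exclamation_count": text.count("!"),
        "question_mark_count": text.count("?"),
        "semicolon_count": text.count(";"),
    }
```
]]></preformat>
        <p>Because V/N tends to fall as a text grows, the raw ratio is best compared between author profiles of similar length.</p>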
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <sec id="sec-4-1">
        <title>Training Corpus</title>
        <p>We used pan16-training-dataset-english to train our proposed system (we did
not attempt the author profiling task for the other languages, i.e. Dutch and Spanish). The
training corpus for English is composed of tweets and contains 436
author profiles (see Table 4.1 for detailed statistics). The goal is to identify two author
traits: (1) gender and (2) age. The gender identification task aims to discriminate between
two classes: (1) male and (2) female, whereas the age identification task aims to
discriminate between five classes: (1) 18-24, (2) 25-34, (3) 35-49, (4) 50-64 and
(5) 65-xx.</p>
        <p>We pre-processed both the training and test datasets by removing XML tags, HTML tags etc.
and used only the plain text for experimentation.</p>
        <p>The task of identifying an author’s gender and age from his/her text is cast as a
supervised document classification task. For gender identification, we performed
binary classification, i.e. the goal is to distinguish between two classes: (1) male and
(2) female. For age identification, we performed multi-class classification, i.e.
the goal is to assign one of five age classes: (1) 18-24, (2) 25-34, (3) 35-49, (4) 50-64 and
(5) 65-xx. We used 10-fold cross-validation for the experiments. We explored
multiple classifiers, including J48, Random Forest and LADTree, to train and test our
proposed system. The numeric values generated by the 56 stylometric features
(see Section 3) were used as input to these classifiers. Evaluation is carried out using
the accuracy measure for both the age and gender identification tasks.</p>
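        <p>To make the 10-fold cross-validation protocol concrete, the following sketch splits the author profiles into ten folds and pools the correct predictions over all folds. It is a simplified stand-in and not the toolkit procedure used in the experiments: the names k_fold_indices and cross_validated_accuracy, the unstratified random split and the caller-supplied fit and predict callables are all assumptions.</p>
        <preformat><![CDATA[
```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(features, labels, fit, predict, k=10, seed=0):
    """Accuracy pooled over all folds; every item is predicted exactly once."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k, seed):
        # fit and predict are hypothetical callables wrapping any classifier.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        predicted = predict(model, [features[i] for i in test])
        correct += sum(p == labels[i] for p, i in zip(predicted, test))
    return correct / len(labels)
```
]]></preformat>
        <p>With 436 training profiles, this split gives six folds of 44 profiles and four folds of 43.</p>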
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Analysis</title>
      <p>On the final evaluation corpus (pan16-test-dataset2-english), our proposed approach
obtained an accuracy of 0.371 for age, 0.576 for gender and 0.256 for both. These results
are much lower than those on the training corpus. The likely
reason is that the proposed system is trained on one genre (tweets) and tested
on another (blogs, reviews, social media etc.). The effect of this genre mismatch is
reflected in the difference between the accuracy scores on the training and test datasets.
The proposed system gives very high
accuracy for age on the training dataset (0.983), which drops to 0.371 on the test dataset. On the other hand, the
accuracy for gender is lower than that for age on the training dataset (0.787), but higher on the
test dataset (0.576). This shows that models trained on one genre may not give the
same pattern of performance when they are evaluated on a dataset that contains author
profiles from a different genre.</p>
      <p><sup>2</sup> Note that we also applied feature selection on the set of 56 features, but it did not improve
performance. The best results were obtained when all 56 features were used for age and gender
identification.</p>
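      <p>The three reported scores (gender, age and both) can be computed as in the sketch below, which assumes that a profile counts towards "both" only when its gender and its age group are predicted correctly at the same time. The function name profile_accuracies and the pair-based data layout are illustrative assumptions.</p>
      <preformat><![CDATA[
```python
def profile_accuracies(gold, predicted):
    """gold and predicted are equally long lists of (gender, age_group) tuples."""
    n = len(gold)
    pairs = list(zip(gold, predicted))
    gender = sum(g[0] == p[0] for g, p in pairs) / n
    age = sum(g[1] == p[1] for g, p in pairs) / n
    # Joint score: both traits must be correct for the same author.
    both = sum(g == p for g, p in pairs) / n
    return {"gender": gender, "age": age, "both": both}
```
]]></preformat>
      <p>Under this definition the joint accuracy can never exceed the lower of the two single-trait accuracies, which agrees with the reported figures (0.256 against 0.371 and 0.576 on the test data).</p>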
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented an approach based on different types of stylistic features
for identifying two author traits, i.e. gender and age. The proposed system uses
a total of 56 stylistic features, including 7 lexical features, 16 syntactic features, 26
character-based features and 6 vocabulary richness measures. The system was trained
using all 56 features, and different machine learning algorithms were explored,
including Random Forest, J48 and LADTree. Using the proposed approach, promising
results were obtained on the training dataset (0.983 for age, 0.787 for gender and
0.780 for both, i.e. jointly identifying age and gender). On the test dataset, the proposed
approach obtained an accuracy of 0.371 for age, 0.576 for gender and 0.256 for both.</p>
      <p>In future, we plan to combine other features, for example content-based and topic-based
features, with stylistic features for the cross-genre author profiling task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation><string-name><surname>Guthrie</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Guthrie</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Wilks</surname>, <given-names>Y.</given-names></string-name>: <article-title>An Unsupervised Approach for the Detection of Outliers in Corpora</article-title>. <source>LREC</source> (<year>2008</year>)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation><string-name><surname>Abbasi</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>H.</given-names></string-name>: <article-title>Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace</article-title>. In: <source>ACM Transactions on Information Systems (TOIS)</source>. <volume>26</volume>(<issue>2</issue>): p. <fpage>7</fpage> (<year>2008</year>)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation><string-name><surname>Argamon</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Koppel</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Pennebaker</surname>, <given-names>J. W.</given-names></string-name>, <string-name><surname>Schler</surname>, <given-names>J.</given-names></string-name>: <article-title>Automatically profiling the author of an anonymous text</article-title>. In: <source>Communications of the ACM</source>. <volume>52</volume>(<issue>2</issue>): p. <fpage>119</fpage>-<lpage>123</lpage> (<year>2009</year>)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation><string-name><surname>Argamon</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Koppel</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Fine</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Shimoni</surname>, <given-names>A. R.</given-names></string-name>: <article-title>Gender, genre, and writing style in formal written texts</article-title>. In: <source>Text-The Hague Then Amsterdam Then Berlin</source>. <volume>23</volume>(<issue>3</issue>): p. <fpage>321</fpage>-<lpage>346</lpage> (<year>2003</year>)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation><string-name><surname>Stamatatos</surname>, <given-names>E.</given-names></string-name>: <article-title>A survey of modern authorship attribution methods</article-title>. In: <source>Journal of the American Society for information Science and Technology</source>. <volume>60</volume>(<issue>3</issue>): p. <fpage>538</fpage>-<lpage>556</lpage> (<year>2009</year>)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation><string-name><surname>Stamatatos</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Fakotakis</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Kokkinakis</surname>, <given-names>G.</given-names></string-name>: <article-title>Computer-based authorship attribution without lexical measures</article-title>. In: <source>Computers and the Humanities</source>. <volume>35</volume>(<issue>2</issue>): p. <fpage>193</fpage>-<lpage>214</lpage> (<year>2001</year>)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation><string-name><surname>Rangel</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Rosso</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Potthast</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Stein</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Daelemans</surname>, <given-names>W.</given-names></string-name>: <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>. In: <source>CLEF 2015 Working Notes</source>. <publisher-name>CEUR</publisher-name> (<year>2015</year>)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation><string-name><surname>Rangel</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Rosso</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Potthast</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Trenkmann</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Stein</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Verhoeven</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Daelemans</surname>, <given-names>W.</given-names></string-name>: <article-title>Overview of the 2nd author profiling task at PAN 2014</article-title>. In: <source>CLEF Evaluation Labs and Workshop</source> (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation><string-name><surname>Rangel</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Rosso</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Koppel</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Stamatatos</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Inches</surname>, <given-names>G.</given-names></string-name>: <article-title>Overview of the author profiling task at PAN 2013</article-title>. In: <source>CLEF Conference on Multilingual and Multimodal Information Access Evaluation</source>. CELCT (<year>2013</year>)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation><string-name><surname>Toutanova</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Manning</surname>, <given-names>C.</given-names></string-name>: <article-title>Enriching the knowledge sources used in a maximum entropy part-of-speech tagger</article-title>. In: <source>Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13</source>. <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2000</year>)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation><string-name><surname>Gollub</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Stein</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Burrows</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Hoppe</surname>, <given-names>D.</given-names></string-name>: <article-title>TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments</article-title>. In: <string-name><surname>Tjoa</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Liddle</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Schewe</surname>, <given-names>K.D.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>X.</given-names></string-name> (eds.) <source>9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA</source>. pp. <fpage>151</fpage>-<lpage>155</lpage>. <publisher-name>IEEE</publisher-name>, <publisher-loc>Los Alamitos, California</publisher-loc> (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation><string-name><surname>Rangel</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Rosso</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Verhoeven</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Daelemans</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Potthast</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Stein</surname>, <given-names>B.</given-names></string-name>: <article-title>Evaluations Concerning Cross-genre Author Profiling</article-title>. In: <source>Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org</source> (<year>2016</year>)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>