      Author Profiling Using Corpus Statistics,
          Lexicons and Stylistic Features
                 Notebook for PAN at CLEF-2013

              Maria De-Arteaga, Sergio Jimenez, George Dueñas,
                     Sergio Mancera and Julia Baquero

              Universidad Nacional de Colombia, Bogotá, Colombia
       [mdeg|sgjimenezv|geduenasl|samanceran|jmbaquerov]@unal.edu.co



      Abstract. This paper describes our participation in the author profiling
      task of the 9th PAN evaluation lab. The proposed approach relies on the
      extraction of stylistic, lexicon-based and corpus-based features, which
      were combined with a logistic classifier. These three sets of features
      overlap pairwise, and some features belong to all three categories. A
      comprehensive comparison of the contribution of several feature subsets
      is presented; in particular, a set of features based on Bayesian
      inference provided the most important contribution. We developed our
      system on the Spanish training corpus; once developed, it was used, with
      minor changes, for the English documents as well. The proposed system
      was ranked 6th among the 17 submitted systems in the official ranking
      for Spanish documents. This result shows that our approach is meaningful
      and competitive for predicting demographics from text.
      Keywords: author profiling, gender prediction, age prediction


1   Introduction

Due to the large amount of textual information on the internet, it is now possible
to address a variety of research problems about texts, whether in connection with
their authors, the registers involved, or the varieties of text, among others. In
the framework of the CLEF 2013 international conference [9], we focused our study
on the task of predicting demographic information about authors from texts written
in Spanish or English by people of different age ranges and genders.
    In order to identify the author profile from written texts, the use of stylistic
and content features is a common practice [1, 2, 8, 10]. However, other researchers
prefer to focus only on stylistic features [6]. Function words and part-of-speech
categories are the main style-based features proposed for distinguishing the gender
and age of authors [1, 2, 6, 8, 10]. Other stylistic features included in these
inventories are the typical blog features [10], grammatical and orthographic
errors [1], morphological, syntactic and structural attributes, and other stylistic
characteristics extracted using the Linguistic Inquiry and Word Count (LIWC)
program [8]. The most common measure employed is the frequency of each feature,
normalized or not by the length of the document or other criteria. Cheng et al. [2]
also include measures such as Yule's K, Simpson's D, Sichel's S, Honoré's R and
entropy.
    The content-based features and the mechanisms used to select them also vary
from one author to another. Extracting words from the corpus and comparing them
across the classes of interest [1, 10], and using pre-established lists of words
[2, 8], are the principal mechanisms employed for selecting this type of feature.
    In our study, each document is represented in a vector space in which each
feature corresponds to one dimension, including stylistic and lexicon-based
attributes relevant for distinguishing the gender and age range of the authors.
Furthermore, we explore a new subset of features that involves the use of
statistical measures (corpus-statistics features). As shown in Fig. 1, these
three subsets of features overlap, and therefore some features are located in
more than one class. We used a machine-learning approach to build classification
models to produce the predictions. The details of the task, documents and
evaluation are presented in [9].
     In the remainder of the paper, we begin with a description of the features
(Section 2) and of the system used in this campaign (Section 3). Section 4 focuses
on the main results of our work, while the final sections present the discussion
and the conclusions that can be drawn from this study.


2   Features from Texts
The set of features extracted from each text contains components of one or more
of the following categories: ‘S’ (style), ‘C’ (corpus statistics) and ‘L’ (lexicon).
Fig. 1 shows a Venn diagram depicting the number of features extracted for each
category combination. In the following subsections these features are described,
and the labels in Fig. 1 are used to indicate their categories, e.g. ‘SL’ for
features belonging to both the Style and Lexicon categories. In addition, the
features in the ‘C’ category are presented separately according to their
supervised or unsupervised nature.


[Fig. 1 here: Venn diagram of the three feature categories Style (S), Corpus
Statistics (C) and Lexicon (L), annotated with the number of features in each
region for English (en) and Spanish (es).]
     Fig. 1. Categories with their number of features by category and language
2.1   Unsupervised Corpus Statistics

This set of 6 features is built from statistics gathered from the training corpus,
ignoring the age and gender categories associated with each document. These
corpus-based statistics use the collection and document frequencies of the words
in the entire English and Spanish training collections. The motivation for using
document frequencies is to prevent very long documents from generating biased
results.

IR features (2 ‘C’ features). Using the tf.idf term-weighting approach used in
the information retrieval field, we obtained two features:
$IDF(d) = \frac{\sum_{w \in d} idf(w)}{len(d)}$ and
$TF.IDF(d) = \frac{\sum_{w \in d} tf(w,d) \cdot idf(w)}{len(d)}$,
where $len(d)$ is the number of words in the document $d$,
$idf(w) = \log \frac{D}{df(w)}$, $df(w)$ is the number of documents where the
word $w$ occurs, $D$ is the number of documents in the corpus and $tf(w,d)$ is
the number of times that $w$ occurs in the document $d$. The tf.idf weight
measures the informative character (for retrieval purposes) of the words given
a particular document and the whole corpus. Thus, these features measure the
density of that notion for each document.
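
As an illustration, the following minimal Python sketch (not the code actually
used by the system) computes these two features, assuming a dictionary df with
the document frequency of each word and the corpus size num_docs; the sum over
$w \in d$ is interpreted here as ranging over the distinct words of the document.

```python
import math
from collections import Counter

def ir_features(doc_tokens, df, num_docs):
    """IDF(d) and TF.IDF(d) for one document, given corpus document
    frequencies df (word -> number of documents containing it) and the
    total number of documents num_docs."""
    length = len(doc_tokens)
    tf = Counter(doc_tokens)
    idf = {w: math.log(num_docs / df[w]) for w in tf if w in df}
    idf_d = sum(idf.values()) / length
    tfidf_d = sum(tf[w] * idf[w] for w in idf) / length
    return idf_d, tfidf_d
```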

Entropy (2 ‘C’ features) measures the amount of information in a set of random
variables, i.e. the occurrences of words in a document. The probability of
occurrence of a word is given by $P_f(w) \approx \frac{f(w)}{M}$, where $f(w)$
is the number of occurrences of $w$ in the corpus and $M$ is the total number
of words in the corpus. Alternatively, these probabilities can be obtained from
document frequencies by $P_{df}(w) \approx \frac{df(w)}{D}$. Thus, the entropy
of a document is given by
$H_f(d) = -\sum_{w \in d} P_f(w) \cdot \log_2(P_f(w))$. $H_{df}(d)$ is obtained
with the same formula but using $P_{df}(w)$.

Kullback-Leibler (KL) divergence (1 ‘C’ feature) measures the information loss
when a document probability distribution $Q$ is used to approximate the “true”
corpus distribution $P$. The probability $Q$ for a word in a document is given
by $Q_d(w) \approx \frac{tf(w,d)}{len(d)}$. The corpus probability distribution
$P$ is given by $P_d(w) \approx \frac{f(w)}{\sum_{v \in d} f(v)}$. Thus, the KL
divergence of a document is given by
$KL(P_d \,\|\, Q_d) = \sum_{w \in d} P_d(w) \cdot \ln\left(\frac{P_d(w)}{Q_d(w)}\right)$.


Cross entropy (1 ‘C’ feature), similarly to the KL divergence, compares $P$ and
$Q$, measuring the ability of the former to predict the latter. The cross
entropy of a document is given by the following expression:
$H(P_d, Q_d) = -\sum_{w \in d} P_d(w) \cdot \log_2(Q_d(w))$.
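
The three measures above can be sketched in Python as follows; this is only an
illustrative reading of the formulas (not the authors' implementation), assuming
the corpus word frequencies f and the corpus size M have already been gathered.
The document-frequency variants follow the same pattern with $df(w)/D$.

```python
import math
from collections import Counter

def unsupervised_stats(doc_tokens, f, M):
    """Entropy, KL divergence and cross entropy of one document, from corpus
    word frequencies f (word -> count in the whole corpus) and corpus size M."""
    tf = Counter(doc_tokens)
    words = [w for w in tf if w in f]
    p_corpus = {w: f[w] / M for w in words}              # P_f(w)
    q_doc = {w: tf[w] / len(doc_tokens) for w in words}  # Q_d(w)
    norm = sum(f[w] for w in words)
    p_doc = {w: f[w] / norm for w in words}              # P_d(w)
    entropy = -sum(p * math.log2(p) for p in p_corpus.values())
    kl = sum(p_doc[w] * math.log(p_doc[w] / q_doc[w]) for w in words)
    cross = -sum(p_doc[w] * math.log2(q_doc[w]) for w in words)
    return entropy, kl, cross
```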
2.2     Supervised Corpus Statistics

Unlike the previous set of features, this collection was built taking into account
the age and gender of the authors of the training documents.

Gender score (2 ‘C’ features). We developed the gender score (GS), a measure
that aggregates the differences between the probabilities of a word $w$ estimated
in the sub-corpora of documents written by males and by females. Let
$P_f(w|male) \approx \frac{f_{male}(w)}{M_{male}}$ be the probability of $w$
estimated only in the sub-corpus written by males, where $f_{male}(w)$ is the
number of occurrences of $w$ in the “male” subset of the corpus and $M_{male}$
is the total number of words in that same subset. $P_f(w|female)$ is calculated
analogously. Thus, GS is given by
$GS_f(d) = \sum_{w \in d} \left(P_f(w|male) - P_f(w|female)\right)$. $GS_{df}$
is obtained using $P_{df}(w|male) \approx \frac{df_{male}(w)}{D_{male}}$, where
$df_{male}(w)$ is the number of documents written by males where $w$ occurs and
$D_{male}$ is the total number of documents written by males. Again,
$P_{df}(w|female)$ is calculated analogously.
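
A possible sketch of $GS_f$ follows, under the assumption that the per-gender
word frequencies and corpus sizes are available; the sum is taken over the
distinct words of the document, one possible reading of $w \in d$.

```python
def gender_score(doc_tokens, f_male, M_male, f_female, M_female):
    """GS_f(d): aggregated difference P_f(w|male) - P_f(w|female) over the
    words of the document. The GS_df variant replaces f_male(w)/M_male by
    df_male(w)/D_male (and analogously for females)."""
    score = 0.0
    for w in set(doc_tokens):
        score += f_male.get(w, 0) / M_male - f_female.get(w, 0) / M_female
    return score
```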

Bayes score (10 ‘C’ features). We proposed a score for each one of the five
demographic categories male, female, 10s, 20s and 30s using Bayes' theorem.
These scores are given by the expression
$BS_{f,cat.}(d) = \sum_{w \in d} P_f(cat.|w)$, with
$cat. \in \{male, female, 10s, 20s, 30s\}$,
$P_f(cat.|w) = \frac{P_f(w|cat.) \cdot P(cat.)}{P_f(w)}$ and
$P(cat.) \approx \frac{D_{cat.}}{D}$. $BS_{df,cat.}$ is obtained analogously but
using probabilities subscripted by $df$. This way, we obtained 10 features from
the 5 categories ($cat.$) and the 2 types of probabilities $P_f$ and $P_{df}$.
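
A sketch of the Bayes score is given below; f, f_cat and docs_per_cat are
hypothetical containers for the corpus statistics described in Section 3, and
the sum over $w \in d$ is again taken over the distinct words of the document.

```python
def bayes_scores(doc_tokens, f, f_cat, num_docs, docs_per_cat):
    """BS_{f,cat}(d) for the five demographic categories.
    f            : word -> occurrences in the whole training corpus
    f_cat        : category -> (word -> occurrences in that category)
    num_docs     : total number of training documents D
    docs_per_cat : category -> number of training documents D_cat"""
    M = sum(f.values())
    scores = {}
    for cat, freqs in f_cat.items():
        M_cat = sum(freqs.values())
        p_cat = docs_per_cat[cat] / num_docs       # P(cat) ~ D_cat / D
        s = 0.0
        for w in set(doc_tokens):
            if w in f and w in freqs:
                p_w = f[w] / M                     # P_f(w)
                p_w_cat = freqs[w] / M_cat         # P_f(w|cat)
                s += p_w_cat * p_cat / p_w         # P_f(cat|w), Bayes' theorem
        scores[cat] = s
    return scores
```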

Supervised KL divergence (5 ‘C’ features). The KL divergence can also be used
to build supervised attributes. In this case, it measures the information loss
when $Q_d$ is used to predict the probability distribution of the subset of
documents written by authors of the demographic category $cat.$. This
probability distribution is given by
$P_{d,cat.}(w) \approx \frac{f(w|cat.)}{\sum_{v \in d} f(v|cat.)}$, and the KL
divergences are given by
$KL(P_{d,cat.} \,\|\, Q_d) = \sum_{w \in d} P_{d,cat.}(w) \cdot \ln\left(\frac{P_{d,cat.}(w)}{Q_d(w)}\right)$.

Supervised cross entropy (5 ‘C’ features). As can be expected, the cross entropy
can also be calculated based on the probability distribution of each individual
demographic category. In this case, it measures how predictable $Q_d$ is when
using $P_{d,cat.}$. The equation is
$H(P, Q)_{cat.} = -\sum_{w \in d} P_{d,cat.}(w) \cdot \log_2(Q_d(w))$.

Supervised lexicon extraction using T-test (20 ‘CL’ features). The Student's
t-test, frequently used in text mining, allows us to determine the most
characteristic words of each demographic category by measuring the significance
of the differences in the occurrences of the words between the two gender
categories, or between an age category and the whole corpus. We used critical
values in the T-table to build five lexicons, one for each gender and age range.
These word lists contain the words that have an absolute T-value greater than 2
for the given category, which correspond to roughly the three percent most
relevant words of each demographic group. The construction method is different
for the gender and age categories; however, the following definitions are used
in both cases: $S = \sqrt{P_f(w) - P_f(w)^2}$ and
$S_{cat.} = \sqrt{P_f(w|cat.) - P_f(w|cat.)^2}$. In the gender T-function, as in
the gender score, values greater than zero are characteristic of males and
values less than zero indicate words more often used by females. This value is
given by
$T_g = \frac{P_f(w|male) - P_f(w|female)}{\sqrt{\frac{S_{male}^2}{D_{male}} + \frac{S_{female}^2}{D_{female}}}}$.
    Since the comparison cannot be made the same way when there are three
categories, a T-function was used for each age range, comparing the category
with the general corpus. This function is given by the following equation, where
$cat.$ can only be an age-range category:
$T_{cat.} = \frac{P_f(w|cat.) - P_f(w)}{\sqrt{\frac{S_{cat.}^2}{D_{cat.}} + \frac{S^2}{D}}}$.
This procedure provides 5 lexicons of words characterizing each demographic
category.
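
The following sketch shows how the gender T-function could be used to build the
male and female lexicons; the age-range lexicons are built analogously against
the whole corpus. Keeping the T value as the lexicon weight is our assumption
(the paper only states that the T-test lists provide a weight, see Section 2.3).

```python
import math

def gender_t_lexicons(vocab, p_male, p_female, d_male, d_female, threshold=2.0):
    """Words with |T_g| > threshold go to the male (T_g > 0) or female
    (T_g < 0) lexicon. p_male/p_female map words to P_f(w|male)/P_f(w|female);
    d_male/d_female are the numbers of documents written by each gender."""
    male_lex, female_lex = {}, {}
    for w in vocab:
        pm, pf = p_male.get(w, 0.0), p_female.get(w, 0.0)
        var = (pm - pm ** 2) / d_male + (pf - pf ** 2) / d_female
        if var <= 0:
            continue
        t = (pm - pf) / math.sqrt(var)
        if t > threshold:
            male_lex[w] = t
        elif t < -threshold:
            female_lex[w] = -t
    return male_lex, female_lex
```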

2.3   Lexicon-based Features

The 5 lexicons built using the T-test, as well as other pre-fabricated lexicons,
are used to generate 4 features each:

Lexical density (1 feature) is the ratio of content words to the total number
of words. Ure, according to Johansson, introduced it in order to distinguish
between words with lexical properties and those without [5]. The concept of
lexical density was further developed by Halliday, who defines it as “the
proportion of lexical items to the total words” [4]. If $l_i(d)$ is the number
of words in document $d$ that belong to the $i$-th lexicon, then
$LD_i(d) = l_i(d)/len(d)$.

Weighted density (1 feature). The Spanish Emotion Lexicon [11] and the lists
generated using the T-test provide a weight $I_i(w)$ for every word. The
weighted density is given by $WD_i(d) = \frac{\sum_{w \in d} I_i(w)}{len(d)}$.
For lexicons that do not provide weights, a weight of 1 was used.

Lexicon entropy (2 features). We calculate the entropy with respect to every
lexicon using the following equation:
$H_i(d) = -\sum_{w \in d \cap l_i} P_f(w) \cdot \log_2(P_f(w))$. The fourth
feature corresponds to the entropy calculated using $P_{df}(w)$.
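
The four per-lexicon features can be summarized in one helper as sketched below;
lexicon, p_f and weights are hypothetical inputs holding the word list of the
$i$-th lexicon, the corpus probabilities $P_f(w)$ and the optional weights
$I_i(w)$ (only the $P_f$ entropy variant is shown).

```python
import math

def lexicon_features(doc_tokens, lexicon, p_f, weights=None):
    """Lexical density LD_i, weighted density WD_i and lexicon entropy H_i
    of one document with respect to the i-th lexicon."""
    length = len(doc_tokens)
    in_lex = [w for w in doc_tokens if w in lexicon]
    density = len(in_lex) / length
    weighted = sum((weights or {}).get(w, 1.0) for w in in_lex) / length
    entropy = -sum(p_f[w] * math.log2(p_f[w]) for w in set(in_lex) if w in p_f)
    return density, weighted, entropy
```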

    The lexicons used and their sources are listed in Table 1. Manual
preprocessing was applied to some lexicons by removing duplicates and adding
the gender variants of some Spanish words. Twenty ‘CSL’ features result from
the 5 T-test lexicons; the two entropy-related attributes of the bad words,
Internet and stopwords lexicons add 6 more ‘CSL’ features, and, similarly,
their densities add six ‘SL’ features. For the remaining lexicons, their
entropies generate ‘CL’ features and their densities ‘L’ features. This results
in 22 ‘L’ and 22 ‘CL’ features for Spanish, and 8 ‘L’ and 8 ‘CL’ features for
English.
 Table 1. Websites where the lists of words were obtained (consulted in May 2013)
Lexicon Lang. # Words Source
Bad words en    458       urbanoalvarez.es/blog/2008/04/04/bad-words-list
Bad words es    2,147     rufadas.com/wp-content/uploads/2008/02/malsonantes.pdf
Cooking en/es 885/706 cocina.univision.com/recursos/glosario
Emotions en     3,487     eqi.org/fw.htm
Emotions es     2,036     www.cic.ipn.mx/ sidorov (6 lexicons)
Dictionary es   44,370    openthes-es.berlios.de
Dictionary1 es 14,720     dict-es.sourceforge.net
Internet es     1,567     www.techdictionary.com/chat cont1.html
Internet en/es 689        pc.net/glossary (same lexicon used for both languages)
Legal en        1,011     www.susana-translations.de/legal.htm
Love-sex es     95        www.elalmanaque.com/El Origen de las Palabras
Sports es/en    709/642 www.wikilengua.org/index.php/Glosario de deportes
Stopwords en/es 127/313 NLTK Stopwords Corpus


2.4   Stylistic Features
The stylistic features are classified into three subsets: character-based,
word-based, and syntactic features. The character-based subset contains 50
features, such as character density, uppercase and lowercase characters,
letters, and special characters like the asterisk. All of them, except the
letter count, have been used by other researchers for identifying the profile
of the author. The word-based features include 11 measures of vocabulary
richness, the length of words and the density of hapax legomena, dis legomena,
and 3- to 5-legomena. The syntactic features involve 9 attributes related to
punctuation, such as colons, semicolons and question marks, among others. We
also considered as stylistic features those obtained from the stopwords,
Internet and bad words lexicons.
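
The fragment below illustrates a handful of these stylistic counts
(character-based, word-based and syntactic); it is a simplified sketch rather
than the full 70-feature inventory used by the system.

```python
from collections import Counter

def stylistic_features(text, tokens):
    """A few illustrative character-based, word-based and syntactic counts."""
    counts = Counter(tokens)
    feats = {
        "uppercase": sum(c.isupper() for c in text),
        "lowercase": sum(c.islower() for c in text),
        "asterisks": text.count("*"),
        "question_marks": text.count("?"),
        "semicolons": text.count(";"),
        "colons": text.count(":"),
        "avg_word_length": sum(map(len, tokens)) / len(tokens),
        "type_token_ratio": len(counts) / len(tokens),  # vocabulary richness
    }
    for k in range(1, 6):  # density of hapax ... 5-legomena
        feats[f"{k}-legomena"] = sum(1 for c in counts.values() if c == k) / len(tokens)
    return feats
```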

3     System Description
The submitted system was built by extracting the features described in the
previous section for each one of the first 20,000 documents in the English and
Spanish training sets. This yields 166 features for English and 198 for Spanish;
the difference is due to the different number of lexicons used for each language.
To obtain words from the character sequences in the documents, XML tags were
removed; then, each consecutive sequence of characters from the English or
Spanish alphabet, delimited by spaces, tabs, newlines or punctuation marks,
produced a word.
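
A minimal sketch of this word-extraction step is shown below; lowercasing is
our own assumption, since the paper does not specify it.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                        # strip XML tags
WORD_RE = re.compile(r"[a-záéíóúüñ]+", re.IGNORECASE)  # runs of en/es letters

def tokenize(document):
    """Every maximal run of English/Spanish letters becomes a word; spaces,
    tabs, newlines and punctuation act as delimiters."""
    text = TAG_RE.sub(" ", document)
    return WORD_RE.findall(text.lower())
```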
    The statistics used in the calculation of the features labelled ‘C’ were
gathered using all the documents in the training set, i.e. 236,600 documents in
English and 75,900 in Spanish. For each word $w$ in the vocabulary we obtained:
$f(w)$, $f_{male}(w)$, $f_{female}(w)$, $f_{10s}(w)$, $f_{20s}(w)$, $f_{30s}(w)$,
$df(w)$, $df_{male}(w)$, $df_{female}(w)$, $df_{10s}(w)$, $df_{20s}(w)$ and
$df_{30s}(w)$.
    These feature datasets were used to train 4 logistic classifiers [7], one
for each combination of target class (age, gender) and language. The
implementation used was the one included in Weka v3.6.9 [3]. The same feature
extractor used on the training data was used to obtain features from the test
documents. Then, the 4 classifiers provided the age and gender predictions for
both languages.
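
A rough sketch of the training step follows; the submitted system used the
Logistic classifier of Weka v3.6.9 [3, 7], so scikit-learn's LogisticRegression
appears here only as an illustrative stand-in.

```python
from sklearn.linear_model import LogisticRegression

def train_classifiers(datasets):
    """Train one logistic classifier per (target, language) pair.
    datasets maps names such as "gender_en" or "age_es" to (X, y) pairs,
    where X is the feature matrix and y the class labels."""
    return {name: LogisticRegression(max_iter=1000).fit(X, y)
            for name, (X, y) in datasets.items()}
```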

4      Experimental Results

In this section, the official results obtained by the proposed system for
predicting authors' age and gender in unseen documents are presented in Table 2.
To assess the contribution of the different feature sets, additional experiments
were carried out using a subset comprised of the first 20,000 documents from the
training set. Each feature subset was evaluated using 10-fold cross validation,
and the average over ten runs with different random folds is reported. Tables 3
through 6 show the results of these experiments.

         Table 2. Official results obtained by our submitted system (accuracy)
                     Language Gender Age Total Baseline
                    Spanish (es) 0.5627 0.5429 0.3145 0.1650
                    English (en) 0.4998 0.4885 0.2450 0.1650

    Table 3. Average accuracies for our 3 categories of feature sets and for all features
       Feature Set Gender en           Age en        Gender es         Age es
         Statistic 0.8393(0.0005) 0.7860(0.0013) 0.8038(0.0007) 0.7866(0.0004)
          Lexicon   0.5933(0.0010) 0.6198(0.0003) 0.6261(0.0007) 0.6446(0.0006)
         Stylistic 0.5502(0.0012) 0.6048(0.0003) 0.5981(0.0008) 0.6336(0.0009)
            All     0.8477(0.0023) 0.7809(0.0002) 0.8202(0.0013)          n/a

Table 4. Average accuracies for features obtained either using or not class attributes
     Feature Set   Gender en         Age en        Gender es        Age es
     Supervised 0.8432(0.0003) 0.7968(0.0006) 0.8155(0.0007) 0.7941 (0.0005)
    Unsupervised 0.5487(0.0012) 0.6075(0.0006) 0.5990(0.0005)         n/a

       Table 5. Average accuracy in subcategories in the “statistics” feature set
                       Gender en        Age en       Gender es        Age es
           Bayes     0.7951(0.0004) 0.7382(0.0015) 0.7696(0.0002) 0.7677(0.0003)
       Cross entropy 0.5527(0.0008) 0.5891(0.0006) 0.5376(0.0006) 0.5624(0.0004)
         Kullback 0.5485(0.0005) 0.6034(0.0003) 0.5896(0.0005) 0.5952(0.0007)
     T-test lexicons 0.5863(0.0006) 0.6204(0.0004) 0.6240(0.0005) 0.6377(0.0003)
       Word given X 0.5416(0.0007) 0.6165(0.0003) 0.6152(0.0007) 0.5979(0.0003)

         Table 6. Average accuracies for each lexicon (max. σ = 0.0072)
         Badwords Cooking Dictionary Emotions Internet Legal Love-Sex Sports Stopwords
Gender en 0.5288   0.5257      n/a       0.5267 0.5270 0.5305      n/a   0.5311 0.5304
 Age en    0.5551  0.5673      n/a       0.5593 0.5697 0.5942      n/a   0.5945 0.5934
Gender es 0.5388   0.5041    0.5433      0.5282 0.5187 n/a        0.5361 0.5359 0.5335
 Age es    0.5774  0.5625    0.5800      0.5709 0.5628 n/a        0.5707 0.5676 0.5701
5   Discussion

As shown in Tables 3-6, the best results for distinguishing gender were obtained,
in both English and Spanish, using all the features, while the supervised
attributes were the better predictors for age range. Both age and gender were
identified more accurately using the statistical features, although these were
more suited to characterizing gender. The best statistical predictors in all
cases were the features based on Bayes' theorem. The lexical and stylistic
features were more useful for distinguishing age than gender. Finally, the
pre-established lists of words do not distinguish gender, although they are
useful for discriminating age.

6   Conclusions

We participated in the 9th PAN evaluation campaign with an author profiling
system based on a set of features extracted from documents and combined with
machine learning. The features were designed in such a way that each one
involves at least one of the following components: stylometry, the use of
pre-fabricated lexicons, and corpus statistics. We developed this system for
Spanish, obtaining 6th place in the official results among 17 participant
systems. However, the same system adapted for English (replacing the Spanish
lexicons) performed poorly on unseen documents.
    In a comprehensive comparison of the different features, we concluded that
the features providing the largest contribution were the ones obtained from
corpus statistics, particularly the proposed score based on Bayes' theorem. To
the best of our knowledge, such (or similar) features have not been used before.


References

[1] Argamon, S., Koppel, M., Pennebaker, J. and Schler, J.: Automatically profiling
   the author of an anonymous text. Communications of the ACM, 52 (2), pp. 119–123
   (2009)
[2] Cheng, N., Chandramouli, R. and Subbalakshmi, K.: Author gender identification
   from text. In: Digital Investigation, Vol 8, N 1, pp 78-88 (2011)
[3] Hall, M., Frank, E., Holmes, G. and Pfahringer, B.: The WEKA data mining soft-
   ware: An update. SIGKDD Explorations, 11(1), pp 10–18 (2009)
[4] Halliday, M. A. K.: Spoken and written language. Geelong Victoria: Deakin Uni-
   versity (1985)
[5] Johansson, V.: Lexical diversity and lexical density in speech and writing: a de-
   velopmental perspective. In: Lund Working Papers in Linguistics, Vol 53, pp 61-79
   (2008)
[6] Koppel, M., Argamon S. and Shimoni A.: Automatically categorizing written texts
   by author gender, Literary and Linguistic Computing 17(4), November 2002, pp.
   401-412 (2002).
[7] le Cessie, S., van Houwelingen, J.C. Ridge Estimators in Logistic Regression. Ap-
   plied Statistics. 41(1):191-201 (1992)
[8] Nguyen, D., Smith, N. and Rosé, C.: Author age prediction from text using linear
   regression. In LaTeCH ’11 Proceedings of the 5th ACL-HLT Workshop on Language
   Technology for Cultural Heritage, Social Sciences, and Humanities, pp 115-123,
   (2011)
[9] Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G. An Overview of the
   Traditional Authorship Attribution Subtask. CLEF (2013) (to appear)
[10] Schler, J., Koppel, M., Argamon, S. and Pennebaker, J.: Effects of Age and Gen-
   der on Blogging. In: Proceedings of AAAI Spring Symposium on Computational
   Approaches for Analyzing Weblogs (2006)
[11] Sidorov, G., Miranda-Jiménez, S., Viveros-Jiménez, F., Gelbukh, A., Castro-
   Sánchez, N., Velásquez, F., Díaz-Rangel, Suárez-Guerra, S., Treviño, A., and
   Gordon, J.: Empirical Study of Opinion Mining in Spanish Tweets. LNAI 7629-7630, pp
   1-14 (2012)
[12] Thoiron, P.: Diversity Index and Entropy as measures of lexical richness. In: Com-
   puters and the Humanities, Vol 20, pp 197-202 (1986)