    Using Intra-Profile Information for Author Profiling
                        Notebook for PAN at CLEF 2014

                  A. Pastor López-Monroy, Manuel Montes-y-Gómez,
                    Hugo Jair Escalante, and Luis Villaseñor-Pineda

          Laboratory of Language Technologies, Department of Computer Science,
                  Instituto Nacional de Astrofísica, Óptica y Electrónica,
                 Luis Enrique Erro No. 1, C.P. 72840, Pue. Puebla, México
                  {pastor, mmontesg, hugojair, villasen}@ccc.inaoep.mx



       Abstract In this paper we describe the participation of the Laboratory of Language
       Technologies of INAOE at PAN 2014. We address the Author Profiling (AP) task by
       finding and exploiting relationships among terms, documents, profiles, and subprofiles.
       Our approach builds on the idea of second order attributes (a low-dimensional and
       dense document representation) [4], but goes further by incorporating information
       from within each target profile. The proposed representation deepens the analysis by
       modeling relationships among texts belonging to the same profile, that is, we focus
       on subprofiles. For this, we automatically find subprofiles and build document vectors
       that represent more detailed relationships between documents and subprofiles. We
       compare the proposed representation with the standard Bag-of-Terms and with the best
       method at PAN 2013, using the PAN 2014 AP corpora. Results show evidence of the
       usefulness of intra-profile information to determine gender and age profiles. According
       to the PAN 2014 official results, the proposed method was among the three best
       approaches for most social media domains. In particular, it achieved the best
       performance in predicting age and gender profiles for blogs and tweets in English.


Keywords: Age Identification, Gender Identification, Subprofile Generation, Subclass
Information


1   Introduction
For several years, the scientific community has been discussing the following basic ques-
tion: how much information about an author can be inferred from his or her documents?
Commonly known as Author Profiling (AP) [3,6,1], this task is of great interest because
of its wide applicability to problems in different areas, such as business intelligence,
criminal law, and computer forensics.
    In this paper, we use well-known textual features for AP [3,6], but we focus on the
representation of documents, exposing its key role in the problem. For this, we mainly
consider the Second Order Attributes (SOA) proposed in [4], an approach that builds
document vectors in a space of profiles. Under this representation, each value in the
vector represents the relationship of the document with one target profile.
Notwithstanding the usefulness of the approach in [4,5], the method has an evident
shortcoming: it basically assumes that the relationship between a vocabulary term (i.e.,
a word) and a group of authors (a target profile) can be represented by one single value.
For example, in López-Monroy et al. (2013) [4], the representation of the word linux is
highly related to the male profile; therefore, its occurrence in a given document causes
a dramatic increase in the probability of the document belonging to the male profile.
This behaviour could make it difficult to identify the correct profile for some authors
(e.g., when classifying documents written by females writing about technology). We
believe that such an assumption is to some extent naive and that it can be alleviated
through the automatic generation of subprofiles.
    In this work, we generate new, highly informative attributes that represent relation-
ships among terms, documents, profiles, and also subprofiles. In order to automatically
generate the aforementioned subprofiles, we propose dividing each target profile into
several groups using a clustering algorithm. Then we build the final document represen-
tation on top of the generated subprofiles, using them as the new target profiles. This
approach further improves the representation of documents for the AP task, and also
mitigates common problems of other standard representations (e.g., the Bag-of-Terms,
BOT), namely: i) high dimensionality, and ii) sparseness of the representation.
Results obtained with these ideas are promising and competitive with other approaches
and systems submitted to the PAN 2014 evaluation.
    The rest of this paper is organized as follows: Section 2 introduces the proposed
approach, Section 3 describes the evaluation and the obtained results, and finally
Section 4 outlines our conclusions.


2     Computing Intra-Profile Relationships

The proposed method has three main stages to represent documents: i) representing
terms in a space of profiles, ii) representing documents in a space of profiles, and iii)
generating subprofiles and re-computing steps i) and ii) using the subprofiles as the new
target profiles. The rest of this section explains these steps in detail.


2.1   Representing terms in a space of profiles

The intuitive idea is to capture the relationship of each term (i.e., a word, an n-gram,
a punctuation mark, etc.) with each one of the target profiles. Let $\{t_1, \ldots, t_m\}$ be the
vocabulary of the corpus, and $\{p_1, \ldots, p_n\}$ be the set of target profiles. We build term
vectors $t_i = \langle tp_{i1}, \ldots, tp_{in} \rangle$, where $tp_{ij}$ represents the relationship of the term $t_i$ with
the profile $p_j$. Equation 1 computes the raw weight $w_{ij}$, which is then normalized in
Equation 2 to obtain $tp_{ij}$.

                                                                      
$$ w_{ij} = \sum_{k:\, d_k \in P_j} \log_2\!\left(1 + \frac{tf_{ik}}{len(d_k)}\right) \qquad (1) $$


    where $P_j$ is the set of training documents with profile $p_j$, $tf_{ik}$ is the term frequency of
the term $t_i$ in the document $d_k$, and $len(d_k)$ is the number of terms in the document $d_k$.
To avoid high cumulative term frequencies in highly unbalanced data, a normalization
that considers the proportion of each term in each profile is applied (Equations 2.1 and
2.2).

$$ (2.1)\quad tp_{ij} = \frac{w_{ij}}{\sum_{i=1}^{m} w_{ij}} \qquad\qquad (2.2)\quad tp_{ij} = \frac{w_{ij}}{\sum_{j=1}^{n} w_{ij}} \qquad (2) $$
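    To make these computations concrete, the following Python sketch builds the raw weights of Equation 1 and applies the two normalizations of Equation 2 in sequence (one possible reading of the formulas) on a toy corpus. The example documents, labels, and variable names are illustrative only and are not part of the PAN 2014 data.

```python
import math
from collections import defaultdict

# Toy training data: (tokenized document, profile label). Purely illustrative.
train_docs = [
    (["i", "love", "linux", "and", "gaming"], "male"),
    (["my", "new", "makeup", "tutorial"],     "female"),
    (["linux", "kernel", "patch", "review"],  "male"),
    (["gaming", "stream", "tonight", "!"],    "female"),
]
profiles = sorted({label for _, label in train_docs})

# Equation (1): w_ij = sum over documents of profile p_j of log2(1 + tf_ik / len(d_k)).
w = defaultdict(lambda: defaultdict(float))            # w[term][profile]
for tokens, label in train_docs:
    tf = defaultdict(int)
    for token in tokens:
        tf[token] += 1
    for term, freq in tf.items():
        w[term][label] += math.log2(1.0 + freq / len(tokens))

# Equation (2.1): normalize each profile column by its total weight over all terms.
col_sum = {p: sum(w[t][p] for t in w) for p in profiles}
tp = {t: {p: (w[t][p] / col_sum[p] if col_sum[p] else 0.0) for p in profiles} for t in w}

# Equation (2.2): normalize each term vector across profiles.
for t in tp:
    row_sum = sum(tp[t].values())
    if row_sum:
        tp[t] = {p: v / row_sum for p, v in tp[t].items()}

print(tp["linux"])   # term vector of "linux" in the space of profiles
```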



2.2    Representing documents in a space of profiles

The intuitive idea is to capture the relationship of each document (i.e., a blog, a tweet,
a review, etc.) with each one of the target profiles. For this, we use the previously com-
puted term vectors to build document vectors in the space of profiles: the representation
of each document is the sum of the vectors of its terms, weighted by their normalized
frequency $tf_{ik}/len(d_k)$. In this way, we build document vectors $d_k = \langle dp_{1k}, \ldots, dp_{nk} \rangle$,
where $n$ is the number of profiles and $dp_{ik}$ reflects the relationship between the document
$d_k$ and the profile $p_i$. Equation 3 expresses this idea.


$$ d_k = \sum_{t_i \in D_k} \frac{tf_{ik}}{len(d_k)} \times t_i \qquad (3) $$


    where $D_k$ is the set of terms that belong to the document $d_k$.
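    As an illustration, the sketch below implements Equation 3, reusing the tp dictionary and profiles list from the previous example; the function name and data layout are our own choices rather than part of the original system.

```python
from collections import Counter

def document_vector(tokens, tp, profiles):
    """Equation (3): sum of the term vectors of the document's terms,
    each weighted by tf_ik / len(d_k)."""
    length = max(len(tokens), 1)          # guard against empty documents
    vec = [0.0] * len(profiles)
    for term, freq in Counter(tokens).items():
        if term not in tp:                # terms unseen in training have no vector
            continue
        weight = freq / length
        for j, profile in enumerate(profiles):
            vec[j] += weight * tp[term][profile]
    return vec

# Usage with the objects built in the previous sketch:
# d = document_vector(["linux", "gaming", "all", "night"], tp, profiles)
```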


2.3    Generating subprofiles

The ideas above generate vectors in which each value represents the relationship between
a document and one target profile. This representation assumes a certain homogeneity
among documents belonging to the same target profile, and hence a single relationship
value per profile is computed [4]. In spite of the usefulness of this assumption [4,5], it is
to some extent naive, because even when a group of authors shares the same general
profile (e.g., females), there may be more specific subgroups with finer differences (say,
young gamer females and housewife females). Thus, each target profile is to some extent
heterogeneous among its authors.
    Generating subprofiles involves discovering natural subgroups among authors be-
longing to the same profile. In this regard, we use the document representation described
above to build document vectors in the space of profiles, and then cluster the documents
of each profile. The intuition of this approach is to use an appropriate base representation
for AP, so that similar documents can be found within that space. Once a set of clusters
(subprofiles) has been generated for each target profile, we rebuild the SOA representation
using all the found clusters as the new target profiles. In this way, as indicated in
Equation 3, we end up with a set of attributes that represent relationships between
documents and detailed subprofiles. In order to build the aforementioned subprofiles,
we used the Expectation-Maximization Clustering (EMC) algorithm.
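    The paper leaves the clustering implementation and the number of subprofiles per profile open; as an illustration only, the sketch below uses scikit-learn's GaussianMixture (an EM-based clusterer) with an assumed three subprofiles per profile to relabel the training documents. The resulting subprofile labels then take the place of the target profiles, and Equations 1-3 are recomputed with them to obtain the final n-SOA document vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_into_subprofiles(doc_vectors, labels, n_sub=3, seed=0):
    """Cluster the documents of each profile in the profile space and relabel
    them with subprofile ids such as 'male_0', 'male_1', ...

    doc_vectors: array of shape (n_docs, n_profiles) built with Equation (3)
    labels:      list of profile labels, one per document
    n_sub:       assumed number of subprofiles per profile (not given in the paper)
    """
    X = np.asarray(doc_vectors)
    sub_labels = [None] * len(labels)
    for profile in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == profile]
        k = min(n_sub, len(idx))                     # no more clusters than documents
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X[idx])
        for i, cluster in zip(idx, gm.predict(X[idx])):
            sub_labels[i] = f"{profile}_{cluster}"
    return sub_labels
```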




3   Experimental Results

We approached the AP problem as two separate tasks: i) age prediction and ii) gender
prediction. Thus, we have five age classes (18-24, 25-34, 35-49, 50-64, and 65 and above)
and two gender classes (male and female). Given this setting, we build subprofile
attributes for age and different subprofile attributes for gender, and then train two
classifiers, one for each representation. In order to evaluate and compare this proposal,
we used the following experimental settings on the training dataset: i) the 3,000 most
frequent terms as features, and ii) the standard LIBLINEAR classifier without any
parameter optimization [2]. As terms we use words, contractions, hyphenated words,
punctuation marks, and a set of common slang vocabulary. From Table 1 it can be seen
that the proposed approach (n-SOA, one attribute per subprofile) outperforms BOT and
the best PAN 2013 approach [4] (1-SOA, one attribute per profile) on the PAN 2014
corpora over different social media domains; training results were obtained with 10-fold
cross-validation over the training dataset. We believe this is because finding subprofiles
within the target profiles provides a more detailed perspective of the documents. In this
regard, n-SOA is a novel representation that captures finer details about profiles and
subprofiles, in contrast to the 1-SOA proposed in [4], which captures more general
information about profiles.
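    As a rough sketch of this evaluation setup, the code below trains scikit-learn's LinearSVC, which is backed by the same LIBLINEAR library [2], with default parameters and estimates accuracy by 10-fold cross-validation. The variables X_age, y_age, X_gender, and y_gender are placeholders for the n-SOA vectors and labels built as described in Section 2.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def evaluate(X, y, folds=10, seed=0):
    """Accuracy under k-fold cross-validation with a default LIBLINEAR classifier."""
    clf = LinearSVC(random_state=seed)    # LIBLINEAR solver, no parameter optimization
    scores = cross_val_score(clf, np.asarray(X), np.asarray(y), cv=folds)
    return scores.mean()

# Placeholders: one representation (and classifier) per subtask.
# accuracy_age    = evaluate(X_age,    y_age)
# accuracy_gender = evaluate(X_gender, y_gender)
```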


Table 1. Prediction accuracy of BOT, 1-SOA [4], and the proposed n-SOA on the training set
(under 10-fold cross-validation) and the test set, for age and gender profiles on the PAN 2014
English corpora.

                     Age and Gender prediction in English corpora
                                  Blogs         Twitter      Social Media     Reviews
 Dataset Representation       Age   Gender   Age   Gender   Age   Gender   Age   Gender
 Train   BoT                 45.57  73.87   39.21  71.52   34.30  54.29   31.17  64.87
 Train   1-SOA               46.72  75.44   43.52  70.52   35.81  55.01   32.63  66.75
 Train   n-SOA               48.07  77.96   47.97  71.98   37.00  55.36   33.92  68.05
 Test    n-SOA               39.74  67.95   49.35  72.08   35.52  52.37   33.37  68.09




Table 2. Prediction accuracy of BOT, 1-SOA [4], and the proposed n-SOA on the training set
(under 10-fold cross-validation) and the test set, for age and gender profiles on the PAN 2014
Spanish corpora.

                     Age and Gender prediction in Spanish corpora
                                  Blogs         Twitter      Social Media
 Dataset Representation       Age   Gender   Age   Gender   Age   Gender
 Train   BoT                 43.18  62.50   39.88  62.60   37.65  63.83
 Train   1-SOA               45.33  62.91   41.54  62.01   38.88  64.47
 Train   n-SOA               48.22  63.05   43.61  62.51   41.42  65.35
 Test    n-SOA               48.21  58.93   53.33  60.00   45.23  64.84




    According to the official PAN 2014 evaluation, the proposed attributes obtained the
best test-set accuracy for age and gender prediction in the blog and Twitter domains for
English. Moreover, the reported results are within the top three positions for the other
social media domains. Thus, the approach presented in this paper is an effective al-
ternative to address the AP task in different social media domains, where documents
present challenging characteristics that hinder the accurate operation of most natural
language processing tools.

4    Conclusions
In this paper we presented a novel approach that considers the information existing
among documents belonging to the same class. That is, even for authors belonging to
the same target profile (e.g., males), the approach looks for more specific subgroups of
authors (e.g., male employees and male gamers) in order to exploit intra-profile infor-
mation. To the best of our knowledge, this is the first time that AP is addressed using
this kind of intra-class relationship within the target profiles. Such relationships help to
achieve better discrimination among profiles. Using these automatically generated
attributes, the classifier maintains good classification rates, even for imbalanced data.
This is due to the relations among terms, documents, and subprofiles, which provide
few but highly detailed predictive attributes. We have shown better experimental results
than the standard BOT, the best method of PAN 2013, and most of the approaches
participating at PAN 2014.


Acknowledgements: This work was partially funded by project CONACyT-Mexico
134186 and the ECOS program under project M11-H04. López-Monroy also thanks
CONACyT-México for doctoral scholarship 243957.




References
1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of
   an anonymous text. Communications of the ACM 52(2), 119–123 (2009)
2. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large
   linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008)
3. Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author
   gender. Literary and Linguistic Computing 17(4), 401–412 (2002)
4. López-Monroy, A.P., Montes-y-Gómez, M., Escalante, H.J., Villaseñor-Pineda, L.,
   Villatoro-Tello, E.: INAOE's participation at PAN'13: Author profiling task. In: Notebook
   Papers of CLEF 2013 Labs and Workshops, Valencia, Spain (2013)
5. Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author
   profiling task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) Working Notes Papers
   of the CLEF 2013 Evaluation Labs (2013)
6. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of age and gender on blogging.
   In: Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for
   Analyzing Weblogs. pp. 199–205 (2006)



