<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can We Hide in the Web? Large Scale Simultaneous Age and Gender Author Profiling in Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ubiquitous Knowledge Processing Lab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Technische Universität Darmstadt</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>Would you target your audience differently if you knew the real age and gender of the authors writing on your website forum? This paper examines hundreds of thousands of online documents, e.g. chat lines or blog posts, and shows that computers can address this task better than humans, without relying on content stereotypes. Since age and gender profiling are not independent problems, we approach the task as a multiclass classification problem, combining the age and gender information to define six classes. Using a wide range of stylistic and content features and a large number of readability measures, we demonstrate the high predictive power of parts of speech, punctuation and the amount of emotion and slang used in the text, independently of the topic discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The author profiling task aims at revealing certain categorical information about the
author, rather than his or her exact identity. Such categories can be the author's age or
gender, but also the native country, level of education or other socio-demographic
information. Besides its obvious applications in marketing, author profiling can also be
beneficial in the educational domain, e.g. in large-scale screenings of pupils, where it
can help to identify exceptional talents. It can also help to estimate the appropriate
knowledge level of the audience in an educational forum.</p>
      <p>The PAN challenge task targeted the prediction of the age and gender of a document
author. Training corpora were provided for English and Spanish. They
consisted of XML documents containing blog posts or chat messages (HTML format)
grouped into one document per author and labelled with the author's language, gender and
age group. The final software had to be run on an assigned virtual machine with a
single CPU and 4 GB of RAM.</p>
      <p>This paper presents our classification approach and implemented features, after
which we discuss our experimental results for age and gender separately and combined.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Studying gender differences and comparing them to social stereotypes has been a
popular task in many psychological studies of the 20th century [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Traditional studies
worked on small datasets, which often led to contradictory results (see e.g. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] vs. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]).
Most studies agree that two main feature groups distinguish
gender: stylistic and content-based features. The first detailed gender study on a larger scale
was performed by Newman, Pennebaker et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] on 14,324 samples from 70
different studies (conversation, exams, fiction etc.). According to them, women are more
likely to include pronouns, verbs, negations, references to home, family, friends and to
various emotions. Men tend to use longer words, more articles, prepositions and
numbers. Men also swear more often and discuss current concerns (e.g. money, leisure or
sports). Schler, Koppel et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] apply machine learning techniques to a corpus of
37,478 blogs from blogger.com. Using classes of content words from the LIWC
Framework [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] extended by blog slang (words and abbreviations such as LOL or OMG) and
style-related features such as part-of-speech (POS) and function words, they were able
to obtain an accuracy of 80% for gender and 76% for age, based on the Multi-Class
Real Winnow classification algorithm. They found differences in topics which men and
women discuss, as well as the increasing number of prepositions and determiners with
age, together with the decreasing number of pronouns and negations. They report that
the usage of hyperlinks increases with age. In another publication [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] they reach 72%
accuracy for gender and 67% for age on the same corpus, using only stylistic features:
POS tags, function words and contracted words without an apostrophe (im, dont...). Koppel
et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] also analyse gender differences based on 566 fiction and non-fiction
documents from the British National Corpus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], using POS n-grams and function words.
They reach an accuracy of 77%; however, training on fiction and testing on non-fiction
does not beat the 50% random baseline. Heylighen and Dewaele [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduce a
contextuality measure based on the proportion of formal parts of speech (nouns, adjectives...) to
informal ones (verbs, adverbs, interjections...). Corney et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduce emotionally intense
adjective and adverb endings, based on the assumption that women use more emotional
words such as fabulous or awfully.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Corpus properties</title>
      <p>sites of various travel agencies<sup>1</sup>. The third problem relates to the spam in the data.
In more than 20,000 English documents, words from the WordPress Codex spam list<sup>2</sup>
constitute at least 0.1% of all document words, meaning that if any of these document
texts appeared in the comments under a WordPress blog with an active spam list, it
would be quarantined for manual spam moderation by the blog administrator.</p>
    </sec>
    <sec id="sec-4">
      <title>Our Approach</title>
      <p>
        We combine age and gender information to create six separate document classes and
perform multiclass classification, using the one-against-all training approach. Certain
stylistic features can be highly predictive of both gender and age, which makes it
necessary to determine both at the same time in the classification. Smileys
are one example, as illustrated in Figures 1(a)-1(d). This correlation was previously
also observed by Schler and Koppel [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
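      <p>The class construction described above can be sketched as follows. This is a minimal illustration rather than the system's actual code: the label strings, the helper names joint_label and one_vs_all_targets, and the choice of Python are assumptions made for the example.</p>

```python
from itertools import product

# Label names are illustrative; the text only states that three age groups
# and two genders are combined into six classes.
AGE_GROUPS = ("10s", "20s", "30s")
GENDERS = ("female", "male")


def joint_label(age_group: str, gender: str) -> str:
    """Combine an age group and a gender into one class label."""
    if age_group not in AGE_GROUPS or gender not in GENDERS:
        raise ValueError(f"unknown label: {age_group!r}, {gender!r}")
    return f"{age_group}_{gender}"


# The six joint classes used for multiclass classification.
ALL_CLASSES = [joint_label(a, g) for a, g in product(AGE_GROUPS, GENDERS)]


def one_vs_all_targets(labels, positive_class):
    """Binary targets for one of the six one-against-all classifiers."""
    return [1 if label == positive_class else 0 for label in labels]
```

      <p>In a one-against-all setup, one binary classifier would be trained per entry of ALL_CLASSES, and the class whose classifier is most confident would be predicted.</p>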
      <p>
        Our system builds upon the Darmstadt Knowledge Processing Software Repository
(DKPro Core)<sup>3</sup> [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], an open-source Natural Language Processing framework based
on the Apache Unstructured Information Management Architecture (UIMA)<sup>4</sup>. The system
uses the DKPro Lab [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] framework to combine NLP components into pipelines. We
preprocess the data using the TreeTagger [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] for POS tagging and lemmatization for
both languages. For English we additionally use the Stanford Named Entity Recognizer
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Having experimented with SVMs with polynomial (degrees 1, 2 and 3) and RBF kernels,
and with the Updateable Naive Bayes classifier, we trained the final system using
logistic regression with an unlimited number of iterations and with the ridge estimators
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] in their default configuration in the Weka [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] machine learning framework. While
the SVM performed best on a small training set (6,000 documents), the computational
cost of training grew too fast with the size of the training set. Since the system had to deal with
unknown test sets, we preferred to sacrifice some performance for scalability.
      </p>
      <p>To select the training subsets, we first eliminate the documents whose text consists
of more than 0.1% spam words. From the remaining data, we randomly select 5,000
documents for each age and gender combination, thus obtaining 30,000 training
documents for English and 22,500 training documents for Spanish (the corpus contained
only 2,500 Spanish teenage authors).</p>
      <sec id="sec-4-1">
        <p>1 e.g. http://www.ganesha-holidays.com/madhya.html</p>
        <p>2 http://codex.wordpress.org/Spam_Words</p>
        <p>3 http://code.google.com/p/dkpro-core-asl/</p>
        <p>4 http://uima.apache.org/</p>
        <sec id="sec-4-1-1">
          <title>Experiments</title>
          <p>We compare our results to the majority class baseline: an accuracy of 0.5 for gender and 0.33
for each of the three age classes (0.17 combined). Although authors over 30 years made up
41% and 57% of the provided data for the two languages, respectively, our training sets have more equally distributed
instances, as finding the outliers in author profiles is often more useful in practice.</p>
          <p>To approximate human performance on the corpus, we conducted a
user study. Twenty randomly selected English documents, containing about 500 words
each, were evaluated by 15 participants. We measured the accuracy based on the
majority vote. The confusion matrix is displayed in Table 2. Human participants reached
an overall accuracy of 25% (50% on determining gender and 55% on determining age),
while a simple majority class baseline would result in 55% accuracy on gender and 50%
on age. They assigned the majority of the texts to authors in their 30’s. This could
suggest that teenage authors copy their blog content from other sources, or that
they do not give correct age information about themselves.</p>
          <p>
            We divide the features into five classes, described below. We use the Information Gain
feature selection approach [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] to rank and prune the feature space, using the top 1500
features.
          </p>
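          <p>The ranking step can be illustrated with a small, self-contained sketch of information gain. It is not the Weka implementation used in the experiments: it assumes that feature values are already discrete (numeric features would first have to be discretised), and the function names and the dictionary layout of the feature columns are hypothetical.</p>

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Rank feature names by information gain and keep the k best."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]
```

          <p>With k set to 1500, top_k_features would mirror the pruning described above; each entry of columns maps a feature name to its list of values, one per document.</p>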
          <p>Surface features. To capture the surface properties of the text, we measure the length of
documents, sentences and words and their proportions to each other. We also compute the
ratio of words longer than five letters and of words shorter than three letters
to all words, and we count the occurrences of web links and smileys. Several further
features are extracted using regular expressions, such as words with repeated letters
(e.g. cooool, wooow), words with numbers (e.g. w8, ton8) and number patterns such as
phone numbers.</p>
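          <p>A few of the surface features above could be extracted along the following lines. The regular expressions, the whitespace-insensitive tokenizer and the feature names are assumptions made for this sketch; the expressions used in the actual system are not given in the text.</p>

```python
import re

# Illustrative patterns only.
WEB_LINK = re.compile(r"https?://\S+|www\.\S+")
# A word containing the same letter three or more times in a row (cooool).
REPEATED_LETTERS = re.compile(r"\b\w*([A-Za-z])\1{2,}\w*\b")
# A word mixing at least one letter and one digit (w8, l8r).
WORD_WITH_DIGIT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")
TOKEN = re.compile(r"\w+")


def surface_features(text):
    """Compute a handful of surface features for one document."""
    links = WEB_LINK.findall(text)
    # Remove links first so that "www" is not counted as a repeated-letter word.
    stripped = WEB_LINK.sub(" ", text)
    tokens = TOKEN.findall(stripped)
    total = max(len(tokens), 1)
    return {
        "web_links": len(links),
        "long_word_ratio": sum(len(t) > 5 for t in tokens) / total,
        "short_word_ratio": sum(3 > len(t) for t in tokens) / total,
        "repeated_letter_words": sum(1 for _ in REPEATED_LETTERS.finditer(stripped)),
        "words_with_digits": sum(1 for _ in WORD_WITH_DIGIT.finditer(stripped)),
    }
```

          <p>Smiley counts and phone number patterns would be added in the same way, each with its own expression.</p>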
          <p>Syntactic features and punctuation. Syntactic features constitute the majority of all
features, as they proved helpful in previous work and are at the same time
robust enough to be used across corpora and languages. We extract POS unigrams, bigrams,
trigrams and quadrigrams as well as the ratio of each POS type separately. We implement
the contextuality measure (Heylighen and Dewaele, 2002), comparing the implicitness
and explicitness of the text based on the POS tags used:
<disp-formula><tex-math>F = \left(f_{\mathrm{noun}} + f_{\mathrm{adjective}} + f_{\mathrm{preposition}} + f_{\mathrm{article}} - f_{\mathrm{pronoun}} - f_{\mathrm{verb}} - f_{\mathrm{adverb}} - f_{\mathrm{interjection}} + 100\right) \times 0.5</tex-math></disp-formula>
where each f denotes the frequency of the given part of speech.</p>
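          <p>The contextuality measure can be computed directly from POS counts, as in the sketch below. It assumes that each frequency is expressed as a percentage of all tokens, which is what makes the constant 100 bound the score between 0 and 100; the category names and the function name are hypothetical and would have to be mapped from the tagset in use.</p>

```python
# Category names are placeholders for whatever the POS tagset provides.
FORMAL = ("noun", "adjective", "preposition", "article")
CONTEXTUAL = ("pronoun", "verb", "adverb", "interjection")


def contextuality_score(pos_counts, total_tokens):
    """F = (formal freq. - contextual freq. + 100) * 0.5, frequencies in percent."""
    if not total_tokens > 0:
        raise ValueError("total_tokens must be positive")

    def percentage(category):
        # Assumption: frequency means the share of all tokens, in percent.
        return 100.0 * pos_counts.get(category, 0) / total_tokens

    formal = sum(percentage(c) for c in FORMAL)
    contextual = sum(percentage(c) for c in CONTEXTUAL)
    return (formal - contextual + 100.0) * 0.5
```

          <p>A text made up only of formal categories would score 100 and one made up only of contextual categories would score 0, so higher values indicate a more explicit style.</p>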
          <p>We measure the proportion of singular and plural nouns, proper nouns and pronouns
(both together and separately), as well as the ratio of personal pronouns for each
grammatical form separately (I, me vs. he, him...). By measuring the ratios of comparative
and superlative adjectives and adverbs, and question mark and exclamation mark
patterns, we expect a clearer distinction of the teenage style. We retrieve the proportions of
inner punctuation, end punctuation and commas, as labelled by the POS tagger. We
further extract the proportion of modal verbs, which we subdivide for English into modals
expressing certainty (shall, will...) and uncertainty (could, may...). We also measure the
ratio of future and past verb tenses. Some of the features have not been adapted to
Spanish due to the different POS tagset used.</p>
          <p>
            Readability measures. We implemented the most prominent readability measures, such
as the Flesch-Kincaid Grade Level [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], the Automatic Readability Index [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ], the LIX
Index [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], the Coleman-Liau Index [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] and the Flesch Reading Ease [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. Most of
them are computed using the average word and sentence lengths and the number of syllables
per sentence, combined with manually determined weights. The SMOG grade [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] and
the Gunning-Fog Index [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] also consider the number of complex words defined as
words with three or more syllables. We did not adapt these readability measures to the
Spanish corpus.
          </p>
          <p>
Semantic features. We experimented with retrieving the most frequent semantic triples,
which are popular mainly in question answering tasks. A semantic sentence triple
consists of a discourse entity, a semantic relation and a governing word to which the
entity relates, e.g. i-want-you, you-think-this. We suspected men to refer more to actions
(you-should-X) and women to feelings (i-love-X). However, the rank of these features
decreased as the dataset grew, as did the rank of word n-grams. We performed WordNet
lookup using the Java WordNet Library<sup>5</sup> to extract the number of senses for nouns and
verbs in the text. Unfortunately, we had to exclude these semantic features from the
final configuration to reduce processing time on the given machine.
          </p>
          <p>
Content features, lexical features and stopwords. We use word unigrams and bigrams
and stopword unigrams and bigrams based on the Snowball stopword list<sup>6</sup>. The named
entity features capture the number of named entities in the document, using the Stanford
Named Entity Recognizer, in particular the 3-class model with distributional similarity
features for tagging all entities of the types Person, Organization and Location. We
use both the overall named entity counts and the average number of named entities
per sentence as features. We also composed 23 word lists inspired by web resources<sup>7,8</sup>
and previous work [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ]; a full overview can be found in Table 3. As our main
goal was to create a robust, dataset-independent system, we focused mainly on lists
expressing various emotions (anger, fear...) or language styles (teenage neologisms, web
slang words, swear words...) rather than discussion topic areas.
          </p>
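          <p>Two of the readability measures named above, ARI and LIX, depend only on word and sentence lengths and are therefore easy to sketch. The constants are the commonly published ones (4.71, 0.5 and 21.43 for ARI; words longer than six letters count as long for LIX). The letter-based tokenizer and the punctuation-based sentence counter are crude assumptions, not the preprocessing used in the system.</p>

```python
import re


def _words(text):
    """Very rough tokenizer: maximal runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)


def _sentence_count(text):
    """Count runs of sentence-final punctuation, with a minimum of one."""
    return max(len(re.findall(r"[.!?]+", text)), 1)


def automated_readability_index(text):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    words = _words(text)
    if not words:
        raise ValueError("text contains no words")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / _sentence_count(text) - 21.43


def lix(text):
    """LIX = words/sentences + 100 * long_words/words (long: more than 6 letters)."""
    words = _words(text)
    if not words:
        raise ValueError("text contains no words")
    long_words = sum(len(w) > 6 for w in words)
    return len(words) / _sentence_count(text) + 100.0 * long_words / len(words)
```

          <p>Measures such as Flesch-Kincaid, SMOG or Gunning-Fog would additionally need a syllable counter, which is omitted here.</p>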
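          <table-wrap>
            <label>Table 3</label>
            <caption>
              <p>Word lists</p>
            </caption>
            <table>
              <thead>
                <tr><th>List name</th><th>Size in words</th><th>Example</th></tr>
              </thead>
              <tbody>
                <tr><td>Teenage words</td><td>117</td><td>bro, geez, tonite, lol</td></tr>
                <tr><td>Spam words</td><td>85</td><td>viagra, casino, shoes, -online</td></tr>
                <tr><td>People words</td><td>134</td><td>relative, sister, team-mate</td></tr>
                <tr><td>Emotion words</td><td>297</td><td>angry, calm, crazy, bored</td></tr>
                <tr><td>Family words</td><td>166</td><td>family, grandpa, husband, wife</td></tr>
                <tr><td>Swear words</td><td>102</td><td>shit, fuck</td></tr>
                <tr><td>Computer words</td><td>270</td><td>gigabyte, CPU, network</td></tr>
                <tr><td>Positive emotions</td><td>297</td><td>cheerful, amused, gracious, joyful</td></tr>
                <tr><td>Positive feelings</td><td>89</td><td>delighted, proud, pleased</td></tr>
                <tr><td>Negative words</td><td>507</td><td>miserable, scared, stressed, angry</td></tr>
                <tr><td>Sensation words</td><td>141</td><td>sore, tight, cold, sharp</td></tr>
                <tr><td>Certainty words</td><td>16</td><td>convinced, certain, clearly, always</td></tr>
                <tr><td>Politics words</td><td>309</td><td>voter, slogan, campaign</td></tr>
                <tr><td>Clothing words</td><td>279</td><td>skirt, trousers, earrings</td></tr>
                <tr><td>Clarification words</td><td>17</td><td>pardon, repeat, example, clarify</td></tr>
                <tr><td>Uncertainty words</td><td>13</td><td>perhaps, maybe, unsure</td></tr>
                <tr><td>Car words</td><td>207</td><td>engine, diesel, gearbox, chrome</td></tr>
                <tr><td>Work words</td><td>287</td><td>employee, bonus, recruiter, boss</td></tr>
                <tr><td>School words</td><td>69</td><td>homework, math, teacher</td></tr>
                <tr><td>Sadness words</td><td>34</td><td>sorrowful, hopeless, broken, sad</td></tr>
                <tr><td>Anger words</td><td>52</td><td>mad, aggressive, outraged</td></tr>
                <tr><td>Fear words</td><td>45</td><td>nervous, worried, panicked</td></tr>
              </tbody>
            </table>
          </table-wrap>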
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>At the time of writing, the challenge results have not yet been finalized. We
randomly split our selected training sets into 80% training and 20% test data for the
evaluation. The performance of our systems was compared to the majority class baseline. Results
are shown in Table 5.</p>
      <sec id="sec-5-1">
        <p>5 http://jwordnet.sourceforge.net</p>
        <p>6 http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/</p>
        <p>7 http://www.enchantedlearning.com/</p>
        <p>8 http://eqi.org/fw_neg.htm</p>
        <sec id="sec-5-1-1">
          <title>Gender profiling</title>
          <p>
            When trained only for gender profiling, our system reaches an accuracy of 0.58 on the
English and 0.65 on the Spanish dataset. As we observed much noise in the English corpus,
we also tested our system on the English corpus from Mukherjee et al. [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] (3,227
authors), on which we reach an accuracy of 0.65, comparable to previous experiments
with a similar classification setup [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ].
          </p>
          <p>
            Features that appeared among the 50 best performing ones in at least two datasets are
listed in Table 6. Men tend to use longer words and more articles, in accordance
with [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], and talk more about computers. Women are likely to use more emotional
words, smileys and exclamations. They are also more likely to talk about love. Longer
word n-grams have no impact in any of the datasets. In the English dataset we also observed
a higher usage of hyperlinks by men, as previously noted by [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ], and highly ranked
readability measures which are based on word length (ARI, LIX). However, this is not
the case for readability measures based on the number of syllables (Flesch, SMOG, FOG).
Hence the usage of hyperlinks and long words may only suggest that long words could
be names of specific websites and that male blogs in our corpus are simply more likely to
contain spam.
          </p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Age profiling</title>
          <p>Training our system for age profiling only, we reach an accuracy of 0.53 on the English
and 0.57 on the Spanish dataset.</p>
          <p>
            Top ranked features shared by both datasets are listed in Table 6. Older authors
tend to write longer posts using longer words. They pay more attention to commas,
although their sentences are not necessarily longer. Younger authors also use more
pronouns and fewer nouns and articles; similar features distinguish male and female
authors, as also pointed out by [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. The highest ranked features in the Spanish dataset
were smileys, commas, and alternative spellings in which teenagers replace the letter q
with k, such as ke, kiero. Adults also talk more about work and god. On
the English dataset, we observed a higher usage of hyperlinks by older authors, lower
readability and more frequent punctuation. Topic word lists played an important role:
younger people use more emotional words, neologisms and slang, talk more about other
people (classmates, parents...) and about computers.
          </p>
          <p>The English dataset suffered from different errors than the Spanish one. While the
major issue in the Spanish dataset was distinguishing teenage authors from authors in their 20’s, in
the English dataset it was distinguishing teenagers from mature authors (30’s). If we
assume that all authors reported their correct age, this might be caused by plagiarism
and by the fact that the corpus also contained chat conversations of sexual predators
from PAN 2012<sup>9</sup>, which we did not particularly address.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>Final system</title>
          <p>We reach an overall accuracy of 0.29 on the English dataset and 0.38 on the Spanish
one, with the majority baseline being 0.17. English and Spanish confusion matrices
for the final system are displayed in Table 7. For the English corpus, we achieve the
best recall for 20’s men (43%) and the lowest for teenage men (25%), who are often
misclassified as 30’s women (21%). In Spanish we obtain the best recall for 30’s men
(46%) and the worst also for teenage men (31%), but these are often misclassified as
20’s men (24%). The feature ranking in the final system is listed in Table 4. The highest
ranked feature is the number of hyperlinks for English and the number of smileys for
Spanish, together with, in both cases, the ratio of words longer than five letters and the punctuation
features. Surface features are ranked surprisingly high as well, followed by readability
measures. While in English the ratio of individual part-of-speech tags plays an
important role, in Spanish POS trigrams and quadrigrams are preferred. Of the vocabulary
lists, teenage words, emotion words and work words (see Table 3) are the most
dominant, followed by the expressions of positive feelings and uncertainty. Of all the word
n-grams, only the unigrams love and ur were selected for the English corpus and the
unigrams distinguishing letters, such as k, q, ke, que, for the Spanish corpus.</p>
          <p>9 http://www.uni-weimar.de/medien/webis/research/events/pan-12/pan12-web/authorship.html</p>
          <p>[Table residue; only labels survive. Feature classes: Surface, Readability, Content, Syntactic, Punctuation, Lexical. Features: Ending -ly, Noun rate, Pronoun rate, Adverb rate, Preposition rate, Verb rate, Contextuality measure, Plural ratio, Pronoun singular, Pronoun I. Configurations: Maj. class equal distr. baseline, Human evaluation, Surface features, Syntactic &amp; punct. features, Content &amp; lex. features, Synt. &amp; punct. &amp; cont. &amp; lex., All features combined.]</p>
          <p>We compare the performance of the classifiers trained on each feature class
separately and on all of them together. The results are shown in Table 8. Neither of the datasets
is sufficiently separable by surface features alone, which reach accuracies of only 0.20 and
0.21, respectively.</p>
          <p>Syntactic features performed well on the Spanish dataset. Errors occurred mainly for
teenagers being incorrectly classified as 20’s men (18%), some 20’s men classified as
30’s women (21%), and some 30’s women classified as 30’s men (25%) and vice versa
(20%). On the English dataset syntactic features show lower accuracy. Many women
in their 20’s were classified as 20’s men (22%), some 30’s men were misclassified as
one decade younger (18%), and both genders of teenagers were in 20% of the cases
incorrectly classified as adult women in their 30’s, potentially due to plagiarism.</p>
          <p>Content features alone were the best performing of all individual feature classes. They
were suitable for distinguishing age groups, but had problems with recognizing gender: on the
Spanish dataset, 21% of 20’s women were classified as 20’s men and vice versa, 29% of 30’s
women as 30’s men and 23% of 30’s men as 30’s women. The English dataset suffered
from similar errors as with syntactic features. Teenage slang and emotion words were
the most helpful word lists, hence removing the topic bias (work words, family words
etc.) did not impact the performance.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>
        To our knowledge, our system was the first to approach the age and gender problem as a
single multiclass classification problem, which helped us to observe both tasks in
context and confirm that age and gender profiling are not independent problems. We
have shown that both of them can be determined by the same features (young men are
more emotional than older ones, and so are women, which is visible through stylistic
features). We were the first to employ readability measures in this task, and we show
that these are ranked high in both age and gender classification. This is in accordance
with the high ranks of words longer than five letters, which are used more by men and
mature authors. While we observe, with regard to syntactic and content features,
similar findings to previous work [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] mainly on the Spanish corpus, syntactic
features were not dominant on the English corpus, probably due to strong noise
potentially caused by the presence of spammers, plagiarists and sexual predators. When we
ran our system on a cleaner English corpus [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we drew similar conclusions to the state-of-the-art
literature. We have demonstrated that humans perform worse than
computers in this task (close to random), as they cannot capture patterns in the data well and
rely on content stereotypes. While content features perform better overall than syntactic
features, the accuracy of the latter is satisfactory and they can be adapted more easily for a
multilingual system, whereas e.g. translating teenage slang is challenging without very good
knowledge of the target language.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
          </string-name>
          , J.:
          <article-title>Mining the blogosphere: Age, gender and the varieties of self-expression</article-title>
          .
          <source>First Monday</source>
          <volume>12</volume>
          (
          <issue>9</issue>
          ) (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
          </string-name>
          , J.:
          <article-title>Automatically profiling the author of an anonymous text</article-title>
          .
          <source>Commun. ACM</source>
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          -
          <lpage>123</lpage>
          (
          <year>Feb 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Björnsson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Läsbarhet:
          <article-title>Lesbarkeit durch Lix. (Aus dem Schwedischen)</article-title>
          .
          <source>(Pedagogiskt Utvecklingsarbete vid Stockholms Skolor</source>
          .
          <volume>6</volume>
          .),
          <source>Liber</source>
          (
          <year>1968</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Eckart de Castilho</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A lightweight framework for reproducible parameter sweeping in information retrieval</article-title>
          .
          <source>In: Proceedings of the 2011 workshop on Data infrastructurEs for supporting information retrieval evaluation</source>
          . pp.
          <fpage>7</fpage>
          -
          <lpage>10</lpage>
          . ACM (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Clear</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>The digital word</article-title>
          .
          <source>chap. The British national corpus</source>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>187</lpage>
          . MIT Press, Cambridge, MA, USA (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Coleman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liau</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A computer readability formula designed for machine scoring</article-title>
          .
          <source>Journal of Applied Psychology</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ),
          <fpage>283</fpage>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Corney</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>de Vel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohay</surname>
          </string-name>
          , G.:
          <article-title>Gender-preferential text mining of e-mail discourse</article-title>
          .
          <source>In: Computer Security Applications Conference</source>
          ,
          <year>2002</year>
          .
          <source>Proceedings. 18th Annual</source>
          . pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grenager</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          .
          <source>In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics</source>
          . pp.
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          . Association for Computational Linguistics (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Flesch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A new readability yardstick</article-title>
          .
          <source>The Journal of Applied Psychology</source>
          <volume>32</volume>
          (
          <issue>3</issue>
          ),
          <fpage>221</fpage>
          (
          <year>1948</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gleser</surname>
            ,
            <given-names>G.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottschalk</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>John</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>The relationship of sex and intelligence to choice of words: A normative study of verbal behavior</article-title>
          .
          <source>Journal of Clinical Psychology</source>
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <fpage>182</fpage>
          -
          <lpage>191</lpage>
          (
          <year>1959</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gunning</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>The fog index after twenty years</article-title>
          .
          <source>Journal of Business Communication</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>13</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mühlhäuser</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steimle</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weimer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zesch</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Darmstadt knowledge processing repository based on UIMA</article-title>
          .
          <source>In: Proceedings of the First Workshop on Unstructured Information Management Architecture at Biannual Conference of the GSCL</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA data mining software: an update</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Heylighen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dewaele</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Variation in the contextuality of language: An empirical measure</article-title>
          .
          <source>Foundations of Science</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>293</fpage>
          -
          <lpage>340</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kincaid</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fishburne Jr</surname>
            ,
            <given-names>R.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chissom</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          :
          <article-title>Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel</article-title>
          .
          <source>Tech. rep., DTIC Document</source>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimoni</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Automatically categorizing written texts by author gender</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Lakoff</surname>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          :
          <article-title>Language and woman's place</article-title>
          , vol.
          <volume>56</volume>
          . Cambridge Univ Press (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Le Cessie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Houwelingen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Ridge estimators in logistic regression</article-title>
          .
          <source>Applied Statistics</source>
          pp.
          <fpage>191</fpage>
          -
          <lpage>201</lpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>McLaughlin</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          :
          <article-title>SMOG grading: A new readability formula</article-title>
          .
          <source>Journal of Reading</source>
          <volume>12</volume>
          (
          <issue>8</issue>
          )
          ,
          <fpage>639</fpage>
          -
          <lpage>646</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>McMillan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifton</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGrath</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gale</surname>
            ,
            <given-names>W.S.</given-names>
          </string-name>
          :
          <article-title>Women's language: Uncertainty or interpersonal sensitivity and emotionality?</article-title>
          <source>Sex Roles</source>
          <volume>3</volume>
          (
          <issue>6</issue>
          ),
          <fpage>545</fpage>
          -
          <lpage>559</lpage>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Mukherjee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving Gender Classification of Blog Authors</article-title>
          .
          <source>In: EMNLP'10</source>
          . pp.
          <fpage>207</fpage>
          -
          <lpage>217</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Mulac</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Studley</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blau</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The gender-linked language effect in primary and secondary students' impromptu essays</article-title>
          .
          <source>Sex Roles</source>
          <volume>23</volume>
          (
          <issue>9-10</issue>
          ),
          <fpage>439</fpage>
          -
          <lpage>470</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groom</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Handelman</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
          </string-name>
          :
          <article-title>Gender Differences in Language Use: An Analysis of 14,000 Text Samples</article-title>
          .
          <source>Discourse Processes</source>
          pp.
          <fpage>211</fpage>
          -
          <lpage>236</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Francis</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Booth</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>Linguistic inquiry and word count: LIWC 2001</article-title>
          . Mahwah: Lawrence Erlbaum Associates (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niederhoffer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Psychological aspects of natural language use: Our words, our selves</article-title>
          .
          <source>Annual Review of Psychology</source>
          <volume>54</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>547</fpage>
          -
          <lpage>577</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Effects of age and gender on blogging</article-title>
          .
          <source>In: Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs</source>
          . pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Treetagger</article-title>
          . TC project at the Institute for Computational Linguistics of the University of Stuttgart (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , (U.S.),
          <string-name>
            <surname>A.F.A.M.R.L</surname>
          </string-name>
          .:
          <source>Automated Readability Index</source>
          . AMRL-TR-66-220
          , Aerospace Medical Research Laboratories (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.O.</given-names>
          </string-name>
          :
          <article-title>A comparative study on feature selection in text categorization</article-title>
          .
          <source>In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97)</source>
          . pp.
          <fpage>412</fpage>
          -
          <lpage>420</lpage>
          . Morgan Kaufmann Publishers, Inc. (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Predicting gender from blog posts</article-title>
          .
          <source>Tech. rep.</source>
          . University of Massachusetts Amherst, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>