<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Age Detection Using Text Readability Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Avar Pentel</string-name>
          <email>pentel@tlu.ee</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tallinn University</institution>
          ,
          <addr-line>Tallinn</addr-line>
          ,
          <country country="EE">Estonia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the results of automatic age detection based on very short texts of about 100 words per author. Instead of the widely used n-grams, only text readability features are used in the current study. The training datasets presented two age groups: children and teens up to age 16, and adults 20 years and older. Logistic Regression, Support Vector Machines, C4.5, k-Nearest Neighbor, Naïve Bayes, and AdaBoost algorithms were used to build models. Altogether, ten different models were evaluated and compared. The model generated by Support Vector Machines with AdaBoost yielded an f-score of 0.94; Logistic Regression yielded 0.93. A prototype age detection application was built using the best model.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic age detection</kwd>
        <kwd>readability features</kwd>
        <kwd>logistic regression</kwd>
        <kwd>support vector machines</kwd>
        <kwd>Weka</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Full-scale authorship profiling is not an option here, because a
large amount of text per author is needed. Some authors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] argue that at least 10,000 words per author are needed; others
say 5,000 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. But if we think about the business purpose of this kind of age
detector, especially when the purpose is to prevent criminal acts,
then there is no time to collect a large amount of text written by a
particular user.
      </p>
      <p>When automatic age detection studies follow authorship profiling
conventions, they run into a second problem: the features widely
used in authorship profiling are semantic features. The probability
that some sequence of words, or even a single word, occurs in a
short text is too low, and a particular word characterizes the
context [3] better than the author. Some authors use character
n-gram frequencies to profile users, but again, if we speak about
texts that are only about 100 words long, these features can also be
very context dependent.</p>
      <p>
        Semantic features are related to a third problem: they are costly.
Using part-of-speech tagging systems to categorize words and/or
large feature sets for pattern matching takes time and space. If our
goal is to perform age detection fast and online, then it is better
to have a few features that can be extracted instantly on the client
side. In order to avoid all three previously mentioned shortcomings,
we propose another set of features. We call them readability
features, because they have previously been used to evaluate text
readability. Text readability indexes were developed before
computerized text processing; for example, the Gunning Fog index
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] takes into account complex (or difficult) words, those
containing 3 or more syllables, and the average number of words per
sentence. If sentences are too long and there are many difficult
words, the text is considered not easy to read, and more education
is needed to understand this kind of text. The Gunning Fog index is
calculated with formula (1) below:
GunningFogIndex = 0.4 × (words / sentences + 100 × complexwords / words)  (1)
We suppose that an author's reading and writing skills are
correlated, and that by analyzing the readability of an author's
text we can infer his/her education level, which at least up to a
particular age is correlated with the actual age of the author. As
readability indexes work reliably on texts of about 100 words, they
are good candidates for our task with short texts.
      </p>
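      <p>As a hedged illustration of formula (1), the index can be computed
from raw text as in the following sketch. The naive vowel-group
syllable counter here is written for this example only and is not
the paper's Estonian syllable algorithm (that algorithm appears in
Section 4.3):</p>
      <p>
```javascript
// Naive syllable counter for this illustration only: counts groups of
// consecutive vowels as one syllable each.
function countSyllables(word) {
  const vowels = "aeiouõäöü";
  let count = 0;
  let prevWasVowel = false;
  for (const ch of word.toLowerCase()) {
    const isVowel = vowels.indexOf(ch) !== -1;
    if (isVowel) {
      if (!prevWasVowel) count += 1; // a new vowel group starts
    }
    prevWasVowel = isVowel;
  }
  return count;
}

// Formula (1): 0.4 × (words/sentences + 100 × complexwords/words),
// where "complex" words have 3 or more syllables.
function gunningFog(text) {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const complexWords = words.filter(w => countSyllables(w) >= 3);
  return 0.4 * (words.length / sentences.length +
                100 * (complexWords.length / words.length));
}
```
      </p>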
      <p>
        As a baseline, we used n-gram features in pre-testing. Comparing
readability features with n-gram features, we found that with a
wider age gap between the young and adult groups, readability
features make better classifiers on short texts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We now continue this work with a larger dataset and with
readability features only.
      </p>
      <p>Using the best fitting model, we created an online prototype age
detector.</p>
      <p>Section 2 of this paper surveys the literature on age prediction. In
Section 3 we present our data, features, the machine learning
algorithms used, and validation. In Section 4 we present our
classification results and the prototype application. We conclude
this paper in Section 5 by summarizing and discussing our study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORKS</title>
      <p>
        In this section we review related work on age- and other
author-specific profiling. There are no studies that deal
particularly with the effect of text size in the context of age
detection. In the previous section we mentioned that, according to
the literature, 5,000 to 10,000 words per author are needed for
authorship profiling [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]. Luyckx and
Daelemans [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] reported a dramatic decrease in text categorization performance
when reducing the number of words per text fragment to 100. As
authorship profiling and author age prediction are not the same
task, we focus on works that deal particularly with user age.
      </p>
      <p>
        The best-known age-based classification results are reported by
Jenny Tam and Craig H. Martell [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. They used the age groups 13-19,
20-29, 30-39, 40-49, and 50-59. The age groups were all of different
sizes. Word and character n-grams were used as features.
Additionally, they used emoticons, the number of capital letters,
and the number of tokens per post as features. An SVM model trained
on the youngest age group against all others yielded an f-score of
0.996. This result seems all the more remarkable because no age gap
between the two classes was used.
      </p>
      <p>However, we have to address some limitations of their work that
might explain the high f-scores. Namely, they used an unbalanced
dataset (465 versus 1263 in the training set and 116 versus 316 in
the test set). Unfortunately, their report gave only a single
f-score value, but no confusion matrices, ROC, or Kappa statistics.
We argue that with unbalanced datasets, a single f-score value is
not sufficient to characterize a model's accuracy. In such a test
set (116 teenagers versus 316 adults), an f-score of 0.85 (or 0.42,
depending on what is considered the positive result) can be achieved
simply by a model that always classifies all cases as adults. Also,
it is not clear whether the reported f-score is the weighted average
of the two classes' f-scores or presents only one class's f-score.
Secondly, it is not clear whether the given f-score was the result
of averaging cross-validation results.</p>
      <p>
        It is worth mentioning that Jane Lin [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used the same dataset
two years earlier in her postgraduate thesis, supervised by Craig
Martell, and she achieved more modest results. Her best average
f-score in teen-versus-adult classification with an SVM model was
0.786, compared to Tam and Martell's reported 0.996. But besides
averaged f-scores, Jane Lin also reported the lowest and highest
f-scores, and some of her highest f-scores were indeed 0.996, as
reported in the Tam and Martell paper.
      </p>
      <p>
        Peersman et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] used a large sample of 10,000 per class and
extracted up to 50,000 features based on word and character
n-grams. The report states that the posts used averaged 12.2 tokens.
Unfortunately, it is not clear whether they combined several short
posts from the same author or used each single short message as a
unique instance in feature extraction. They tested three datasets
with different age groups: 11-15 versus 16+, 11-15 versus 18+, and
11-15 versus 25+. They also experimented with the number of features
and the training set sizes. The best SVM model, with the largest age
gap, largest dataset, and largest number of features, yielded an
f-score of 0.88.
      </p>
      <p>
        Santosh et al. [
        <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
        ] used word n-grams as content-based features and POS n-grams as
style-based features. They tested three age groups: 13-17, 23-27,
and 33-47. Using SVM and kNN models, the best classifiers achieved
66% accuracy.
      </p>
      <p>
        Marquart [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] tested five age groups: 18-24, 25-34, 35-49, 50-64,
and 65+. The dataset used was unbalanced and not stratified. He also
used some of the same text readability features as we did in the
current study. Besides readability features, he used word n-grams,
HTML tags, and emoticons. Additionally, he used different tools for
feature extraction, such as a psycholinguistic database, a sentiment
strength tool, a linguistic inquiry word count tool, and a spelling
and grammatical error checker. Combining all these features, his
model yielded a modest accuracy of 48.3%.
      </p>
      <p>
        Dong Nguyen and Carolyn P. Rose [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] used linear regression to
predict author age. They used a large dataset of 17,947 authors with
an average text length of 11,101 words. As features they used word
unigrams and POS unigrams and bigrams; the text was tagged using the
Stanford POS tagger. Additionally, they used the linguistic inquiry
word count tool to extract features. Their best regression model had
an r2 value of 0.551 with a mean absolute error of 6.7.
      </p>
      <p>As we can see, most previous studies use similar features: word
and character n-grams. Additionally, special techniques such as POS
tagging, spell checkers, and the linguistic inquiry word count tool
were used to categorize words. While the text features extracted by
these tools are important, they are costly to implement in real-life
online systems. Similarly, large feature sets of up to 50,000
features, most of which are word n-grams, mean megabytes of data.
Ideally, this kind of detector should work using client browser
resources (JavaScript), and all feature extraction routines and
models should be as small as possible.
      </p>
      <p>Summarizing previous work in the following Table (1), we do not
list all possible features. For example, features generated using
POS tagging or generated from word databases are all listed here as
word n-grams. The last column gives the f-score or the accuracy
(with %), according to which characteristic was given in the paper.
Most papers reported many different results, and we list in this
summary table only the best result.</p>
      <p>different age. All texts in the collections were written in the
same language (Estonian). We chose balanced and stratified datasets
with 500 records and with different 4-year age gaps.</p>
    </sec>
    <sec id="sec-3">
      <title>3.2 Features</title>
      <p>In the current study, our training dataset consisted of different
readability features of a text. Readability features are
quantitative data about a text, for instance the average number of
characters per word, syllables per word, words per sentence, and
commas per sentence, and the relative frequencies of words with 1,
2, ..., n syllables. Altogether, 14 different features were
extracted from each text, plus the classification variable (the age
class to which the text's author belongs).</p>
      <p>All features are numeric, and the values were normalized using
other quantitative characteristics of the text. The feature set
used, with explanations, is presented in Table 2.</p>
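      <p>As a hedged sketch (the full 14-feature set is listed in Table 2,
and the feature names below are illustrative, not the paper's), a few
such normalized features can be computed like this:</p>
      <p>
```javascript
// Compute a few illustrative normalized readability features from a text.
function extractReadabilityFeatures(text) {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const letterCount = words.join("").length;        // characters, spaces removed
  const commaCount = (text.match(/,/g) || []).length;
  return {
    avgCharsPerWord: letterCount / words.length,
    avgWordsPerSentence: words.length / sentences.length,
    avgCommasPerSentence: commaCount / sentences.length,
  };
}
```
      </p>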
    </sec>
    <sec id="sec-4">
      <title>3.3 Data Preprocessing</title>
      <p>We stored all the digitized texts on a local machine as a separate
file for each example. A local program was created to extract all 14
previously listed features from each text file. It opened the files
one by one, extracted the features from each file, and stored the
values in a row of a comma-separated file. At the end of every row
it stored the age group. A new and simpler algorithm was created for
syllable counting. Other analogous algorithms for the Estonian
language aim at the exact division of a word into syllables, but in
our case we are only interested in the exact number of syllables. As
it turns out, syllable counting is possible without knowing exactly
where one syllable begins or ends.</p>
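      <p>The storage step above can be sketched as follows (a minimal
illustration; the helper name and the four-decimal formatting are
assumptions, not taken from the paper's program):</p>
      <p>
```javascript
// Build one comma-separated row: feature values, then the age-group label.
function toCsvRow(featureValues, ageGroup) {
  return featureValues.map(v => v.toFixed(4)).join(",") + "," + ageGroup;
}
```
      </p>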
      <p>In order to illustrate our new syllable counting algorithm, we give
some examples of syllables and the related rules in the Estonian
language. For instance, the word rebane (fox) has 3 syllables: re -
ba - ne. In cases like this we can apply one general rule: when a
single consonant is between vowels, a new syllable begins with that
consonant.</p>
      <p>When two or more consecutive consonants occur in the middle of a
word, the next syllable usually begins with the last of those
consonants. For instance, the word kärbes (fly) is split as
kär-bes, and kärbsed (flies) is split as kärb-sed. The problem is
that this and the previous rule do not apply to compound words. So,
for example, the word demokraatia (democracy) is split before two
consecutive consonants, as de-mo-kraa-tia.</p>
      <p>Our syllable counting algorithm deals with this problem by
ignoring all consecutive consonants. We set the syllable counter to
zero and start comparing consecutive pairs of characters in the
word: the first and second characters, then the second and third,
and so on. The general rule is that we count a new syllable when the
tested pair of characters is a vowel followed by a consonant. The
exception to this rule is the last character: when the last
character is a vowel, one more syllable is counted.</p>
      <p>The implemented syllable counting algorithm, as well as the other
automatic feature extraction procedures, can be seen in Section 4.3
and in the source code of the prototype application.</p>
    </sec>
    <sec id="sec-5">
      <title>3.4 Machine Learning Algorithms and Tools</title>
      <p>For classification we tested six popular machine-learning
algorithms:</p>
      <list list-type="bullet">
        <list-item><p>Logistic regression</p></list-item>
        <list-item><p>Support Vector Machine</p></list-item>
        <list-item><p>C4.5</p></list-item>
        <list-item><p>k-nearest neighbor classifier</p></list-item>
        <list-item><p>Naive Bayes</p></list-item>
        <list-item><p>AdaBoost</p></list-item>
      </list>
      <p>
        The choice of these algorithms is based on the literature
[
        <xref ref-type="bibr" rid="ref14 ref15">14,15</xref>
        ], as well as on their suitability for the given data types and
for the given binary classification task. The last algorithm in the
list, AdaBoost, is not a classification algorithm itself but an
ensemble algorithm intended for use with other classification
algorithms, in order to make a weak classifier stronger. For our
task we used the Java implementations of the listed algorithms that
are available in the free data analysis package Weka [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.5 Validation</title>
      <p>For evaluation we used 10-fold cross-validation on all models. This
means we partitioned our data into 10 equally sized random parts,
then used one part for validation and the other 9 as the training
dataset. We repeated this 10 times and averaged the validation
results.</p>
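      <p>The splitting step above can be sketched as follows (a minimal
illustration working on data indices only; the function name is an
assumption, not from the study's Weka setup):</p>
      <p>
```javascript
// Shuffle n data indices and partition them into k folds; each fold
// serves once as the validation set, the other k-1 folds as training.
function kFoldSplits(n, k) {
  const idx = Array.from({ length: n }, (_, i) => i);
  for (let i = n - 1; i > 0; i--) {            // Fisher-Yates shuffle
    const j = Math.floor(Math.random() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  const folds = Array.from({ length: k }, () => []);
  idx.forEach((v, pos) => folds[pos % k].push(v));
  return folds.map((fold, f) => ({
    validation: fold,
    training: folds.filter((_, g) => g !== f).flat(),
  }));
}
```
      </p>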
    </sec>
    <sec id="sec-7">
      <title>3.6 Calculation of final f-scores</title>
      <p>Our classification results are given as weighted average f-scores.
The f-score is the harmonic mean of precision and recall. Here is an
example of how it is calculated. Suppose we have a dataset of 100
teenagers and 100 adults, and our model classifies the cases as in
the following Table 3. When classifying teenagers, we have 88 true
positives (teenagers classified as teenagers) and 30 false positives
(adults classified as teenagers). We also have 12 false negatives
(teenagers classified as not teenagers) and 70 true negatives
(adults classified as not teenagers). In the following calculations
we use the abbreviations: TP = true positive; FP = false positive;
TN = true negative; FN = false negative.</p>
      <p>The positive predictive value, or precision, for the teenagers'
class is calculated by formula (2); recall and the f-score are
calculated by formulas (3) and (4):
precision = TP / (TP + FP)  (2)
recall = TP / (TP + FN)  (3)
f-score = 2 × precision × recall / (precision + recall)  (4)</p>
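      <p>As a check on the worked example above (TP = 88, FP = 30, FN = 12
for the teenagers' class), a small sketch computing precision,
recall, and their harmonic mean:</p>
      <p>
```javascript
function precision(tp, fp) { return tp / (tp + fp); }
function recall(tp, fn) { return tp / (tp + fn); }
function fScore(p, r) { return 2 * p * r / (p + r); }

const p = precision(88, 30);  // 88 / 118 ≈ 0.746
const r = recall(88, 12);     // 88 / 100 = 0.88
const f = fScore(p, r);       // ≈ 0.807 for the teenagers' class
```
      </p>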
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
    </sec>
    <sec id="sec-9">
      <title>4.1 Classification</title>
      <p>The classification effect was related to the placement of the age
separation gap in our training datasets. We generated 8 different
datasets by placing a 4-year separation gap in eight different
places. We generated models for all datasets, and present the best
models' f-scores in Figure 1. As we can see, our classification was
most effective when the age separation gap was placed at 16-19
years.</p>
      <p>[Figure 1. Best models' f-scores (y-axis, 0.83 to 0.95) for each
age separation gap (x-axis): 12-15, 13-16, 14-17, 15-18, 16-19,
17-20, 18-21, 19-22.]</p>
      <p>With the best separation gap (16-19) between classes, the Logistic
Regression model classified 93.12% of cases correctly, and the model
generated by Support Vector Machines classified 91.74% of cases.
Using the AdaBoost algorithm combined with the classifier generated
by Support Vector Machines yielded 94.03% correct classification and
an f-score of 0.94. Classification models built by the other
algorithms performed less effectively, as we can see in Table 4.</p>
      <p>The results in the following table are divided into two blocks. On
the left side are the results of the models generated by the listed
algorithms. On the right side are the results of the models
generated by the AdaBoost algorithm combined with the algorithm
listed in the row.</p>
    </sec>
    <sec id="sec-10">
      <title>4.2 Features with highest impact</title>
      <p>As there is a relatively small set of readability features, we did
not use any special feature selection techniques before generating
the models, and we evaluated the features on the basis of the SVM
model with standardized data. The strongest indicator of age is the
average number of words per sentence: older people tend to write
longer sentences. They also use longer words, so the average number
of characters per word is in second place in the feature ranking.
The best predictors of the younger age group are the frequent use of
short words with one or two syllables.</p>
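      <p>This kind of ranking can be sketched as follows (an illustration
only: the feature names and coefficient values below are
hypothetical, not the values from the standardized SVM model):</p>
      <p>
```javascript
// Rank features by the absolute value of their model coefficients.
function rankByCoefficient(coefficients) {
  return Object.entries(coefficients)
    .sort((a, b) => Math.abs(b[1]) - Math.abs(a[1]))
    .map(entry => entry[0]);
}

const ranking = rankByCoefficient({
  avgWordsPerSentence: 1.7,   // hypothetical coefficient
  avgCharsPerWord: -1.2,      // hypothetical coefficient
  oneSyllableWordFreq: 0.4,   // hypothetical coefficient
});
```
      </p>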
      <p>The coefficients of the standardized SVM model are presented in the
following Table (5).</p>
    </sec>
    <sec id="sec-11">
      <title>4.3 Prototype Application</title>
      <p>As the difference in performance between the models generated by
AdaBoost with SVM and by Logistic Regression is not significant, and
since, from the implementation point of view, models without
AdaBoost are simpler, we decided to implement in our prototype
application the Logistic Regression model, which performed best
without AdaBoost.1 We implemented the feature extraction routines
and the classification function in client-side JavaScript. Our
prototype application takes written natural language text as input,
extracts features in exactly the same way we extracted features for
our training dataset, and predicts the author's age class
(Fig. 2.).</p>
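      <p>The classification step can be sketched as a logistic regression
score over the extracted feature vector (a hedged illustration: the
function name, weights, and threshold below are assumptions, not the
trained model's values):</p>
      <p>
```javascript
// Score a feature vector with logistic regression weights and a bias,
// then map the probability to one of the two age classes.
function predictAgeClass(features, weights, bias) {
  let z = bias;
  features.forEach((x, i) => { z += weights[i] * x; });
  const probabilityAdult = 1 / (1 + Math.exp(-z));
  return probabilityAdult >= 0.5 ? "adult" : "teen";
}
```
      </p>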
      <p>In the first stage, the text input is split into sentences and
words, and all excess whitespace characters are removed. Some simple
features (the number of characters, words, and sentences) are also
calculated at this stage.</p>
      <p>In the second stage, syllables in words are counted. All calculated
characteristics are then normalized using other characteristics of
the same text; for example, the number of characters in the text is
divided by the number of words in the text.</p>
      <p>1 http://www.tlu.ee/~pentel/age_detector/</p>
      <p>A new and simpler algorithm (5) was created for syllable counting.
Other analogous algorithms for the Estonian language aim at the
exact division of a word into syllables, but in our case we are only
interested in the exact number of syllables. As it turns out,
syllable counting is possible without knowing exactly where one
syllable begins or ends. Unfortunately, this holds only for Estonian
(and perhaps for some other similar languages).</p>
      <p>function number_of_syllables(w){                      (5)
  v="aeiouõäöü";  /* all vowels in Estonian lang. */
  counter=0;
  w=w.split('');  /* creates char array of word */
  wl=w.length;    /* number of chars in word */
  for(i=0; i &lt; wl - 1; i++){
    /* if char is vowel and next char is not, then count a
       syllable (there are some exceptions to this rule, which
       are easy to program) */
    if(v.indexOf(w[i])!=-1 &amp;&amp; v.indexOf(w[i+1])==-1)
      counter++;
  }
  /* if last char in the word is vowel, count one more syllable */
  if( v.indexOf(w[wl-1]) != -1) counter++;
  return counter;
}</p>
      <p>The implemented syllable counting algorithm, as well as the other
automatic feature extraction procedures, can be seen in the source
code of the prototype application.2</p>
      <p>Finally, we created a simple web interface where anybody can test
the prediction with free input or by copy-paste. As our classifier
was trained on the Estonian language, sample Estonian texts are
provided on the website for both age groups (Fig. 4.).</p>
      <p>[Figure 4. Prototype web interface: sample texts for both age
groups and a free input form.]</p>
    </sec>
    <sec id="sec-12">
      <title>5. DISCUSSION &amp; CONCLUSIONS</title>
      <p>Automatic user age detection is a task of growing importance in
cyber-safety and criminal investigations. One of the user profiling
problems here is the amount of text needed to perform a reliable
prediction. Usually, large training datasets are used to build such
classification models, and longer texts are needed to make
assumptions about an author's age. In this paper we tested a novel
set of features for age-based classification of very short texts.
The features used, formerly known as text readability features and
employed by different readability formulas such as Gunning Fog,
proved to be suitable for an automatic age detection procedure.
Comparing different classification algorithms, we found that
Logistic Regression and Support Vector Machines created the best
models with our data and features, both giving over 90%
classification accuracy.</p>
      <p>While this study has generated encouraging results, it has some
limitations. As different readability indexes measure how many years
of education are needed to understand a text, we cannot assume that
people's reading, or in our case writing, skills will continuously
improve throughout their whole life. For most people, the writing
skill level developed in high school will not improve further, and
therefore it is impossible to discriminate between 25- and
30-year-olds using only those features, as we did in the current
study. But these readability features might still be very useful in
discriminating between younger age groups, for instance 7-9, 10-11,
and 12-13. Another possible use of a similar approach is to predict
the education level of an adult author.</p>
      <p>In order to increase the reliability of the results, future studies
should also include a larger sample. The value of our work is to
demonstrate the suitability of a simple feature set for age-based
classification of short texts, and we anticipate a more systematic
and in-depth study in the near future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>All the way through: testing for authorship in different frequency strata</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          .
          <volume>22</volume>
          ,
          <issue>1</issue>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>47</lpage>
          . Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Guenter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation</article-title>
          .
          <source>EMNLP'06. Association for Computational Linguistics</source>
          . pp.
          <fpage>482</fpage>
          -
          <lpage>491</lpage>
          . Stroudsburg, PA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          et al.
          <year>2010</year>
          .
          <article-title>Classifying latent user attributes in twitter</article-title>
          ,
          <source>SMUC '10 Proceedings of the 2nd international workshop on Search</source>
          and
          <article-title>mining user-generated contents</article-title>
          . pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Gunning</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>1952</year>
          .
          <article-title>The Technique of Clear Writing</article-title>
          . New York: McGraw-Hill
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Pentel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>A Comparison of Different Feature Sets for Age-Based Classification of Short Texts</article-title>
          .
          <source>Technical report</source>
          . Tallinn University, Estonia. www.tlu.ee/~pentel/age_detector/Pentel_AgeDetection2b.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Luyckx</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>The Effect of Author Set Size and Data Size in Authorship Attribution</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          , Vol-
          <volume>26</volume>
          ,
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Tam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martell</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          <year>2009</year>
          . Age Detection in Chat. International Conference on Semantic Computing.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Automatic Author profiling of online chat logs</article-title>
          .
          <source>Postgraduate Thesis</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Peersman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          et al.
          <year>2011</year>
          .
          <article-title>Predicting Age and Gender in Online Social Networks</article-title>
          .
          <source>SMUC '11 Proceedings of the 3rd international workshop on Search and mining user-generated contents</source>
          , pp
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          , ACM New York, USA.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Santosh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          et al.
          <year>2013</year>
          .
          <article-title>Author Profiling: Predicting Age and Gender from Blogs</article-title>
          .
          <source>CEUR Workshop Proceedings, Vol1179.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Santosh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          et al.
          <year>2014</year>
          .
          <article-title>Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors</article-title>
          .
          <source>UMAP Workshops</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Marquart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.
          <year>2014</year>
          .
          <article-title>Age and Gender Identification in Social Media</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , Vol-
          <volume>1180</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          et al.
          <year>2011</year>
          .
          <article-title>Age Prediction from Text using Linear Regression</article-title>
          .
          <source>LaTeCH '11 Proceedings of the 5th ACLHLT Workshop on Language Technology for Cultural Heritage</source>
          ,
          <source>Social Sciences, and Humanities</source>
          . pp
          <fpage>115</fpage>
          -
          <lpage>123</lpage>
          , Association for Computational Linguistics Stroudsburg, PA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          et al.
          <year>2008</year>
          .
          <article-title>Top 10 algorithms in data mining</article-title>
          .
          <source>Knowledge and Information Systems</source>
          . vol
          <volume>14</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Mihaescu</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Applied Intelligent Data Analysis: Algorithms for Information Retrieval and Educational Data Mining</article-title>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>111</lpage>
          . Zip publishing, Columbus, Ohio.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Weka</surname>
          </string-name>
          .
          <article-title>Weka 3: Data Mining Software in Java</article-title>
          . Machine Learning Group at the University of Waikato. http://www.cs.waikato.ac.nz/ml/weka/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>