=Paper= {{Paper |id=Vol-1446/smlir_submission2 |storemode=property |title=Automatic Age Detection Using Text Readability Features |pdfUrl=https://ceur-ws.org/Vol-1446/smlir_submission2.pdf |volume=Vol-1446 |dblpUrl=https://dblp.org/rec/conf/edm/Pentel15 }} ==Automatic Age Detection Using Text Readability Features== https://ceur-ws.org/Vol-1446/smlir_submission2.pdf
 Automatic Age Detection Using Text Readability Features
                                                              Avar Pentel
                                                   Tallinn University,Tallinn, Estonia
                                                           +372 51 907 739
                                                            pentel@tlu.ee

ABSTRACT
In this paper, we present the results of automatic age detection based on very short texts of about 100 words per author. Instead of the widely used n-grams, only text readability features are used in the current study. The training datasets represented two age groups: children and teens up to age 16, and adults 20 years and older. Logistic Regression, Support Vector Machines, C4.5, k-Nearest Neighbor, Naïve Bayes, and Adaboost algorithms were used to build models. Altogether ten different models were evaluated and compared. The model generated by Support Vector Machine with Adaboost yielded an f-score of 0.94, and Logistic Regression 0.93. A prototype age detection application was built using the best model.

Keywords
Automatic age detection, readability features, logistic regression, support vector machines, Weka.

1. INTRODUCTION
One important class of information in user modeling is related to user age. Any adaptive technology can use age prediction data. In an educational context, automatic tutoring systems and recommendation systems can benefit from age detection.

Automatic age detection is also useful in crime prevention. With the spread of social media, people can register accounts with false age information about themselves. Younger people might pretend to be older in order to get access to sites that are otherwise restricted to them. At the same time, older people might pretend to be younger in order to communicate with youngsters. As we can imagine, this kind of false information might lead to serious threats, for instance pedophilia or other criminal activities.

Besides serious crime prevention, automatic age detection can be used by educators as an indirect plagiarism detector. While there are effective plagiarism detection systems, they do not work when parents do pupils' homework or when students use somebody else's original work that is not published anywhere. There are closed communities where students can buy homework on any topic.

Full-scale authorship profiling is not an option here, because a large amount of text per author is needed. Some authors [1] argue that at least 10000 words per author are needed, others that 5000 are [2]. But if we think about the business purpose of this kind of age detector, especially when the purpose is to prevent criminal acts, then there is no time to collect a large amount of text written by a particular user.

When automatic age detection studies follow authorship profiling conventions, they run into a second problem: the features widely used in authorship profiling are semantic features. The probability that some sequence of words, or even a single word, occurs in a short text is too low, and a particular word characterizes the context [3] better than the author. Some authors use character n-gram frequencies to profile users, but again, if we speak about texts that are only about 100 words long, these features can also be very context dependent.

Semantic features are related to a third problem: they are costly. Using part-of-speech tagging systems to categorize words, and/or large feature sets for pattern matching, takes time and space. If our goal is to perform age detection fast and online, then it is better to have a few features that can be extracted instantly on the client side.

In order to avoid all three previously mentioned shortcomings, we propose another set of features. We call them readability features, because they have previously been used to evaluate the readability of texts. Readability indexes were developed before computerized text processing. For example, the Gunning Fog index [4] takes into account complex (or difficult) words, those containing 3 or more syllables, and the average number of words per sentence. If sentences are too long and there are many difficult words, the text is considered not easy to read, and more education is needed to understand this kind of text. The Gunning Fog index is calculated with formula (1) below:

\[ \mathrm{GunningFogIndex} = 0.4 \times \left( \frac{\mathrm{words}}{\mathrm{sentences}} + 100 \times \frac{\mathrm{complexwords}}{\mathrm{words}} \right) \tag{1} \]

We suppose that an author's reading skills and writing skills are correlated, and that by analyzing the readability of an author's text we can infer his/her education level, which at least up to a particular age is correlated with the actual age of the author. As readability indexes work reliably on texts of about 100 words, these are good candidates for our task with short texts.

As a baseline we used n-gram features in pre-testing. Comparing readability features with n-gram features, we found that with a wider age gap between the young and adult groups, readability features make better classifiers when short texts are used [5]. We now continue this work with a larger dataset and with readability features only.

Using the best fitting model, we created an online prototype age detector.

Section 2 of this paper surveys the literature on age prediction. In Section 3 we present our data, features, the machine learning algorithms used, and validation. In Section 4 we present our classification results and the prototype application. We conclude this paper in Section 5 by summarizing and discussing our study.

2. RELATED WORKS
In this section we review related works on age-specific and other author-specific profiling. There are no studies that deal particularly with the effect of text size in the context of age detection. In the previous section we mentioned that, according to the literature, 5000 to 10000 words per author are needed for authorship profiling [1,2]. Luyckx and
Daelemans [6] reported a dramatic decrease in text categorization performance when the number of words per text fragment was reduced to 100. As authorship profiling and author age prediction are not the same task, we focus on works that deal particularly with user age.

The best-known age-based classification results are reported by Jenny Tam and Craig H. Martell [7]. They used the age groups 13-19, 20-29, 30-39, 40-49 and 50-59. The age groups were all of different sizes. Word and character n-grams were used as features. Additionally they used emoticons, the number of capital letters and the number of tokens per post as features. An SVM model trained on the youngest age group against all others yielded an f-score of 0.996. This result seems all the more remarkable because no age gap between the two classes was used.

However, we have to address some limitations of their work that might explain the high f-scores. Namely, they used an unbalanced data set (465 versus 1263 in the training data set and 116 versus 316 in the test set). Unfortunately their report gave only one f-score value, but no confusion matrices, ROC or Kappa statistics. We argue that with unbalanced data sets, a single f-score value is not sufficient to characterize a model's accuracy. In such a test set, with 116 teenagers versus 316 adults, an f-score of 0.85 (or 0.42, depending on what is considered a positive result) will be achieved simply by a model that always classifies all cases as adults. Also, it is not clear whether the reported f-score is a weighted average of the two classes' f-scores or represents only one class's f-score. Secondly, it is not clear whether the given f-score was the result of averaging cross-validation results.

It is worth mentioning that Jane Lin [8] used the same dataset two years earlier in her postgraduate thesis, supervised by Craig Martell, and she achieved more modest results. Her best average f-score in teens versus adults classification with an SVM model was 0.786, as compared to the 0.996 reported by Tam and Martell. But besides averaged f-scores, Jane Lin also reported lowest and highest f-scores, and some of her highest f-scores were indeed 0.996, as reported in the Tam and Martell paper.

Peersman et al. [9] used a large sample of 10,000 per class and extracted up to 50,000 features based on word and character n-grams. The report states that the posts they used averaged 12.2 tokens. Unfortunately it is not clear whether they combined several short posts from the same author, or used a single short message as a unique instance in feature extraction. They tested three datasets with different age groups: 11-15 versus 16+, 11-15 versus 18+ and 11-15 versus 25+. They also experimented with the number of features and with training set sizes. The best SVM model, with the largest age gap, largest dataset and largest number of features, yielded an f-score of 0.88.

Santosh et al. [10,11] used word n-grams as content-based features and POS n-grams as style-based features. They tested three age groups: 13-17, 23-27, and 33-47. Using SVM and kNN models, the best classifiers achieved 66% accuracy.

Marquardt [12] tested five age groups: 18-24, 25-34, 35-49, 50-64, and 65-xx. The dataset used was unbalanced and not stratified. He also used some of the text readability features that we use in the current study. Besides readability features, he used word n-grams, HTML tags, and emoticons. Additionally he used different tools for feature extraction, such as a psycholinguistic database, a sentiment strength tool, a linguistic inquiry word count tool, and a spelling and grammatical error checker. Combining all these features, his model yielded a modest accuracy of 48.3%.

Dong Nguyen and Carolyn P. Rose [13] used linear regression to predict author age. They used a large dataset of 17947 authors with an average text length of 11101 words. As features they used word unigrams and POS unigrams and bigrams. Text was tagged using the Stanford POS tagger. Additionally they used a linguistic inquiry word count tool to extract features. Their best regression model had an r² value of 0.551 with a mean absolute error of 6.7.

As we can see, most previous studies use similar features: word and character n-grams. Additionally, special techniques such as POS tagging, spell checking, and a linguistic inquiry word count tool were used to categorize words. While the text features extracted by these tools are important, they are costly to implement in real-life online systems. Similarly, large feature sets of up to 50,000 features, most of which are word n-grams, mean megabytes of data. Ideally this kind of detector could work using client browser resources (JavaScript), so all feature extraction routines and models have to be as small as possible.

In summarizing previous work in Table 1, we do not list all possible features. For example, features that are generated using POS tagging, or features generated using some word databases, are all listed here as word n-grams. The last column gives the f-score or the accuracy (with %), according to which measure was given in the paper. Most papers reported many different results, and we list in this summary table only the best result.

Table 1. Summary of previous work
(The first four columns after the authors are the feature types used, marked with x.)

Authors               Readability  Word n-grams  Char n-grams  Emoticons  Training dataset size  Avg. words per author  Separation gap (years)  Result: f-score or accuracy (%)
Nguyen (2011)                      x                                      17947*                 11101                  0                       55.1%
Marquardt (2014)      x            x                           x          7746                   N/a                    0                       47.3%
Peersman (2011)                    x             x                        20000                  12.2**                 9                       0.917
Lin (2007)                         x                           x          1728*                  343                    0                       0.786
Tam & Martell (2009)               x             x             x          1728*                  343                    0                       0.996***
Santosh (2014)                     x                                      236600*                335                    5                       66%
This Study            x                                                   500                    93                     4                       0.94

*unbalanced datasets
**12.2 words was the reported average message length, but it is not clear if only one message per user was used or if user text was composed from many messages.
***not enough data about this result

3. METHODOLOGY

3.1 Sample & Data
We collected short written texts, on average 93 words long, from different social media sources such as Facebook, blog comments, and Internet forums. Additionally we used short essay answers from school online feedback systems and e-learning systems, and e-mails. No topic-specific categorization was made. All authors were identified and their ages fell between 9 and 46 years. Most authors in our dataset were unique; we used multiple texts from the same author only when the texts were written at
different ages. All texts in the collections were written in the same language (Estonian). We chose balanced and stratified datasets with 500 records and with different 4-year age gaps.

3.2 Features
In the current study we used different readability features of a text in our training dataset. Readability features are quantitative data about texts, for instance the average number of characters in a word, syllables in a word, words in a sentence, commas in a sentence, and the relative frequency of words with 1, 2, ..., n syllables. Altogether 14 different features were extracted from each text, plus the classification variable (the age class to which the text's author belongs).

In all features we used only numeric data and normalized the values using other quantitative characteristics of the text.

The feature set used is presented with explanations in Table 2:

Table 2. Used features with calculation formulas and explanations

Feature: Average number of Characters in Word
    = NumberOfCharactersInText / NumberOfWordsInText
    Explanation: We excluded all white space characters when counting the number of all characters in the text.

Feature: Average number of Words in Sentence
    = NumberOfWordsInText / NumberOfSentencesInText

Feature: Complex Words to all Words ratio
    = NumberOfComplexWordsInText / NumberOfWordsInText
    Explanation: The notion of a complex word is borrowed from the Gunning Fog index, where it means words with 3 or more syllables. As the Gunning Fog index was designed for English, and Estonian has on average more syllables per word, we raised the number of syllables according to this difference to five. Additionally, we count a word as complex if it has 13 or more characters.

Feature: Average number of Complex Words in Sentence
    = NumberOfComplexWordsInText / NumberOfSentencesInText

Feature: Average number of Syllables per Word
    = NumberOfSyllablesInText / NumberOfWordsInText

Feature: Average number of Commas per Sentence
    = NumberOfCommasInText / NumberOfSentencesInText

Feature: One Syllable Words to all Words ratio
    = NumberOfWordsWith1SyllableInText / NumberOfWordsInText

Feature: As with the previous feature, we extracted 7 features for words containing 2, 3, 4, and so on up to 8 or more syllables.
    = NumberOfWordsWithNSyllablesInText / NumberOfWordsInText
    Explanation: A novel syllable counting algorithm was designed for Estonian; it is only a few lines long and does not include any word matching techniques.

3.3 Data Preprocessing
We stored all the digitized texts on the local machine as separate files, one for each example. A local program was created to extract all 14 previously listed features from each text file. It opened the files one by one, extracted the features from each file, and stored these values in a row of a comma-separated file. At the end of every row it stored data about the age group. A new and simpler algorithm was created for syllable counting. Other analogous algorithms for Estonian are intended for exact division of a word into syllables, but in our case we are only interested in the exact number of syllables. As it turns out, syllable counting is possible without knowing exactly where one syllable begins or ends.

In order to illustrate our new syllable counting algorithm, we give some examples of syllables and related rules in Estonian. For instance, the word rebane (fox) has 3 syllables: re-ba-ne. In cases like this we can apply one general rule: when a single consonant is between vowels, a new syllable begins with that consonant.

When two or more consecutive consonants occur in the middle of a word, the next syllable usually begins with the last of those consonants. For instance, the word kärbes (fly) is split as kär-bes, and kärbsed (flies) is split as kärb-sed. The problem is that this and the previous rule do not apply to compound words. For example, the word demokraatia (democracy) is split before two consecutive consonants, as de-mo-kraa-tia.

Our syllable counting algorithm deals with this problem by ignoring all consecutive consonants. We set the syllable counter to zero and start comparing pairs of consecutive characters in the word: the first and second character, then the second and third, and so on. The general rule is that we count a new syllable when the tested pair of characters is a vowel followed by a consonant. The exception to this rule is the last character: when the last character is a vowel, one more syllable is counted.

The implemented syllable counting algorithm, as well as the other automatic feature extraction procedures, can be seen in Section 4.3 and in the source code of the prototype application.

3.4 Machine Learning Algorithms and Tools
For classification we tested six popular machine-learning algorithms:

     •    Logistic regression
     •    Support Vector Machine
     •    C4.5
     •    k-nearest neighbor classifier
     •    Naive Bayes
     •    AdaBoost.

The motivation for choosing these algorithms is based on the literature [14,15]. The suitability of the listed algorithms for the given data types and for the given binary classification task was also taken into account. The last algorithm in the list, Adaboost, is actually not a classification algorithm itself but an ensemble algorithm, which is intended for use with other classification algorithms in order to make a weak classifier stronger. In our task we used the Java implementations of the listed algorithms that are available in the freeware data analysis package Weka [16].
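
The syllable counting rule described in Section 3.3 can be sketched in a few lines. The Python sketch below is only an illustration under stated assumptions, not the source code of the prototype: the function name count_syllables and the vowel set ESTONIAN_VOWELS are hypothetical names introduced here, and the handling of non-letter characters is an assumption. The rule itself (count every vowel that is followed by a consonant, plus one if the word ends in a vowel) follows the description in the text.

```python
# Assumed vowel inventory for Estonian; every other letter is treated as a consonant.
ESTONIAN_VOWELS = set("aeiouõäöü")


def count_syllables(word: str) -> int:
    """Count syllables without locating syllable boundaries (sketch of Section 3.3)."""
    # Assumption: ignore case and drop anything that is not a letter.
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return 0

    count = 0
    # Compare consecutive character pairs: (1st, 2nd), (2nd, 3rd), and so on.
    for current, following in zip(letters, letters[1:]):
        # A vowel followed by a consonant closes one syllable nucleus.
        if current in ESTONIAN_VOWELS and following not in ESTONIAN_VOWELS:
            count += 1

    # Exception for the last character: a word-final vowel adds one more syllable.
    if letters[-1] in ESTONIAN_VOWELS:
        count += 1
    return count
```

For the example words of Section 3.3, this sketch returns 3 for rebane, 2 for kärbes and for kärbsed, and 4 for demokraatia, which matches the splits re-ba-ne, kär-bes, kärb-sed and de-mo-kraa-tia.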
3.5 Validation
For evaluation we used 10 fold cross validation on all models. It                  0,95
means that we partitioned our data to 10 even sized and random                     0,93




                                                                         F-score
parts, and then using one part for validation and other 9 as                       0,91
training dataset. We did so 10 times and then averaged validation                  0,89
results.                                                                           0,87
                                                                                   0,85
3.6 Calculation of final f-scores                                                  0,83
Our classification results are given as weighted average f-scores.                    12-15   13-16   14-17     15-18   16-19      17-20     18-21    19-22
F-score is a harmonic mean between precision and recall. Here is                                              Separation gap
given an example how it is calculated. Let suppose we have a
dataset presenting 100 teenagers and 100 adults. And our model
classifies the results as in fallowing Table 3:                                     Figure 1. Effect of the position of separtion gap
      Table 3. Example illustrating calculation of f-scores             With a best separation gap (16-19) between classes, Logistic
            Classified as =>     teenagers         adults               regression model classified 93,12% of cases right, and Support
                                                                        Vector Machines generated model classified 91,74% of cases.
            teenagers                88              12                 Using Adaboost algorithm combined with classifier generated by
            adults                   30              70                 Support Vector Machine yield to 94.03% correct classification
                                                                        and f-score 0.94. Classification models built by other algorithms
                                                                        performed less effectively as we can see in Table 4.
When classifying teenagers, we have 88 true positives (teenagers
                                                                        Results in fallowing table are divided in to two blocks. In the left
classified as teenagers) and 30 false positives (adults classified as
                                                                        side there are the results of the models generated by listed
teenagers). We also have 12 false negatives (teenagers classified
                                                                        algorithms. In the right side there are the results of the models
as not teenagers) and 70 true negatives (adults classified as not
                                                                        generated by Adaboost algorithm and the same algorithm listed in
teenagers). In following calculations we use abbreviations: TP =
                                                                        the row.
true positive; FP = false positive; TN = true negative; FN = false
                                                                                 Table 4. Averaged F-scores of different models
negative.
 Positive predictive value or precision for teenagers’ class is                                                                 F-score
calculated by formula 2.                                                                                                                  Using Adaboost
              TP      88                                      (2)
precision =        =        = 0.746                                     Logistic Regression                      0.93                          0.93
            TP + FP 88 + 30
                                                                        SVM (standardized)                       0.92                          0.94
 Recall or sensitivity is the rate of correctly classified instances
(true positives) to all actual instances in predicted class.            KNN (k = 4)                              0.86                          0.86
Calculation of recall is given by formula 3.
                                                                        Naïve Bayes                              0.79                          0.84
             TP      88                                       (3)
recall =          =        = 0.88                                       C4.5                                     0.75                          0.84
           TP + FN 88 + 12
F-score is the harmonic mean of precision and recall, and it is calculated by formula 4.

$$ f\text{-}score = 2 \times \frac{precision \times recall}{precision + recall} = \frac{2TP}{2TP + FP + FN} \qquad (4) $$

Using the data in our example, the f-score for the teenager class is 0.807, but if we do the same calculations for the adult class, the f-score is 0.769.
Presenting our results, we use a single f-score value, which is the average of both classes' f-score values.

4. RESULTS

4.1 Classification
The classification effect was related to the placement of the age separation gap in our training datasets. We generated 8 different datasets by placing a 4-year separation gap in eight different places. We generated models for all datasets, and present the best models' f-scores in Figure 1. As we can see, our classification was most effective when the age separation gap was placed at 16-19 years.

As we can see in the table above, the best performers were the classifiers generated by the Logistic Regression algorithm and the Support Vector Machine (with standardized data). In the right section of the table, where the effect of the Adaboost algorithm is presented, we can see that Adaboost cannot improve the results of the Logistic Regression and kNN classifiers, but it improves the results of SVM, Naïve Bayes and, most significantly, C4.5. As Adaboost is intended to build strong classifiers out of weak classifiers, the biggest effect on C4.5 is to be expected. The two best performing classifiers remained the same after using Adaboost, but now the Support Vector Machine outperformed Logistic Regression by 0.91 percentage points.

4.2 Features with highest impact
As there is a relatively small set of readability features, we did not use any special feature selection techniques before generating models, and we evaluated the features on the basis of the SVM model with standardized data. The strongest indicator of age is the average number of words in a sentence: older people tend to write longer sentences. They also use longer words, and the average number of characters per word is in second place in the feature ranking. The best predictors of the younger age group are the frequent use of short words with one or two syllables.
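The ranking described above can be reproduced from the coefficients the paper reports in Table 5. The sketch below is illustrative only: it assumes that, because the features were standardized, coefficient magnitudes are directly comparable, and that positive coefficients point toward the adult class (an inference from the statement that older people write longer sentences). The variable names are hypothetical.

```javascript
// Coefficients of the standardized SVM model, copied from Table 5.
const coefficients = [
  { feature: "Words in sentence", weight: 1.3639 },
  { feature: "Characters in word", weight: 0.8399 },
  { feature: "Complex words in sentence", weight: 0.258 },
  { feature: "Ratio of words with 4 syllables", weight: -0.2713 },
  { feature: "Commas per sentence", weight: -0.3894 },
  { feature: "Ratio of words with 1 syllable", weight: -0.7451 },
  { feature: "Ratio of words with 2 syllables", weight: -0.762 },
];

// Rank features by the absolute size of their coefficient.
const ranked = [...coefficients].sort(
  (a, b) => Math.abs(b.weight) - Math.abs(a.weight)
);

// ASSUMPTION: positive weights indicate the older class, negative the younger.
const olderIndicators = ranked.filter(c => c.weight > 0).map(c => c.feature);
const youngerIndicators = ranked.filter(c => c.weight < 0).map(c => c.feature);

console.log(ranked.map(c => c.feature));
console.log({ olderIndicators, youngerIndicators });
```

Under these assumptions the two strongest features overall are the sentence length and word length measures, and the two strongest indicators of the younger group are the ratios of one- and two-syllable words, which matches the description in the text.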
Table 5 presents the coefficients of the standardized SVM model.

    Table 5. Features with highest impact in standardized SVM model
     Coefficient   Feature
       1.3639      Words in sentence
       0.8399      Characters in word
       0.258       Complex words in sentence
      -0.2713      Ratio of words with 4 syllables
      -0.3894      Commas per sentence
      -0.7451      Ratio of words with 1 syllable
      -0.762       Ratio of words with 2 syllables

4.3 Prototype Application
The difference between the performance of the models generated by Adaboost with SVM and by Logistic Regression is not significant, and models without Adaboost are simpler to implement. We therefore decided to implement in our prototype application the Logistic Regression model, which performed best without Adaboost.¹ We implemented the feature extraction routines and the classification function in client-side JavaScript. Our prototype application takes written natural language text as input, extracts features in exactly the same way as we extracted features for our training dataset, and predicts the author's age class (Fig. 2).

                     Figure 2. Application design

Our feature extraction procedure (Figure 3) consists of 3 stages:
       1.   Text input is split into sentences and words, and all excess white space characters are removed. Some simple features (number of characters, number of words, number of sentences) are also calculated in this stage.
       2.   In the second stage, syllables in words are counted.
       3.   All calculated characteristics are normalized using other characteristics of the same text, for example the number of characters in the text divided by the number of words in the text.

                     Figure 3. Feature Extractor

A new and simpler algorithm (5) was created for syllable counting. Other analogous algorithms for the Estonian language are intended for exact division of a word into syllables, but in our case we are only interested in the exact number of syllables. As it turns out, syllable counting is possible without knowing exactly where one syllable begins or ends. Unfortunately, this is true only for Estonian (and maybe some other similar languages).

    function number_of_syllables(w) {                      /* (5) */
        var v = "aeiouõäöü";   /* all vowels in Estonian lang. */
        var counter = 0;
        w = w.split('');       /* creates char array of word */
        var wl = w.length;     /* number of chars in word */
        for (var i = 0; i < wl - 1; i++) {
            /* if char is vowel and next char is not, then count a
               syllable (there are some exceptions to this rule, which
               are easy to program). */
            if (v.indexOf(w[i]) != -1 && v.indexOf(w[i+1]) == -1)
                counter++;
        }
        /* if last char in the word is vowel, count new syllable */
        if (v.indexOf(w[wl-1]) != -1) counter++;
        return counter;
    }
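The three extraction stages described above can be sketched end to end as follows. This is a simplified illustration, not the prototype's actual source: the sentence and word splitting rules, the lower-casing step, and the names `countSyllables` and `extractFeatures` are assumptions, and only a subset of the features is computed. The syllable rule itself follows listing (5), without the exceptions the paper mentions.

```javascript
const VOWELS = "aeiouõäöü"; // Estonian vowels, as in listing (5)

// Stage 2 helper: a vowel followed by a non-vowel, or a vowel at the end
// of the word, closes one syllable (same rule as listing (5)).
function countSyllables(word) {
  const chars = Array.from(word.toLowerCase()); // lower-casing is an assumption
  let count = 0;
  for (let i = 0; i < chars.length; i++) {
    const isVowel = VOWELS.includes(chars[i]);
    const nextIsVowel = i + 1 < chars.length && VOWELS.includes(chars[i + 1]);
    if (isVowel && !nextIsVowel) count++;
  }
  return count;
}

function extractFeatures(text) {
  // Stage 1: remove excess white space, split into sentences and words.
  const clean = text.replace(/\s+/g, " ").trim();
  const sentences = clean.split(/[.!?]+/).map(s => s.trim()).filter(s => s.length > 0);
  const words = clean.split(" ")
    .map(w => w.replace(/[^\p{L}]/gu, "")) // keep letters only
    .filter(w => w.length > 0);
  if (sentences.length === 0 || words.length === 0) return null;
  const commas = (clean.match(/,/g) || []).length;
  const characters = words.reduce((sum, w) => sum + Array.from(w).length, 0);

  // Stage 2: count syllables in every word.
  const syllables = words.map(countSyllables);

  // Stage 3: normalize raw counts by other counts of the same text.
  const ratioOfWordsWith = n => syllables.filter(s => s === n).length / words.length;
  return {
    wordsPerSentence: words.length / sentences.length,
    charactersPerWord: characters / words.length,
    commasPerSentence: commas / sentences.length,
    oneSyllableRatio: ratioOfWordsWith(1),
    twoSyllableRatio: ratioOfWordsWith(2),
  };
}

console.log(extractFeatures("Tere, laps. Kool on tore!"));
```

Because every feature is a ratio of counts from the same text, the values stay comparable between texts of different lengths, which matters when the inputs are only about 100 words long.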

1
    http://www.tlu.ee/~pentel/age_detector/
The implemented syllable counting algorithm, as well as the other automatic feature extraction procedures, can be seen in the source code of the prototype application.²
Finally, we created a simple web interface where everybody can test the prediction with free input or by copy-paste. As our classifier was trained on the Estonian language, sample Estonian texts are provided on the website for both age groups (Fig. 4).

[Figure 4: web interface with a free input form and sample texts for both age groups]
Figure 4. Prototype application at http://www.tlu.ee/~pentel/age_detector/

5. DISCUSSION & CONCLUSIONS
Automatic user age detection is a task of growing importance in cyber-safety and criminal investigations. One of the user profiling problems here is related to the amount of text needed to perform reliable prediction. Usually large training data sets are used to build such classification models, and longer texts are also needed to make assumptions about the author's age. In this paper we tested a novel set of features for age-based classification of very short texts. The features used, known as text readability features and employed by different readability formulas such as Gunning Fog, proved to be suitable for automatic age detection. Comparing different classification algorithms, we found that Logistic Regression and Support Vector Machines created the best models with our data and features, both giving over 90% classification accuracy.
While this study has generated encouraging results, it has some limitations. As different readability indexes measure how many years of education are needed to understand a text, we cannot assume that people's reading, or in our case writing, skills will continuously improve during their whole life. For most people, the writing skill level developed in high school will not improve further, and therefore it is impossible to discriminate between 25 and 30 year olds using only the features used in the current study. But these readability features might still be very useful in discriminating between younger age groups, for instance 7-9, 10-11 and 12-13. Another possible use of a similar approach is predicting the education level of an adult author.
In order to increase the reliability of the results, future studies should also include a larger sample. The value of our work is in demonstrating the suitability of a simple feature set for age-based classification of short texts. We anticipate a more systematic and in-depth study in the near future.

6. REFERENCES
[1] Burrows, J. 2007. All the way through: testing for authorship in different frequency strata. Literary and Linguistic Computing. 22, 1, pp. 27–47. Oxford University Press.
[2] Sanderson, C., and Guenter, S. 2007. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation. EMNLP'06. Association for Computational Linguistics. pp. 482–491. Stroudsburg, PA, USA.
[3] Rao, D. et al. 2010. Classifying latent user attributes in twitter. SMUC '10 Proceedings of the 2nd international workshop on Search and mining user-generated contents. pp. 37-44.
[4] Gunning, R. 1952. The Technique of Clear Writing. New York: McGraw-Hill.
[5] Pentel, A. 2014. A Comparison of Different Feature Sets for Age-Based Classification of Short Texts. Technical report. Tallinn University, Estonia. www.tlu.ee/~pentel/age_detector/Pentel_AgeDetection2b.pdf
[6] Luyckx, K. and Daelemans, W. 2011. The Effect of Author Set Size and Data Size in Authorship Attribution. Literary and Linguistic Computing, Vol-26, 1.
[7] Tam, J., Martell, C. H. 2009. Age Detection in Chat. International Conference on Semantic Computing.
[8] Lin, J. 2007. Automatic Author profiling of online chat logs. Postgraduate Thesis.
[9] Peersman, C. et al. 2011. Predicting Age and Gender in Online Social Networks. SMUC '11 Proceedings of the 3rd international workshop on Search and mining user-generated contents, pp 37-44, ACM New York, USA.
[10] Santosh, K. et al. 2013. Author Profiling: Predicting Age and Gender from Blogs. CEUR Workshop Proceedings, Vol-1179.
[11] Santosh, K. et al. 2014. Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors. UMAP Workshops 2014.
[12] Marquart, J. et al. 2014. Age and Gender Identification in Social Media. CEUR Workshop Proceedings, Vol-1180.
[13] Nguyen, D. et al. 2011. Age Prediction from Text using Linear Regression. LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. pp 115-123, Association for Computational Linguistics Stroudsburg, PA, USA.
[14] Wu, X. et al. 2008. Top 10 algorithms in data mining. Knowledge and Information Systems. vol 14, 1–37. Springer.
[15] Mihaescu, M. C. 2013. Applied Intelligent Data Analysis: Algorithms for Information Retrieval and Educational Data Mining, pp. 64-111. Zip publishing, Columbus, Ohio.
[16] Weka. Weka 3: Data Mining Software in Java. Machine Learning Group at the University of Waikato. http://www.cs.waikato.ac.nz/ml/weka/

2
    http://www.tlu.ee/~pentel/age_detector/source_code.txt