An Automatic Author Profiling from Non-Normative
                Lithuanian Texts
                       Monika Briedienė                                                          Jurgita Kapočiutė - Dzikienė
                  Vytautas Magnus University                                                      Vytautas Magnus University
                      Kaunas, Lithuania                                                                 Kaunas, Lithuania
                   monika.briediene@vdu.lt                                                     jurgita.kapociute-dzikiene@vdu.lt


  Copyright held by the author(s).

     Abstract - This paper presents author profiling research on Lithuanian texts using automatic
machine learning methods. The research is novel and challenging for the following reasons: 1) a
large number of author profiling dimensions, i.e., gender, age, education, marital status and
personality type; 2) very short (avg. ~ 24 tokens) non-normative texts; 3) the vocabulary-rich,
highly inflective Lithuanian language. Our experimental investigation identified the automatic
author profiling methods (in particular, classifiers and feature types) that reach the highest
accuracy on pure texts without any meta-information about their authors. Among the investigated
classifiers using lexical or symbolic features, the Naïve Bayes Multinomial method with the
character n-gram feature type yielded the best performance, reaching 84.3%, 52.7%, 79.6%, 76.6%
and 79.1% accuracy in the gender, age, education, marital status and personality type detection
tasks, respectively.

    Keywords - gender detection, age detection, education detection, marital status detection,
personality type detection, author profiling, the non-normative Lithuanian language, supervised
machine learning

                        I.     INTRODUCTION
    In today's world, electronic texts outnumber paper texts several times over. However, the vast
majority of these texts are written anonymously or pseudonymously. For this reason, court
analysts, web forum administrators and social network supervisors increasingly face
impersonation, bullying or harassment, disclosure of confidential information, dissemination of
disinformation, and other issues. Uncovering the exact identity of a person is a very complicated
and sometimes unsolvable task, whereas revealing his/her meta-information (i.e., demographic
features: age, gender, etc.) is easier, but still very useful. Revealed meta-information that,
e.g., a 50-year-old man is impersonating a 10-year-old girl may encourage the police to examine
the data in more detail or even take decisive action against the criminal offense. Manual
monitoring of the Internet and manual text analysis are hardly possible, because they require
enormous human resources. Thus, natural language processing technologies become the only
practical solution for tackling such problems.
    Author profiling experiments confirm that authors' characteristics can be determined by
analyzing the style of the text. This is possible due to the human stylome (an analogue of a
genome), which lets each person formulate sentences and express thoughts in his/her own unique
way [1]. Similarly, many research studies claim that this phenomenon occurs not only in the style
of an individual, but also in the style of groups sharing the same demographic characteristics
(such as age, gender, education or marital status) or the same personality type.
    In general, authorship identification has a long history dating back to 1887 [2], but in the
Internet era its popularity has grown dramatically. Therefore author profiling, which is
responsible for the automatic extraction of meta-information about an author (e.g., age [3],
gender [4], psychological status [5], etc.), is nowadays an active and important research area.
Author profiling research is mainly focused on the English language, whereas for the Lithuanian
language it is a rather new subject. The age, gender and political views profiling tasks have
been solved using parliamentary transcripts [6]; the age and gender profiling tasks have been
solved using Lithuanian literary texts [17]. However, these research works were done on rather
long (~ 217 tokens on average) and normative Lithuanian texts. The non-normative Lithuanian
language (the object of research in this paper) is much more complicated: it is full of
out-of-vocabulary words, jargon, foreign language insertions and neologisms. Besides, it faces
the important problem of ignored diacritics (where ą, č, ę, ė, į, š, ų, ū, ž are often replaced
with their ASCII equivalents). So far, author profiling on non-normative Lithuanian texts has
addressed the gender dimension only [7]. Moreover, the education, marital status and personality
type dimensions have never been addressed before on any type of Lithuanian texts. Consequently,
the purpose of this paper is to fill this gap, i.e., to offer methods (classifiers, their
parameters, and feature types) able to create automatic author profiles from short non-normative
Lithuanian texts (Facebook posts, comments and messages).
    The final goal of this research can be achieved after performing the following intermediate
tasks: (1) a related work analysis (see Section II), (2) construction of a representative corpus
of non-normative Lithuanian texts (see Section III), (3) an analytical selection of the most
promising methods (see Section IV), (4) a precise experimental evaluation of the selected methods
(see Section V). The conclusions (recommendations) and future research plans for the author
profiling tasks when using short non-normative Lithuanian texts are in Section VI.

                       II.   RELATED WORKS
     There are many methods used to deal with the author profiling task. All existing approaches
can be grouped according to the following criteria: the percentage of training instances in the
dataset, the amount of information they provide (i.e., the recognition-training feedback) and the
nature of knowledge. Based on these criteria, the approaches are [8]: Rule-based, Unsupervised
Machine Learning, Supervised Machine Learning, and Similarity-Based.
    The obsolete rule-based methods use rules constructed by human experts. The development
process itself is very difficult and requires linguistic competence. In addition, rules are
created for a specific solution and are therefore hardly transferable to new areas.

    Unsupervised machine learning (clustering) is chosen when no meta-information (i.e., no
training instances) is provided. Text examples are grouped according to their similarity. The
main disadvantage of these methods is that their grouping does not necessarily correspond to the
grouping a human would expect. Because of their usually low accuracy, these methods are not
popular in author profiling tasks.
     If texts are supplemented with the necessary meta-information about a particular author
characteristic (the so-called class), supervised machine learning is one of the two best choices.
The stylistic, lexical or symbolic text characteristics (the so-called features) are presented as
the input. The classifier summarizes the training information and creates a model as its output.
This model can afterwards be used for the author profiling of unseen texts. The main disadvantage
of all supervised machine learning methods is that they require comprehensive and representative
training data to create a reliable model. Their advantage is that they can be flexibly adjusted
to new tasks or areas by adding new text samples and retraining the classifier. The deep learning
methods [9] [10] (which recently became extremely popular for many text classification tasks) are
also representatives of this group. The popularity of Neural Networks (Convolutional [10],
Recurrent [9], etc.) has also been driven by technical progress, which has led to faster
computing and the processing of huge amounts of data. Deep learning is used for the author
profiling [10] and authorship attribution [9] tasks. Although deep learning methods are
successfully applied in many natural language processing tasks, on smaller datasets (as in our
paper) they underperform other supervised machine learning approaches, such as Support Vector
Machines or Naïve Bayes Multinomial [11]. The similarity-based approaches (often researched and
discussed separately) are by their nature very similar to the supervised machine learning
approaches. The only difference is that instead of creating a model, they preserve all training
instances and use similarity measures to determine which of the available classes an incoming
unseen instance is most similar to. An advantage of similarity-based methods is that they keep
the entire training set, so no information is lost during generalization.
    The majority of research on the author profiling tasks involves popular supervised approaches
(e.g., Naïve Bayes [12], Naïve Bayes Multinomial [13], Support Vector Machines [14]) and
similarity-based approaches (e.g., k-Nearest Neighbor), or comparative experiments showing the
superiority of Naïve Bayes Multinomial and Support Vector Machines (as in [15]). Since these
approaches are reported to be not only the most popular but also the most accurate for the author
profiling tasks, we focus further only on these types of methods.
    When analyzing the Lithuanian non-normative texts, we follow the recommendations formulated
for other languages. However, the language factor itself should also be taken into account. The
Lithuanian language (used in our research) has a rich vocabulary, morphology and word derivation
system, and a relatively free word order in a sentence. Although the Lithuanian language
(especially non-normative) is rather complicated, some of the previously mentioned language
characteristics do not necessarily have to complicate our tasks, i.e., it might turn out that our
investigated groups of individuals are bound to very different, but very representative,
non-normative sentence structures or vocabularies.

                        III.   CORPUS
    Unfortunately, author profiling benchmark corpora for the non-normative Lithuanian language
are not available on the Internet; therefore in this research we use a corpus created
specifically for our tasks. The corpus is composed of unprocessed posts (without any third-party
texts) manually harvested from the Facebook social network in 2016-2017. The author profiling
research for other languages mostly focuses on Twitter [15] rather than Facebook [16] texts. This
is due to the convenient APIs that help crawl tweets; besides, in some countries Twitter is more
popular than Facebook. In our work we have chosen the Facebook social network due to its
popularity in Lithuania and the opportunity to collect more demographic characteristics, such as
education and marital status (not only age or gender), reported by the users themselves.
    Our corpus contains posts, comments and messages of 200 individuals (for statistics see
Figure 1), one text per person (to avoid the impact of authorship attribution on the author
profiling results). 102 and 98 texts belong to women and men, respectively (see the Gender column
in Figure 2). The youngest participant is 18 years old, the oldest is 78, and the mean age of the
respondents is ~ 36.9. Respondents are divided into six age groups (see the Age column in
Figure 2). The selected grouping is used in surveys by psychologists, in social studies, and in
the largest European and Lithuanian data archives. Besides, it is also used in similar research
works [17], making our results more comparable to those previously reported for the Lithuanian
language.

    The education level of 105 and 95 respondents is higher and secondary, respectively (see the
Education column in Figure 2). 114 and 86 individuals reported being married and single,
respectively (see the Marital status column in Figure 2). 112 and 88 people described themselves
as extrovert and introvert, respectively (see the Personality type column in Figure 2).
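
    The class counts reported above already determine the two reference levels that Section V uses
to judge a result: the majority baseline max(p_i) and the random baseline sum of p_i^2. The
following minimal Python sketch computes both for the four binary dimensions. The dictionary and
function names are illustrative assumptions, and the age dimension is left out because its
six-group counts are given only in Figure 2.

```python
# Class counts as reported in Section III (binary dimensions only).
# The six age-group counts appear only in Figure 2, so age is omitted here.
CLASS_COUNTS = {
    "gender": {"women": 102, "men": 98},
    "education": {"higher": 105, "secondary": 95},
    "marital status": {"married": 114, "single": 86},
    "personality type": {"extrovert": 112, "introvert": 88},
}


def majority_baseline(counts):
    """Accuracy of always predicting the largest class: max(p_i)."""
    total = sum(counts.values())
    return max(counts.values()) / total


def random_baseline(counts):
    """Expected accuracy of guessing classes by their prior: sum(p_i ** 2)."""
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


for dimension, counts in CLASS_COUNTS.items():
    print(f"{dimension}: majority={majority_baseline(counts):.4f}, "
          f"random={random_baseline(counts):.4f}")
```

    For gender, for example, this gives a majority baseline of 0.51 and a random baseline of about
0.50, so an accuracy is only meaningful for this dimension if it clearly exceeds 0.51.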
    The corpus consists of 4,830 tokens in total (including in-vocabulary and out-of-vocabulary
words, numbers, and non-normative "words" with embedded digits or punctuation). The shortest text
(without symbols and emoticons) is only 2 tokens long, the longest is 161, and the average length
per text is only ~ 24 tokens.

   [Pie chart; legend: Posts, Comments, Messages; segments: 49%, 32%, 19%]
   Fig. 1 Percentage of posts, comments and messages in our corpus

   [Bar charts; panels: Gender, Age, Education, Marital status, Personality type]
   Fig. 2 Distribution of respondents according to characteristics

                        IV.    METHODOLOGY
    The methodological part covers two main directions: 1) the proper selection of the classifier
and 2) the proper selection of the feature type.

    To find the best-performing method, we analyzed classifiers from the following groups:
 • Supervised machine learning. A representative of this type is the Support Vector Machine (SVM)
   method (introduced by Cortes C. and Vapnik V. in 1995 [18]). It is a discriminative
   instance-based approach, currently considered one of the most popular text classification
   techniques. The method copes effectively with a huge number of features and sparse feature
   vectors, and does not perform an aggressive feature selection, which may result in the loss of
   valuable information and accuracy [19]. Other representatives are Naïve Bayes (NB) and its
   modification Naïve Bayes Multinomial (NBM) (introduced by Lewis D. D. and Gale W. A. in 1994
   [20]). These techniques are generative profile-based approaches, often chosen due to their
   simplicity and sufficiently high accuracy. The NB assumption of feature independence allows
   each parameter to be learned separately; these methods work especially well when the number of
   features of equal significance is high; they are fast and do not require large data storage
   resources. Moreover, Bayesian methods often serve as a baseline in evaluation.
 • Similarity-based. A representative of this type is the IBK method (introduced by Aha D. and
   Kibler D. in 1991 [21]). This nearest-neighbor classifier chooses an appropriate k value based
   on cross-validation, after evaluating the distances between a testing instance and all samples
   in the training set. Another representative is the Kstar method (introduced by Cleary J. G.
   and Trigg L. E. in 1995 [22]). In contrast to IBK, Kstar calculates not a distance measure but
   a similarity function. It differs from the other approaches of this type because it uses an
   entropy-based distance function. These two last-mentioned methods store
   all available instances and are therefore protected against information loss during training.

    Our second research direction involved the proper selection of the feature type. In our
experiments we have explored:
 • Lexical feature types: token unigrams (n=1) (individual tokens) and token tetra-grams (n=4)
   (sequences of 4 tokens in a window sliding one token at a time). For instance, from the phrase
   "author profiling from the Lithuanian texts", 6 unigrams would be generated: "author",
   "profiling", "from", "the", "Lithuanian", "texts", and 3 tetra-grams: "author profiling from
   the", "profiling from the Lithuanian", "from the Lithuanian texts".
 • Character features, in particular character n-grams, which like token n-grams are sequences of
   items, but contain characters instead of tokens. For instance, from the phrase "author
   profiling" the following document-level character 4-grams would be generated: "auth", "utho",
   "thor", "hor_", "or_p", "r_pr", etc. (where "_" marks the whitespace). It is important to
   mention that the value of n does not necessarily have to be fixed. E.g., with the interval
   n = [2,4], all bi-grams (n=2), trigrams (n=3), and tetra-grams (n=4) would be generated and
   used as features.

                    V.      EXPERIMENTS AND RESULTS
    Our experiments were carried out on the corpus described in Section III using the methods and
feature types described in Section IV.
    We used the implementations of the methods integrated into the WEKA 3.8 machine learning
toolkit (download from: http://www.cs.waikato.ac.nz/ml/weka/downloading.html). WEKA [23] allowed
both the extraction of features and the selection of the classifier.
    All experiments were performed using stratified 10-fold cross-validation and evaluated with
the accuracy (1) and f-score (2) metrics. The results are considered acceptable and reasonable if
the achieved author profiling accuracy is above the majority (3) and random (4) baselines.

$$Accuracy = \frac{tp + tn}{tp + tn + fp + fn} \qquad (1)$$

$$F\_score = \frac{2 \cdot tp}{2 \cdot tp + fp + fn} \qquad (2)$$

Here tp (true positives), tn (true negatives), fp (false positives) and fn (false negatives)
denote the numbers of correctly classified instances ($c_i$ as $c_i$, and $c_j$ as any other
$c_j$) and of incorrectly classified instances ($c_i$ as any other $c_j$, and any other $c_j$ as
$c_i$), respectively.

$$\max(p_i) \qquad (3)$$

$$\sum_i p_i^2 \qquad (4)$$

Here $p_i$ denotes the probability of the class.

    Our preliminary experiments involved selecting the most accurate classification technique when
using the word tokenizer with unigrams (n=1) (denoted as word1), the n-gram tokenizer with
unigrams (n=1) (lex1) and tetra-grams (n=4) (lex4), the alphabetic tokenizer with unigrams (n=1)
(alph1), and the character n-gram tokenizer with unigrams (n=1) (char1) and tetra-grams (n=4)
(char4) (the best results are presented in Figures 3-7). The overall best results were achieved
with the SVM and NBM methods and character n-grams (since the f-score values show the same trend
as the accuracies, we do not present them in the figures). These methods also demonstrated the
best performance in the author profiling tasks on the morphologically complex Arabic language
[15].
    In our later experiments we tuned the character n-gram parameter n while keeping the
classifier fixed to SVM or NBM (because these classifiers demonstrated the best performance in
the classifier selection experiments). The results obtained for the different author profiling
dimensions are reported in Figure 8.
    The overall best results (reaching 0.843 accuracy and 0.843 f-score) on the short
non-normative Lithuanian texts in the gender detection task were achieved with NBM and character
n-grams of n = [6, 7] as the feature type. The best results on the age dimension, reaching 0.527
accuracy and 0.473 f-score, were achieved with NBM and character n-grams (n = [5, 5]). In the
education detection task, NBM and character n-grams (n = [5, 5]) demonstrated the best
performance, reaching 0.796 accuracy and 0.796 f-score. Experiments with the marital status
showed the best results, reaching 0.766 accuracy and 0.767 f-score, with NBM and character
n-grams (n = [6, 6]). Tests with the personality type showed the superiority of NBM again: the
highest 0.791 accuracy and 0.792 f-score were achieved with character n-grams (n = [6, 6]). Thus,
the Naïve Bayes Multinomial classifier and the feature types reported above would be recommended
for similar tasks and languages.
    In contrast, the best previously reported age and gender profiling results on the normative
Lithuanian language were achieved with the SVM classifier and lemma bi-grams as the feature type
[17]. This is not surprising, bearing in mind that morphological tools (dealing with normative
texts) were maximally helpful. Besides, the second best feature type was also based on character
n-grams. Although our best method achieved slightly higher accuracy than previously reported, a
direct comparison is hardly possible due to the very different experimental conditions (datasets
and their sizes, language types, text lengths, etc.).
    In general, the gender detection task has been solved for a rather large group of languages,
reaching ~ 80% and ~ 56.53% accuracy on normative English in [4] and [24], respectively; 64.73%
on Spanish blogs in [24] and ~ 82.60% on Greek blogs [25]. On non-normative tweet texts the
obtained accuracies are still surprisingly high, reaching, e.g., ~ 98% on Arabic in [15] and
~ 99% on English in [26]. However, the reported results, especially for the English language,
vary widely (from ~ 56.53% in [24] to even ~ 99% in [26]). The age detection task is also
thoroughly researched for many
languages, reaching 64.0%, 43.80% and 19.09% on English texts [24], [3], [27]; 64.30% and 37.50%
on Spanish [24], [27]; 71.3% on Dutch [28]; and 80% on Chinese [29]. Research on the personality
type is mostly done on the normative English language [5] and reaches ~ 58.2% accuracy.
    Hence, the reported results differ widely because of the different test samples, methods and
languages. Due to these very different experimental conditions (different datasets, methods and
language types), the results are hardly comparable with one another, and they are hardly
comparable with the results obtained in our research work.
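
    The lexical and character n-gram feature types defined in Section IV, including the n
intervals such as n = [6, 7] tuned in this section, can be illustrated with a short Python
sketch. It only mirrors the feature definitions given in the text; it is not the WEKA tokenizer
code used in the experiments, and the function names and the way whitespace is replaced by "_"
are assumptions made for illustration.

```python
def token_ngrams(text, n):
    """Token n-grams: sequences of n tokens in a window sliding one token at a time."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min, n_max, space_marker="_"):
    """Document-level character n-grams for every n in the interval [n_min, n_max].

    Whitespace is replaced by `space_marker` so that it stays visible in the features.
    """
    marked = text.replace(" ", space_marker)
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


phrase = "author profiling from the Lithuanian texts"
print(token_ngrams(phrase, 1))                   # the 6 unigrams
print(token_ngrams(phrase, 4))                   # the 3 tetra-grams
print(char_ngrams("author profiling", 4, 4)[:6])  # first six character 4-grams
print(len(char_ngrams("author profiling", 2, 4)))  # bi-, tri- and tetra-grams together
```

    With an interval such as n = [2, 4] the feature set is simply the union of the n-grams for
each n in the interval, which is why wider intervals enlarge the feature space quickly.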


     Fig. 3 Accuracies (in percentage) obtained with different classifiers solving
the gender detection task (the upper horizontal line represents the majority
baseline, the lower one the random baseline). Every column shows the best result,
obtained with the labeled feature type: word tokenizer & unigrams are denoted as
word1, alphabetic tokenizer & unigrams as alph1, n-gram tokenizer & unigrams as
lex1, n-gram tokenizer & tetra-grams as lex4, character n-gram tokenizer &
unigrams as char1, and character n-gram tokenizer & tetra-grams as char4.
Column labels, left to right: word1, char4, char4, alph1, alph1.

     Fig. 4 Accuracies (in percentage) obtained with different classifiers solving
the age detection task. For the other notations see Fig. 3.
Column labels, left to right: lex4, char4, lex4, char1, char1.

     Fig. 5 Accuracies (in percentage) obtained with different classifiers solving
the education detection task. For the other notations see Fig. 3.

     Fig. 6 Accuracies (in percentage) obtained with different classifiers solving
the marital status detection task. For the other notations see Fig. 3.
Column labels, left to right: char4, char4, lex4, alph1, alph1.

     Fig. 7 Accuracies (in percentage) obtained with different classifiers solving
the personality type detection task. For the other notations see Fig. 3.
Column labels, left to right: char1, char4, lex1, lex4, lex4.
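
    The feature-type notation of Fig. 3 (word1, char4) and the two baselines
can be illustrated with a small, self-contained Python sketch. It is not the
setup used in the experiments. The tokenizers, the add-one smoothing, the
baseline definitions (majority baseline as the share of the largest class,
random baseline as the sum of squared class shares), all function names, and
the toy sentences are assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def word_unigrams(text):
    """word1-style features: lowercased word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n):
    """charN-style features: overlapping character n-grams (char4 for n=4)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def majority_baseline(labels):
    """Assumed definition: share of the most frequent class."""
    return max(Counter(labels).values()) / len(labels)


def random_baseline(labels):
    """Assumed definition: sum of squared class shares."""
    total = len(labels)
    return sum((c / total) ** 2 for c in Counter(labels).values())


class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for f in feats:
                if f in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy sentences invented for this sketch (written without diacritics);
# they are NOT taken from the paper's corpus.
train = [
    ("as buvau pavargusi", "F"),
    ("esu labai laiminga", "F"),
    ("as buvau pavarges", "M"),
    ("esu labai laimingas", "M"),
]
model = MultinomialNB().fit(
    [char_ngrams(text, 4) for text, _ in train],
    [label for _, label in train],
)

print(model.predict(char_ngrams("buvau pavargusi", 4)))  # F
print(model.predict(char_ngrams("esu laimingas", 4)))    # M
print(majority_baseline(["F", "F", "F", "M"]))           # 0.75
print(random_baseline(["F", "F", "F", "M"]))             # 0.625
```

    In the toy data the two classes differ only in word endings, so the
character tetra-grams covering those endings decide the prediction. This is
one plausible reason why character n-grams suit a highly inflective language,
although the sketch itself does not demonstrate it for real data.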


    Fig. 8 The best summarized accuracies (in percentage) for the different
profiling dimensions.

             VI.     CONCLUSION AND FUTURE WORK

    In this paper we report author profiling results obtained on short
(avg. ~24 tokens) non-normative Lithuanian texts harvested from the Facebook
social network. We investigated the most popular supervised machine learning
techniques (Naïve Bayes, Naïve Bayes Multinomial, Support Vector Machine) and
similarity-based techniques (IBK, KStar), combined with various lexical and
character feature types.

    For all five dimensions the best results were achieved with the Naïve
Bayes Multinomial method and character n-grams: 1) gender: 84.3% accuracy
with n = [6, 7]; 2) age: 52.7% with n = 5; 3) education: 79.6% with n = 5;
4) marital status: 76.6% with n = 6; 5) personality type: 79.1% with n = 6.

    In future research we will remain focused on non-normative Lithuanian
texts. We plan to enlarge our author profiling corpus and to test different
deep learning approaches on it.

                              REFERENCES

 [1] H. Van Halteren, R. H. Baayen, F. Tweedie, M. Haverkort, A. Neijt,
     "New Machine Learning Methods Demonstrate the Existence of a
     Human Stylome," Journal of Quantitative Linguistics, 2005.
 [2] T. C. Mendenhall, "The Characteristic Curves of Composition,"
     Science, 1887.
 [3] J. Schler, M. Koppel, S. Argamon, J. Pennebaker, "Effects of Age and
     Gender on Blogging," American Association for Artificial
     Intelligence, 2006.
 [4] M. Koppel, S. Argamon, A. R. Shimoni, "Automatically Categorizing
     Written Texts by Author Gender," Literary and Linguistic Computing,
     pp. 401-412, November 2002.
 [5] S. Argamon, S. Dhawle, M. Koppel, J. Pennebaker, "Lexical
     Predictors of Personality Type," Joint Annual Meeting of the Interface
     and the Classification Society of North America, June 2005.
 [6] J. Kapočiūtė-Dzikienė, L. Šarkutė, A. Utka, "Automatic author
     profiling of Lithuanian parliamentary speeches: exploring the
     influence of features and dataset sizes," Human Language
     Technologies – The Baltic Perspective, pp. 99-106, 2014.
 [7] M. Briedienė, J. Kapočiūtė-Dzikienė, "An automatic gender
     detection from non-normative Lithuanian texts," CEUR-WS, Kaunas,
     2017.
 [8] E. Stamatatos, "A Survey of Modern Author," Journal of the
     American Society for Information Science and Technology, 2009.
 [9] D. Bagnall, "Author identification using multi-headed recurrent,"
     PAN 2015, 2015.
[10] S. Sierra, M. Montes-y-Gómez, T. Solorio, F. A. González,
     "Convolutional Neural Networks for Author Profiling," Notebook for
     PAN at CLEF 2017, 2017.
[11] E. A. Zanaty, "Support Vector Machines (SVMs) versus Multilayer
     Perception (MLP) in data classification," Mathematics Dept.,
     Computer Science Section, Faculty of Science, Sohag University,
     Sohag, Egypt, 2012.
[12] M. Meina, K. Brodzinska, B. Celmer, M. Czokow, M. Patera, J.
     Pezacki, M. Wilk, "Ensemble-based classification for author profiling
     using," Notebook for PAN at CLEF 2013, 2013.
[13] T. Raghunadha Reddy, B. Vishnu Vardhan, P. Vijayapal Reddy,
     "Profile specific Document Weighted approach using a New Term
     Weighting Measure for Author Profiling," International Journal of
     Intelligent Engineering and Systems, pp. 136-146, December 2016.
[14] F. Rangel, F. Celli, P. Rosso, M. Potthast, B. Stein, W. Daelemans,
     "Overview of the 3rd Author Profiling Task at PAN 2015," 2015.
[15] E. AlSukhni, Q. Alequr, "Investigation the Use of Machine Learning
     Algorithms in Detecting Gender of the Arabic Tweet Author,"
     International Journal of Advanced Computer Science and
     Applications, 2016.
[16] M. Fatima, K. Hasan, S. Anwar, R. M. A. Nawab, "Multilingual
     author profiling on Facebook," Information Processing &
     Management, pp. 886-904, July 2017.
[17] J. Kapočiūtė-Dzikienė, A. Utka, L. Šarkutė, "Authorship Attribution
     and Author Profiling of Lithuanian Literary Texts," Proceedings of the
     5th Workshop on Balto-Slavic Natural Language Processing,
     pp. 96-105, September 2015.
[18] C. Cortes, V. Vapnik, "Support-Vector Networks," Machine
     Learning, pp. 273-297, 1995.
[19] T. Joachims, "Text Categorization with Support Vector Machines:
     Learning with Many Relevant Features," European Conference on
     Machine Learning, pp. 137-142, 1998.
[20] D. D. Lewis, W. A. Gale, "A Sequential Algorithm for Training Text
     Classifiers," SIGIR '94 Proceedings of the 17th annual international
     ACM SIGIR conference on Research and development in information
     retrieval, pp. 3-12, July 1994.
[21] D. Aha, D. Kibler, "Instance-based learning algorithms," Machine
     Learning, pp. 37-66, 1991.
[22] J. G. Cleary, L. E. Trigg, "K*: An Instance-based Learner Using an
     Entropic Distance Measure," Proceedings of the 12th International
     Conference on Machine Learning, 1995.
[23] 2016. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/.
[24] K. Santosh, R. Bansal, M. Shekhar, V. Varma, "Author Profiling:
     Predicting Age and Gender from Blogs," Notebook for PAN at CLEF
     2013, 2013.
[25] G. K. Mikros, "Authorship Attribution and Gender Identification in
     Greek Blogs," Methods and Applications of Quantitative Linguistics,
     pp. 21-32, 2012.
[26] Z. Miller, B. Dickinson, W. Hu, "Gender Prediction on Twitter Using
     Stream Algorithms with N-Gram Character Features," International
     Journal of Intelligence Science, pp. 143-148, 2012.
[27] J. Marquard, G. Farnadi, G. Vasudevan, M-F. Moens, S. Davalos, A.
     Teredesai, M. De Cock, "Age and Gender Identification in Social
     Media," CLEF 2014 working notes; PAN 2014, 2014.


[28] C. Peersman, W. Daelemans, L. Van Vaerenbergh, "Predicting Age
     and Gender in Online Social Networks," SMUC '11 Proceedings of
     the 3rd international workshop on Search and mining user-generated
     contents, pp. 37-44, 2010.

[29] L. Chen, T. Qian, F. Wang, Z. You, Q. Peng, M. Zhong,
     "Age Detection for Chinese Users in Weibo," WAIM 2015:
     Web-Age Information Management, 2015.

[30] A. Venckauskas, A. Karpavicius, R. Damaševičius, R.
     Marcinkevičius, J. Kapočiūtė-Dzikienė, C. Napoli, "Open class
     authorship attribution of Lithuanian Internet comments using
     one-class classifier," Federated Conference on Computer Science and
     Information Systems (FedCSIS), pp. 373-382, 2017.


[31] M. Wróbel, J. T. Starczewski, C. Napoli, "Handwriting
     recognition with extraction of letter fragments," International
     Conference on Artificial Intelligence and Soft Computing,
     pp. 183-192, 2017.

