<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Intelligent Engineering and Systems</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Automatic Author Profiling from Non-Normative Lithuanian Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Monika Briedienė</string-name>
          <email>monika.briediene@vdu.lt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jurgita Kapočiūtė-Dzikienė</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Vytautas Magnus University</institution>, Kaunas,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>99</fpage>
      <lpage>106</lpage>
      <abstract>
        <p>This paper presents author profiling research on Lithuanian texts using automatic machine learning methods. The research is novel and challenging for the following reasons: 1) a large number of author profiling dimensions, i.e., gender, age, education, marital status and personality type; 2) very short (avg. ~24 tokens) non-normative texts; 3) the vocabulary-rich, highly inflective Lithuanian language. Our experimental investigation identified the automatic author profiling methods (in particular, classifiers and feature types) that reach the highest accuracy on pure texts without any meta-information about their authors. Among the experimentally investigated classifiers using lexical or symbolic (character) features, the Naïve Bayes Multinomial method with the character n-gram feature type yielded the best performance, reaching 84.3%, 52.7%, 79.6%, 76.6% and 79.1% accuracy in the gender, age, education, marital status and personality type detection tasks, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>gender detection</kwd>
        <kwd>age detection</kwd>
        <kwd>education detection</kwd>
        <kwd>marital status detection</kwd>
        <kwd>personality type detection</kwd>
        <kwd>author profiling</kwd>
        <kwd>the non-normative Lithuanian language</kwd>
        <kwd>supervised machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In today’s world, the number of electronic texts exceeds the number of paper texts several times over. However, the vast majority of these texts are written anonymously or pseudonymously. For this reason, court analysts, web forum administrators and social network supervisors increasingly face impersonation, bullying or harassment, disclosure of confidential information, dissemination of disinformation, and other issues. Uncovering the exact identity of a person is a very complicated and sometimes unsolvable task, whereas revealing his/her meta-information (i.e., demographic features: age, gender, etc.) is easier but still very useful. The revealed meta-information that, e.g., a 50-year-old man is impersonating a 10-year-old girl may encourage the police to examine the data in more detail or even to take decisive action against the criminal offense. Manual monitoring of the Internet space and manual text analysis are hardly possible, because they require enormous human resources. Thus, natural language processing technologies become the only solution for tackling such problems.</p>
      <p>The author profiling experimental investigations confirm that the authors’ characteristics can be determined by analyzing their texts.</p>
      <p>In general, authorship identification has a long history dating back to 1887 [<xref ref-type="bibr" rid="ref2">2</xref>], but its popularity has grown dramatically in the Internet era. Therefore, author profiling, which is responsible for the automatic extraction of meta-information about an author (e.g., age [<xref ref-type="bibr" rid="ref3">3</xref>], gender [<xref ref-type="bibr" rid="ref4">4</xref>], psychological status [<xref ref-type="bibr" rid="ref5">5</xref>], etc.), is nowadays an active and important research area. Author profiling research is mainly focused on the English language, whereas for the Lithuanian language it is a rather new subject. The age, gender and political views profiling tasks have been solved using parliamentary transcripts [<xref ref-type="bibr" rid="ref6">6</xref>]; the age and gender profiling tasks have been solved using Lithuanian literary texts [<xref ref-type="bibr" rid="ref15">17</xref>]. However, this research was done on rather long (~217 tokens on average) and normative Lithuanian texts. The non-normative Lithuanian language (the object of research in this paper) is much more complicated: it is full of out-of-vocabulary words, jargon, foreign language insertions and neologisms. Besides, it faces the important problem of ignored diacritics (where ą, č, ę, ė, į, š, ų, ū, ž are often replaced with the corresponding ASCII equivalents). So far, the author profiling task on non-normative Lithuanian texts has been addressed for the gender dimension only [<xref ref-type="bibr" rid="ref7">7</xref>]. Moreover, author profiling on the education, marital status and personality type dimensions has never been solved before on any type of Lithuanian text. Consequently, the purpose of this paper is to fill the above-mentioned gap, i.e., to offer methods (classifiers, their parameters and feature types) able to create automatic author profiles from short non-normative Lithuanian texts (Facebook posts, comments and messages).</p>
      <p>The final goal of this research can be achieved after performing the following intermediate tasks: (1) a related work analysis (see Section II), (2) the construction of a representative corpus containing non-normative Lithuanian texts (see Section III), (3) an analytical selection of the most promising methods (see Section IV), (4) a precise experimental evaluation of the selected methods (see Section V). The conclusions (recommendations) and future research plans for author profiling tasks on short non-normative Lithuanian texts are given in Section VI.</p>
      <p>Copyright held by the author(s).</p>
    </sec>
    <sec id="sec-7">
      <sec id="sec-7-1">
        <title>II. RELATED WORKS</title>
      </sec>
    </sec>
    <sec id="sec-17">
      <p>There are many methods used to deal with the author profiling task. All existing approaches can be grouped according to the following criteria: the percentage of training instances in the dataset, the amount of information they provide (i.e., recognition-training feedback) and the nature of knowledge. Based on these criteria, the approaches are [8]: rule-based, unsupervised machine learning, supervised machine learning, and similarity-based.</p>
      <p>The obsolete rule-based methods use rules constructed by human experts. The development process itself is very difficult and requires linguistic competence. In addition, rules are created for a specific solution and are therefore hardly transferable to new areas.</p>
      <p>Unsupervised machine learning (i.e., clustering) is chosen when no meta-information (i.e., no training instances) is provided. Text examples are grouped according to their similarity. The main disadvantage of these methods is that their grouping does not necessarily correspond to the grouping a human would produce. Because of their usually low accuracy, these methods are not popular in author profiling tasks.</p>
      <p>If texts are supplemented with the necessary meta-information about a particular author characteristic (the so-called class), supervised machine learning is one of the two best choices. The stylistic, lexical or symbolic text characteristics (the so-called features) are presented as the input. The classifier generalizes the training information and creates a model as its output. This model can afterwards be used for the author profiling of unseen texts. The main disadvantage of all supervised machine learning methods is that they require comprehensive and representative training data to create a reliable model. Their advantage is that they can be flexibly adjusted to new tasks or areas by adding new text samples and retraining the classifier. The deep learning methods [<xref ref-type="bibr" rid="ref8">9</xref>] [<xref ref-type="bibr" rid="ref9">10</xref>] (which have recently become extremely popular for many text classification tasks) are also representatives of this group. The popularity of neural networks (convolutional [<xref ref-type="bibr" rid="ref9">10</xref>], recurrent [<xref ref-type="bibr" rid="ref8">9</xref>], etc.) has also been growing recently. This popularity has been driven by technical progress, which has led to faster computing and the processing of huge amounts of data. Deep learning is used for the author profiling [<xref ref-type="bibr" rid="ref9">10</xref>] and authorship attribution [<xref ref-type="bibr" rid="ref8">9</xref>] tasks. Although deep learning methods are successfully applied in many natural language processing tasks, on smaller datasets (as in our paper) they underperform other supervised machine learning approaches, such as Support Vector Machines or Naïve Bayes Multinomial [<xref ref-type="bibr" rid="ref10">11</xref>].</p>
      <p>The similarity-based approaches (often researched and discussed separately) are very similar in nature to the supervised machine learning approaches. The only difference is that instead of creating a model, they preserve all training instances and use similarity measures to determine which of the available classes an incoming unseen instance is most similar to. An advantage of similarity-based methods is that they keep the entire training set, so no information is lost during generalization.</p>
      <p>The majority of research on author profiling tasks involves these popular supervised approaches (e.g., Naïve Bayes [<xref ref-type="bibr" rid="ref11">12</xref>], Naïve Bayes Multinomial [13], Support Vector Machines [14]) and similarity-based approaches (e.g., k-Nearest Neighbor), or comparative experiments demonstrating the superiority of Naïve Bayes Multinomial and Support Vector Machines (as in [<xref ref-type="bibr" rid="ref13">15</xref>]). Since these approaches have been shown to be not only the most popular but also the most accurate for author profiling tasks, we focus only on these types of methods in the rest of the paper.</p>
      <p>When analyzing the Lithuanian non-normative texts, we follow the recommendations formulated for other languages. However, the language factor itself should also be taken into account. The Lithuanian language (used in our research) has a rich vocabulary, morphology and word derivation system, and a relatively free word order in a sentence. Although the Lithuanian language (especially non-normative) is rather complicated, some of the previously mentioned language characteristics do not necessarily have to complicate our tasks: it might turn out that the investigated groups of individuals are bound to very different but very representative non-normative sentence structures or vocabularies.</p>
    </sec>
    <sec id="sec-18">
      <title>III. CORPUS</title>
      <p>Unfortunately, author profiling benchmark corpora for the non-normative Lithuanian language are not available on the Internet; therefore, in this research we use a corpus created specifically for our tasks. The corpus is composed of unprocessed posts (without any third-party texts) manually harvested from the Facebook social network in the period 2016-2017. Author profiling research for other languages mostly focuses on Twitter [<xref ref-type="bibr" rid="ref13">15</xref>] rather than Facebook [<xref ref-type="bibr" rid="ref14">16</xref>] texts. This is due to the convenient APIs that help crawl tweets; besides, in some countries Twitter is more popular than Facebook. In our work we have chosen the Facebook social network because of its popularity in Lithuania and the opportunity it gives to store more demographic characteristics, such as education and marital status (not only age or gender), reported by the users themselves.</p>
      <p>Our corpus contains posts, comments and messages of 200 individuals (for statistics see Figure 1), one text per person (to avoid the impact of authorship attribution on the author profiling results). 102 and 98 texts belong to women and men, respectively (see the Gender column in Figure 2). The youngest participant is 18 years old, the oldest is 78, and the mean age of the respondents is ~36.9. Respondents are divided into six age groups (see the Age column in Figure 2). The selected grouping is used in surveys by psychologists, in social studies and in the largest European and Lithuanian data archives. Besides, it is also used in similar research [<xref ref-type="bibr" rid="ref15">17</xref>], making our results more comparable to those previously reported for the Lithuanian language.</p>
      <p>The education level of 105 and 95 respondents is higher and secondary, respectively (see the Education column in Figure 2). 114 and 86 individuals claimed to be married and single, respectively (see the Marital status column in Figure 2). 112 and 88 people described themselves as extrovert and introvert, respectively (see the Personality type column in Figure 2).</p>
      <p>The corpus consists of 4,830 tokens in total (including in-vocabulary and out-of-vocabulary words, numbers, and non-normative “words” with embedded digits or punctuation). The shortest text (without symbols and emoticons) is only 2 tokens long, the longest is 161, and the average length per text is only ~24 tokens.</p>
      <p>[Figure 1. Corpus statistics by text type; recovered panel labels: Posts, Comments.]</p>
    </sec>
    <sec id="sec-22">
      <p>[Figure 2. Distribution of the corpus texts; recovered column labels: Gender, Age, Education, Marital status, Personality type.]</p>
      <p>The methodological part covers two main directions: 1) the proper selection of the classifier and 2) the proper selection of the feature type.</p>
      <p>To find the best option, we have analyzed the following classifiers of these groups:</p>
      <list list-type="bullet">
        <list-item>
          <p>Supervised machine learning. A representative of this type is the Support Vector Machine (SVM) method (introduced by Cortes C. and Vapnik V. in 1995 [<xref ref-type="bibr" rid="ref16">18</xref>]). It is a discriminative instance-based approach, currently considered one of the most popular text classification techniques. The method copes effectively with a huge number of features and sparse feature vectors, and does not perform aggressive feature selection, which may result in the loss of valuable information and accuracy [<xref ref-type="bibr" rid="ref17">19</xref>]. Other representatives are Naïve Bayes (NB) and its modification Naïve Bayes Multinomial (NBM) (introduced by Lewis D. D. and Gale W. A. in 1994 [<xref ref-type="bibr" rid="ref18">20</xref>]). These techniques are generative profile-based approaches, often chosen for their simplicity and sufficiently high accuracy. The NB assumption of feature independence allows each parameter to be learned separately; these methods work especially well when the number of features of equal significance is high; they are fast and do not require large data storage resources. Moreover, Bayesian methods often play a baseline role in the evaluation.</p>
        </list-item>
        <list-item>
          <p>Similarity-based. A representative of this type is the IBK method (introduced by Aha D. and Kibler D. in 1991 [<xref ref-type="bibr" rid="ref19">21</xref>]). This nearest-neighbor classifier chooses the appropriate k value based on cross-validation, after evaluating the distance between a testing instance and all samples in the training set. Another representative is the Kstar method (introduced by Cleary J. G. and Trigg L. E. in 1995 [<xref ref-type="bibr" rid="ref20">22</xref>]). In contrast to IBK, Kstar calculates not a distance measure but a similarity function. It differs from the other approaches of this type because it uses an entropy-based distance function. Both of these methods store all available instances and therefore lose no information during training.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-29">
      <p>Our second research direction involved the proper selection of the feature type. In our experiments we have explored:</p>
      <list list-type="bullet">
        <list-item>
          <p>Lexical feature types: token unigrams (n = 1), i.e., individual tokens, and token tetra-grams (n = 4), i.e., sequences of 4 tokens in a window sliding one token at a time. For instance, the phrase “author profiling from the Lithuanian texts” generates 6 unigrams (“author”, “profiling”, “from”, “the”, “Lithuanian”, “texts”) and 3 tetra-grams (“author profiling from the”, “profiling from the Lithuanian”, “from the Lithuanian texts”).</p>
        </list-item>
        <list-item>
          <p>Character feature types: character n-grams are, like token n-grams, sequences of items, but they contain characters instead of tokens. For instance, the phrase “author profiling” generates the following document-level character 4-grams: “auth”, “utho”, “thor”, “hor_”, “or_p”, “r_pr”, etc. (where “_” marks the whitespace). It is important to mention that the value of n does not necessarily have to be fixed. E.g., with the interval n = [2, 4], all bi-grams (n = 2), tri-grams (n = 3) and tetra-grams (n = 4) are generated and used as features.</p>
        </list-item>
      </list>
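      <p>The two feature types described above can be illustrated with a minimal Python sketch. It is not the WEKA tokenizer used in the experiments; the function names <monospace>token_ngrams</monospace> and <monospace>char_ngrams</monospace>, whitespace tokenization and the “_” placeholder for spaces are assumptions made only for this illustration.</p>
      <preformat><![CDATA[
```python
def token_ngrams(text, n):
    """Return token n-grams from a window sliding one token at a time."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min, n_max, space="_"):
    """Return document-level character n-grams for every n in [n_min, n_max]."""
    chars = text.replace(" ", space)  # mark whitespace as in the example phrase
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(chars[i:i + n] for i in range(len(chars) - n + 1))
    return grams


phrase = "author profiling from the Lithuanian texts"
print(token_ngrams(phrase, 1))                      # the 6 unigrams
print(token_ngrams(phrase, 4))                      # the 3 tetra-grams
print(char_ngrams("author profiling", 4, 4)[:6])    # first character 4-grams
print(len(char_ngrams("author profiling", 2, 4)))   # features for n = [2, 4]
```
]]></preformat>
      <p>With an interval such as n = [2, 4] the feature lists of the individual n values are simply concatenated, which is why the number of features grows quickly even for very short texts.</p>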
      <sec id="sec-29-1">
        <title>V. EXPERIMENTS AND RESULTS</title>
      </sec>
    </sec>
    <sec id="sec-33">
      <p>Our experiments were carried out on the corpus described in Section III using the methods and feature types described in Section IV.</p>
      <p>We used the implementations of the methods integrated into the WEKA 3.8 machine learning toolkit<sup>1</sup>. WEKA [<xref ref-type="bibr" rid="ref21">23</xref>] allowed both the extraction of features and the selection of the classifier.</p>
      <p>All experiments were performed using stratified 10-fold cross-validation and evaluated with the accuracy (1) and f-score (2) metrics. The results are considered acceptable and reasonable if the achieved author profiling accuracy is above the majority (3) and random (4) baselines.</p>
      <disp-formula>
        <label>(1)</label>
        <tex-math>\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}</tex-math>
      </disp-formula>
      <disp-formula>
        <label>(2)</label>
        <tex-math>\mathrm{f\text{-}score} = \frac{2\,tp}{2\,tp + fp + fn}</tex-math>
      </disp-formula>
      <p>Here tp (true positives) and tn (true negatives) denote the numbers of correctly classified instances (c<sub>i</sub> with c<sub>i</sub>, and c<sub>j</sub> with any other c<sub>j</sub>, respectively), while fp (false positives) and fn (false negatives) denote the numbers of incorrectly classified instances (c<sub>i</sub> with any other c<sub>j</sub>, and any other c<sub>j</sub> with c<sub>i</sub>, respectively).</p>
      <disp-formula>
        <label>(3)</label>
        <tex-math>\mathrm{majority\ baseline} = \max_{j} P(c_j)</tex-math>
      </disp-formula>
      <disp-formula>
        <label>(4)</label>
        <tex-math>\mathrm{random\ baseline} = \sum_{j} P(c_j)^{2}</tex-math>
      </disp-formula>
      <p>Here P(c<sub>j</sub>) denotes the probability of the class c<sub>j</sub>.</p>
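      <p>The evaluation metrics and baselines named above can be sketched in a few lines of Python. The sketch assumes the commonly used definitions (accuracy and f-score computed from tp, tn, fp, fn; majority baseline as the largest class probability; random baseline as the sum of squared class probabilities); the function names are hypothetical, and the class sizes are those reported for the corpus in Section III. The confusion counts in the last line are illustrative only.</p>
      <preformat><![CDATA[
```python
def accuracy(tp, tn, fp, fn):
    """Share of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of tp, fp, fn."""
    return 2 * tp / (2 * tp + fp + fn)


def majority_baseline(class_counts):
    """Accuracy of always predicting the largest class (assumed definition)."""
    return max(class_counts) / sum(class_counts)


def random_baseline(class_counts):
    """Expected accuracy of guessing by class probability (assumed definition)."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


# Class sizes taken from the corpus description (binary dimensions only).
corpus = {
    "gender": [102, 98],
    "education": [105, 95],
    "marital status": [114, 86],
    "personality type": [112, 88],
}
for dimension, counts in corpus.items():
    print(f"{dimension}: majority={majority_baseline(counts):.3f}, "
          f"random={random_baseline(counts):.3f}")

# Illustrative confusion counts, not taken from the experiments.
print(accuracy(tp=40, tn=45, fp=5, fn=10), f_score(tp=40, fp=5, fn=10))
```
]]></preformat>
      <p>Because the binary dimensions of the corpus are almost balanced, both baselines stay close to 0.5, so any accuracy clearly above that level indicates that the classifier has learned something from the texts.</p>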
      <p>Our preliminary experiments involved the selection of the most accurate classification technique when using the word tokenizer with unigrams (n = 1) (denoted as word1), the n-gram tokenizer with unigrams (n = 1) (lex1) and tetra-grams (n = 4) (lex4), the alphabetic tokenizer with unigrams (n = 1) (alph1), and the character n-gram tokenizer with unigrams (n = 1) (char1) and tetra-grams (n = 4) (char4) (the best results for each author profiling dimension are presented in the corresponding figures). The best performance was demonstrated by the SVM and NBM methods and character n-grams<sup>2</sup>. These methods also demonstrated the best performance in author profiling tasks on the morphologically complex Arabic language [<xref ref-type="bibr" rid="ref13">15</xref>].</p>
    </sec>
    <sec id="sec-34">
      <p>In our later experiments we tuned the character n-gram parameter n while keeping the classifier fixed to SVM or NBM (because these classifiers demonstrated the best performance in the classifier selection experiments). The results obtained for the different author profiling dimensions are reported in Figure 8.</p>
      <p>The overall best results on the short non-normative Lithuanian texts in the gender detection task (0.843 accuracy and 0.843 f-score) were achieved with NBM and character n-grams of n = [6, 7] as the feature type. On the age dimension, the best results (0.527 accuracy and 0.473 f-score) were achieved with NBM and character n-grams of n = [5, 5]. In education detection, NBM and character n-grams of n = [5, 5] demonstrated the best performance, reaching 0.796 accuracy and 0.796 f-score. Experiments with the marital status showed the best results (0.766 accuracy and 0.767 f-score) with NBM and character n-grams of n = [6, 6]. Tests with the personality type again confirmed the superiority of NBM: the highest accuracy of 0.791 and f-score of 0.792 were achieved with character n-grams of n = [6, 6]. Thus, the Naïve Bayes Multinomial classifier and the feature types reported above are recommended for similar tasks and languages.</p>
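      <p>To make the recommended combination concrete, the following Python sketch implements a multinomial Naïve Bayes classifier over character n-grams from scratch. It is not the WEKA implementation used in the experiments: the class and function names, the add-one (Laplace) smoothing, the handling of unseen n-grams and the toy training texts with the labels <monospace>class_a</monospace> and <monospace>class_b</monospace> are all assumptions made only for illustration.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min, n_max):
    """Character n-grams for every n in [n_min, n_max]; spaces become '_'."""
    chars = text.replace(" ", "_")
    return [chars[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(chars) - n + 1)]


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, n_min=2, n_max=3):
        self.n_min, self.n_max = n_min, n_max
        self.class_docs = Counter(labels)            # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        for doc, label in zip(docs, labels):
            self.feature_counts[label].update(char_ngrams(doc, n_min, n_max))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for gram in char_ngrams(doc, self.n_min, self.n_max):
                if gram in self.vocab:               # skip unseen n-grams
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for illustration only; not taken from the corpus.
train_docs = ["labas rytas", "labas vakaras", "good morning", "good evening"]
train_labels = ["class_a", "class_a", "class_b", "class_b"]
model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict("labas"), model.predict("good"))
```
]]></preformat>
      <p>Working in log space avoids numerical underflow when many n-gram probabilities are multiplied, and the smoothing keeps an n-gram that is missing from one class from driving that class probability to zero.</p>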
      <p>In contrast, the best previously reported age and gender profiling results on the normative Lithuanian language were achieved with the SVM classifier and lemma bi-grams as the feature type [<xref ref-type="bibr" rid="ref15">17</xref>]. This is not surprising, bearing in mind that morphological tools (which deal with normative texts) were maximally helpful there. Besides, the second-best feature type was also based on character n-grams. Although our best method achieved slightly higher accuracy than previously reported, a direct comparison is hardly possible due to the very different experimental conditions (datasets and their sizes, language types, text lengths, etc.).</p>
    </sec>
    <sec id="sec-35">
      <p>In general, the gender detection task has been solved for a rather large group of languages, reaching ~80% and ~56.53% accuracy on normative English in [<xref ref-type="bibr" rid="ref4">4</xref>] and [<xref ref-type="bibr" rid="ref22">24</xref>], respectively; 64.73% on Spanish blogs in [<xref ref-type="bibr" rid="ref22">24</xref>]; and ~82.60% on Greek blogs [<xref ref-type="bibr" rid="ref23">25</xref>]. On non-normative tweet texts the obtained accuracies are still surprisingly high, reaching, e.g., ~98% on Arabic in [<xref ref-type="bibr" rid="ref13">15</xref>] and ~99% on English in [<xref ref-type="bibr" rid="ref24">26</xref>]. However, the reported results, especially for the English language, vary widely (from ~56.53% in [<xref ref-type="bibr" rid="ref22">24</xref>] to as much as ~99% in [<xref ref-type="bibr" rid="ref24">26</xref>]). The age detection task has also been thoroughly researched for many languages, reaching 64.0%, 43.80% and 19.09% on English texts [<xref ref-type="bibr" rid="ref22">24</xref>], [<xref ref-type="bibr" rid="ref3">3</xref>], [<xref ref-type="bibr" rid="ref25">27</xref>]; 64.30% and 37.50% on Spanish [<xref ref-type="bibr" rid="ref22">24</xref>], [<xref ref-type="bibr" rid="ref25">27</xref>]; 71.3% on Dutch [<xref ref-type="bibr" rid="ref26">28</xref>]; and 80% on Chinese [<xref ref-type="bibr" rid="ref27">29</xref>]. Research on the personality type is mostly done on the normative English language [<xref ref-type="bibr" rid="ref5">5</xref>] and reaches ~58.2% accuracy.</p>
      <p><sup>1</sup> Download from: http://www.cs.waikato.ac.nz/ml/weka/downloading.html</p>
      <p><sup>2</sup> Since the f-score values demonstrate the same trend as the accuracies, we do not present them in the figures.</p>
    </sec>
    <sec id="sec-37">
      <p>Hence, the observed results are very different, due to the different test samples, methods or chosen languages.</p>
      <p>Due to the very different experimental conditions (different datasets, methods and language types), these results are hardly comparable with each other, and they are also hardly comparable with the results obtained in our research.</p>
    </sec>
    <sec id="sec-38">
      <p>In this paper we report the author profiling task results obtained on short (avg. ~24 tokens) non-normative Lithuanian texts harvested from the Facebook social network.</p>
      <p>During our research we investigated the most popular supervised machine learning (Naïve Bayes, Naïve Bayes Multinomial, Support Vector Machine) and similarity-based (IBK, Kstar) techniques, as well as various lexical and character feature types.</p>
      <p>The best results on the 1) gender (84.3% accuracy), 2) age (52.7%), 3) education (79.6%), 4) marital status (76.6%) and 5) personality type (79.1%) dimensions were achieved with 1) Naïve Bayes Multinomial and character n-grams of n = [6, 7]; 2) Naïve Bayes Multinomial and character n-grams of n = 5; 3) Naïve Bayes Multinomial and character n-grams of n = 5; 4) Naïve Bayes Multinomial and character n-grams of n = 6; 5) Naïve Bayes Multinomial and character n-grams of n = 6.</p>
      <p>In future research our focus on non-normative Lithuanian texts remains. We are planning to enlarge our author profiling corpus and to test different deep learning approaches on it.</p>
    </sec>
    <sec id="sec-42">
      <title>REFERENCES</title>
      <p>[8] E. Stamatatos, "A Survey of Modern Authorship Attribution Methods," Journal of the American Society for Information Science and Technology, 2009.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Van Halteren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Baayen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tweedie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haverkort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neijt</surname>
          </string-name>
          ,
          <article-title>"New Machine Learning Methods Demonstrate the Existence of a Human Stylome,"</article-title>
          <source>Journal of Quantitative Linguistics</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Mendenhall</surname>
          </string-name>
          ,
          <article-title>"The Characteristic Curves of Composition,"</article-title>
          <source>Science</source>
          ,
          <year>1887</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          ,
          <article-title>"Effects of Age and Gender on Blogging,"</article-title>
          <source>American Association for Artificial Intelligence</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Shimoni</surname>
          </string-name>
          ,
          <article-title>"Automatically Categorizing Written Texts by Author Gender,"</article-title>
          <source>Literary and Linguistic Computing</source>
          , pp.
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          ,
          <year>November 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhawle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          ,
          <article-title>"Lexical Predictors of Personality Type,"</article-title>
          <source>Joint Annual Meeting of the Interface and the Classification Society of North America</source>
          ,
          <year>June 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapočiūtė-Dzikienė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Šarkutė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Utka</surname>
          </string-name>
          ,
          <article-title>"Automatic author profiling of Lithuanian parliamentary speeches : exploring the</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Briedienė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapočiūtė-Dzikienė</surname>
          </string-name>
          ,
          <article-title>"An automatic gender detection from non-normative Lithuanian texts,"</article-title>
          <source>CEUR-WS</source>
          , Kaunas,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bagnall</surname>
          </string-name>
          ,
          <article-title>"Author identification using multi-headed recurrent,"</article-title>
          <source>PAN 2015</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sierra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y-Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Solorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <article-title>"Convolutional Neural Networks for Author Profiling,"</article-title>
          <source>Notebook for PAN at CLEF 2017</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Zanaty</surname>
          </string-name>
          ,
          <article-title>"Support Vector Machines (SVMs) versus Multilayer Perception (MLP) in data classification,"</article-title>
          <source>Mathematics Dept., Computer Science Section, Faculty of Science, Sohag University, Sohag, Egypt</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Meina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Brodzinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Celmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Czokow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pezacki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wilk</surname>
          </string-name>
          ,
          <article-title>"Ensemble-based classification for author profiling using,"</article-title>
          <source>Notebook for PAN at CLEF 2013</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          ,
          <article-title>"Overview of the 3rd Author Profiling Task at PAN 2015,"</article-title>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>AlSukhni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Alequr</surname>
          </string-name>
          ,
          <article-title>"Investigation the Use of Machine Learning Algorithms in Detecting Gender of the Arabic Tweet Author,"</article-title>
          <source>International Journal of Advanced Computer Science and Applications</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fatima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M. A.</given-names>
            <surname>Nawab</surname>
          </string-name>
          ,
          <article-title>"Multilingual author profiling on Facebook,"</article-title>
          <source>Information Processing &amp; Management</source>
          , pp.
          <fpage>886</fpage>
          -
          <lpage>904</lpage>
          ,
          <year>July 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapočiūtė-Dzikienė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Utka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Šarkutė</surname>
          </string-name>
          ,
          <article-title>"Authorship Attribution and Author Profiling of Lithuanian Literary Texts,"</article-title>
          <source>Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing</source>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>"Support-Vector Networks,"</article-title>
          <source>Machine Learning</source>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>"Text Categorization with Support Vector Machines: Learning with Many Relevant Features,"</article-title>
          <source>European Conference on Machine Learning</source>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <article-title>"A Sequential Algorithm for Training Text Classifiers,"</article-title>
          <source>SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>July 1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Aha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kibler</surname>
          </string-name>
          ,
          <article-title>"Instance-based learning algorithms,"</article-title>
          <source>Machine Learning</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Cleary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Trigg</surname>
          </string-name>
          ,
          <article-title>"K*: An Instance-based Learner Using an Entropic Distance Measure,"</article-title>
          <source>In Proceedings of the 12th International Conference on Machine Learning</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [23]
          <year>2016</year>
          . [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Santosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>"Author Profiling: Predicting Age and Gender from Blogs,"</article-title>
          <source>Notebook for PAN at CLEF 2013</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Mikros</surname>
          </string-name>
          ,
          <article-title>"Authorship Attribution and Gender Identification in Greek Blogs,"</article-title>
          <source>Methods and Applications of Quantitative Linguistics</source>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dickinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>"Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features,"</article-title>
          <source>International Journal of Intelligence Science</source>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>148</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [27]
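          <string-name>
            <given-names>J.</given-names>
            <surname>Marquardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Farnadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Davalos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Teredesai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De Cock</surname>
          </string-name>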
          ,
          <article-title>"Age and Gender Identification in Social Media,"</article-title>
          <source>CLEF 2014 working notes; PAN 2014</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peersman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Vaerenbergh</surname>
          </string-name>
          ,
          <article-title>"Predicting Age and Gender in Online Social Networks,"</article-title>
          <source>SMUC '11 Proceedings of the 3rd international workshop on Search and mining user-generated contents</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [29]
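          <string-name>
            <given-names>Li</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tieyun</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fei</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhenni</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qingxi</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <article-title>"Age Detection for Chinese Users in Weibo,"</article-title>
          <source>WAIM 2015: Web-Age Information Management</source>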
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Venckauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpavicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Damaševičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marcinkevičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapočiūtė-Dzikienė</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>"Open class authorship attribution of Lithuanian Internet comments using one-class classifier,"</article-title>
          <source>In Federated Conference on Computer Science and Information Systems (FedCSIS)</source>
          , pp.
          <fpage>373</fpage>
          -
          <lpage>382</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wróbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.T.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>"Handwriting recognition with extraction of letter fragments,"</article-title>
          <source>In International Conference on Artificial Intelligence and Soft Computing</source>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>192</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>