<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An automatic gender detection from non-normative Lithuanian texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Monika Briedienė</string-name>
          <email>monika.briediene@fc.vdu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vytautas Magnus University</institution>
          <addr-line>K. Donelaičio 58, LT-44248, Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vytautas Magnus University</institution>
          <addr-line>K. Donelaičio 58, LT-44248, Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <fpage>75</fpage>
      <lpage>79</lpage>
      <abstract>
        <p>This paper describes gender detection research on Lithuanian texts using automatic machine learning methods. The main contribution of our work is the investigation of very short (avg. ~39 tokens) non-normative texts. We analyze a fundamental problem: how to choose automatic methods (in particular, classifiers and feature types) that achieve the highest accuracy in the author profiling task, when the short plain text itself is the only evidence available for determining the author's meta-information. The analysis of related research helped us to select methods that demonstrated encouraging results on other languages and to apply them to the Lithuanian dataset. Out of a number of experimentally investigated classifiers with lexical or symbolic features, the Naïve Bayes Multinomial method with character n-grams (of n = [1, 5]) as the feature type yielded the best performance, reaching an accuracy of 83.6%.</p>
      </abstract>
      <kwd-group>
        <kwd>gender detection</kwd>
        <kwd>author profiling</kwd>
        <kwd>non-normative Lithuanian language</kwd>
        <kwd>supervised machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Due to the constant growth in the volume of electronic texts,
natural language processing research has become especially relevant.
However, many of these texts are written anonymously or
pseudonymously; therefore forensic linguists,
administrators of Internet forums and supervisors of social
networks increasingly face such problems as
impersonation, bullying or harassment, disclosure of
confidential information, dissemination of disinformation, etc.
Although disclosing the identity of a particular person
is sometimes rather difficult, meta-information (i.e.,
demographic characteristics such as age and gender) may also
provide some clues: e.g., a system detects that a 50-year-old
man is impersonating a 12-year-old girl and prompts the
police to investigate in more detail or even to take decisive
action in finding a criminal.</p>
      <p>
        Researchers confirm that authors’ characteristics can
be determined by analyzing the style of a text. This is
possible due to the phenomenon of a human stylome
(an analogue of a genome), which leads each person to
formulate sentences and to express their thoughts in a
unique way [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Similarly, a number of studies show that this
phenomenon occurs not only in the style of individuals, but
also in the style of groups sharing the same demographic
characteristics (such as age, gender or social status) or
psychological state.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Copyright © 2017 held by the authors</title>
      <p>
        In general, authorship identification has a long history
dating back to 1887 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but the Internet era has made it even more popular.
As a result, author profiling, which is
responsible for the automatic extraction of
meta-information about an author (e.g., age [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], gender [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
psychological status [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], etc.), is now an active research
area. Author profiling research is mainly focused on the
English language, whereas for Lithuanian it is a rather new
topic. Moreover, some author profiling sub-tasks (e.g.,
gender detection using non-normative Lithuanian texts) have
never been solved before. Consequently, the aim of this paper
is to fill this gap: i.e., to explore the
methods on short non-normative Lithuanian texts (Facebook
posts, comments and messages) and to formulate
recommendations (about classifiers, their parameters and
feature types) for the automatic gender detection task.
      </p>
      <p>The ultimate goal of this research can be achieved by
performing the following intermediate tasks: (1) an analysis
of related work (see Section II), (2) the construction of a
representative corpus containing non-normative Lithuanian
texts (see Section III), (3) an analytical selection of the most
promising methods (see Section IV), (4) a precise
experimental evaluation of the selected methods (see Section V),
and (5) conclusions (recommendations) for gender detection
using short non-normative Lithuanian texts, together with our
further research plans (see Section VI).</p>
      <p>II. RELATED WORK</p>
      <p>
        All existing author profiling approaches can be grouped
according to the following criteria: the percentage of training
instances in the dataset, the amount of information they
provide (i.e., recognition-training feedback) and the nature
of knowledge. According to these criteria the approaches are
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: Rule-based, Unsupervised Machine Learning, Supervised
Machine Learning and Similarity-Based.
      </p>
      <p>The obsolete rule-based approaches use rules manually
constructed by humans. The development process itself is very
laborious and requires linguistic expertise. Moreover, the created
rules are tied to the specific problem being solved and are therefore
hardly transferable to new domains.</p>
      <p>Unsupervised machine learning (i.e., clustering) is
selected when no meta-information is provided. The text
samples are grouped according to the similarity between them.
The main drawback of these methods is that their grouping does
not necessarily correspond to the grouping a human would expect.
Mostly due to their very low accuracy, these methods are not
among the most popular choices in author profiling tasks.</p>
      <p>If texts are supplied with the necessary meta-information
about a certain author characteristic (the so-called class),
supervised machine learning is one of the two best choices. The
stylistic, lexical or symbolic text characteristics (extracted
from the training instances) are provided as input to a
classifier. It generalizes all input information and produces a
model as output. This model can afterwards be used
for the author profiling of unseen texts. The main drawback of
all supervised machine learning methods is that they require a
comprehensive and representative dataset to create an
exhaustive and robust model. An advantage is that the method
can be flexibly adjusted to new tasks or domains: after adding
new text samples the classifier can be easily retrained.</p>
      <p>The similarity-based approaches are very similar in nature to
supervised machine learning. The only difference is that
instead of creating a model they memorize and store all
training instances and use similarity measures to determine
which of the available classes an incoming instance is most
similar to. An advantage of similarity-based methods is that they
store the entire training set; therefore no information is lost
through generalization. Since supervised machine
learning and similarity-based approaches are the most
accurate, they are the most popular for various author
profiling tasks. This important observation narrows down our
research area to these approaches only.</p>
      <p>
        The research done on various languages usually involves
the investigation of these popular approaches for supervised
machine learning (e.g., Naïve Bayes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Naïve Bayes
Multinomial [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Support Vector Machines [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) and
similarity-based classification (e.g., k-Nearest Neighbor), or comparative
experiments demonstrating the superiority of Naïve Bayes
Multinomial and Support Vector Machines (as in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]).
      </p>
      <p>When investigating the Lithuanian non-normative texts we
considered the recommendations formulated for other
languages. However, the language factor itself is also very
important and therefore must be taken into account as well. The
Lithuanian language (which we deal with in this research) is
rich in vocabulary and morphology, has a rich word
derivation system and a relatively free word order in a
sentence. Although the Lithuanian language is rather
complicated, some of the previously mentioned language
characteristics do not necessarily complicate the problem being solved,
i.e., it might turn out that our investigated groups of individuals
are bound to very different sentence structures or
vocabulary.</p>
      <p>
        In fact, the gender detection task for the Lithuanian
language is not entirely new: it has been solved using
supervised machine learning methods [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, these
authors used rather long normative texts (~217 tokens each
on average), whereas the non-normative Lithuanian
language (which is the object of research in this paper) is
notably different: it is full of out-of-vocabulary words, jargon,
foreign language insertions and neologisms. Moreover,
non-normative Lithuanian faces the important problem of
omitted diacritics (where ą, č, ę, ė, į, š, ų, ū, ž are often
replaced with the corresponding ASCII equivalents). Hence, in
this research we plan to check how much the accuracy
is affected by the shortness of the texts and the type of language.
      </p>
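      <p>The diacritics problem mentioned above can be illustrated with a minimal sketch. It only shows the kind of letter replacement that non-normative writers perform; the helper name strip_diacritics and the sample phrase are our own illustrative assumptions and are not part of the described experiments.</p>
      <preformat>
```python
# Illustrative mapping (an assumption): each Lithuanian diacritic letter
# is paired with the ASCII letter that usually replaces it in informal writing.
DIACRITIC_MAP = str.maketrans("ąčęėįšųūžĄČĘĖĮŠŲŪŽ", "aceeisuuzACEEISUUZ")


def strip_diacritics(text: str) -> str:
    """Replace every Lithuanian diacritic letter with its ASCII counterpart."""
    return text.translate(DIACRITIC_MAP)


print(strip_diacritics("ačiū už žinutę"))  # prints "aciu uz zinute"
```
      </preformat>
      <p>One consequence is that the same word may occur in the corpus in several spellings, which is one reason why sub-word features may be more robust than whole tokens on such texts.</p>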
    </sec>
    <sec id="sec-3">
      <title>III. CORPUS</title>
      <p>The gender detection task was solved using a specifically
prepared corpus of non-normative Lithuanian texts.
The corpus was composed of original posts (without any
third-party text) harvested from the Facebook
social network in October 2016. It contains posts, comments
and messages of 70 persons (for statistics see Figure 1); 32
and 38 texts belong to women and men, respectively (see
Figure 2). The youngest participant is 18 years old and the oldest
is 77; the mean age of respondents is ~33.8. 43 and 27 people
indicated that their level of education is higher and secondary,
respectively. 33 and 37 individuals stated that they are married
and unmarried, respectively.</p>
      <p>The corpus consists of 2,729 tokens in total<sup>1</sup> (of which
1,433 are written by men and 1,296 by women; see Figure
3). The shortest text (without symbols and emoticons) is only
4 tokens long, the longest is 161, and the average text length
is ~39 tokens.</p>
      <p>Fig. 1. Corpus statistics (chart legend: Posts, Comments, Messages).</p>
      <p>Fig. 2. A percentage of texts in our corpus written by men and women.</p>
      <p>Fig. 3. Words of women (47%) and words of men (53%).</p>
      <p><sup>1</sup> It is important to note that instead of words we focus on tokens in this
work. Besides regular words, tokens also include out-of-vocabulary words,
numbers, and non-normative “words” with embedded digits or punctuation
marks.</p>
      <p>proper selection of the classifier and the proper selection of
the feature type.</p>
    </sec>
    <sec id="sec-4">
      <p>To come up with the very best, we investigated classifiers of the following types:</p>
      <p>
        Supervised machine learning. A representative of this type
is the Support Vector Machine (SVM) method
(introduced by Cortes C. and Vapnik V. in 1995 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). It is a discriminative case-based approach, currently
considered the most popular text classification
technique. The method copes effectively with a huge
number of features and with sparse feature vectors, and does not
perform aggressive feature selection, which may result
in the loss of valuable information and accuracy [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Other representatives are Naïve Bayes (NB) and its
modification Naïve Bayes Multinomial (NBM)
(introduced by Lewis D. D. and Gale W. A. in 1994 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]).
These techniques are generative profile-based approaches,
often chosen for their simplicity. The NB assumption
of feature independence allows each parameter to
be learned separately; these methods work especially well
when the number of features of equal significance is
high; they are fast and do not require large data storage
resources.
      </p>
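      <p>To make the idea behind the multinomial Naïve Bayes classifier more concrete, the following minimal sketch trains it on character n-grams with add-one smoothing. It is not the WEKA implementation used in our experiments; the class name MultinomialNB, the helper char_ngrams, the default range n = [1, 3] and the toy strings with labels "A" and "B" are illustrative assumptions only.</p>
      <preformat>
```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams with n in [n_min, n_max]."""
    chars = text.replace(" ", "_")  # "_" stands for the whitespace
    return [chars[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(chars) - n + 1)]


class MultinomialNB:
    """Toy multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # number of texts per class
        self.gram_counts = defaultdict(Counter)  # n-gram counts per class
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text))
        self.vocab = set()
        for counts in self.gram_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus independent per-feature log likelihoods.
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, not taken from the corpus.
toy_texts = ["aaa aab", "aab aaa", "zzz zzy", "zzy zzz"]
toy_labels = ["A", "A", "B", "B"]
model = MultinomialNB().fit(toy_texts, toy_labels)
print(model.predict("aaa"), model.predict("zzz"))  # prints "A B"
```
      </preformat>
      <p>Because of the independence assumption, each n-gram contributes its own term to the class score, so training reduces to counting, which is why the method is fast and needs little storage.</p>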
    </sec>
    <sec id="sec-6">
      <p>Moreover, Bayesian methods often play a baseline role in the evaluation.</p>
      <p>
        Similarity-based. A representative of this type is the IBK
method (introduced by Aha D. and Kibler D. in 1991 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). This k-nearest neighbors classifier chooses the
appropriate k value based on cross-validation, after
evaluating the distance between a testing instance and all
samples in the training set. Another representative is the
Kstar method (introduced by Cleary J. G. and Trigg L. E.
in 1995 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]). In contrast to IBK, Kstar calculates
not a distance measure but a similarity function. It differs
from the other approaches of this type because it uses an
entropy-based distance function. These two last-mentioned
methods store all available instances and therefore
avoid information loss during training.
      </p>
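      <p>The similarity-based principle (store all training instances and compare an incoming instance with each of them) can be sketched as follows. Note that this sketch uses cosine similarity over character bi-gram counts, which is our own simplification and differs from both the IBK distance and the entropy-based Kstar function; the names profile, cosine and knn_predict and the toy data are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def profile(text, n=2):
    """Count the character n-grams of a text ("_" stands for the whitespace)."""
    chars = text.replace(" ", "_")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_predict(train, text, k=3):
    """Majority vote among the k stored instances most similar to the text."""
    query = profile(text)
    ranked = sorted(train,
                    key=lambda item: cosine(profile(item[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: nothing is generalized, every instance is kept.
train = [("aaaa", "A"), ("aaab", "A"), ("aaba", "A"),
         ("zzzz", "B"), ("zzzy", "B"), ("zzyz", "B")]
print(knn_predict(train, "aaa"))  # prints "A"
```
      </preformat>
      <p>No model is built here: all the work happens at prediction time, which is the trade-off of keeping the entire training set.</p>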
      <p>The second direction involved the proper selection of the
feature type. In our experiments we investigated:</p>
      <p>Lexical features, in particular token n-grams: unigrams
(single tokens) and tetra-grams
(sequences of 4 tokens in a window sliding one
token at a time). For instance, the phrase
“gender detection from the Lithuanian texts”
yields 6 unigrams (“gender”,
“detection”, “from”, “the”, “Lithuanian”, “texts”)
and 3 tetra-grams (“gender detection from the”,
“detection from the Lithuanian”, “from the
Lithuanian texts”).</p>
      <p>
        Character features, in particular character
n-grams. Similarly to token n-grams, they are sequences of
items, but instead of tokens they contain
characters. For instance, the phrase “gender
detection” yields the following
4-grams: “gend”, “ende”, “nder”, “der_”, “er_d”,
“r_de”, etc. (where “_” denotes the whitespace). It
is important to mention that the value of n does not
necessarily have to be fixed: i.e., ranges are also
possible. With a range of, e.g., n = [2, 4], the generated
features are bi-grams (n=2), plus trigrams (n=3),
plus tetra-grams (n=4).
      </p>
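      <p>The two feature types can be sketched in a few lines. The function names token_ngrams and char_ngrams are illustrative assumptions (in our experiments the features were extracted with WEKA), and whitespace tokenization is a simplification.</p>
      <preformat>
```python
def token_ngrams(text: str, n: int) -> list[str]:
    """Slide a window of n tokens over the text, one token at a time."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n_min: int, n_max: int) -> list[str]:
    """Return character n-grams for every n in the range [n_min, n_max]."""
    chars = text.replace(" ", "_")  # "_" denotes the whitespace
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(chars[i:i + n] for i in range(len(chars) - n + 1))
    return grams


phrase = "gender detection from the Lithuanian texts"
print(len(token_ngrams(phrase, 1)))  # prints 6
print(token_ngrams(phrase, 4))
print(char_ngrams("gender detection", 4, 4)[:6])
```
      </preformat>
      <p>With a range such as char_ngrams(text, 2, 4), the bi-grams, trigrams and tetra-grams are simply concatenated into one feature list.</p>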
    </sec>
    <sec id="sec-9">
      <title>V. EXPERIMENTS AND RESULTS</title>
      <p>
        Our experiments were carried out on the corpus described
in Section III using the methods and features described in
Section IV. We used the implementations of the methods incorporated
into the WEKA 3.8 machine learning toolkit<sup>2</sup>. WEKA [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
allowed both the extraction of features and the selection of the
classifier.
      </p>
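      <p>In all our experiments we used 10-fold cross-validation and
evaluated accuracy (1) and f-score (2). The results are
considered acceptable and reasonable if the accuracy is above
the random (3) and majority (4) baselines, equal to 0.502 and
0.540, respectively.</p>
      <disp-formula><label>(1)</label><tex-math>\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}</tex-math></disp-formula>
      <disp-formula><label>(2)</label><tex-math>\text{f-score} = \frac{2\,tp}{2\,tp + fp + fn}</tex-math></disp-formula>
      <p>Here tp (true positives) and tn (true negatives) denote the numbers of
correctly classified instances: c_i classified as c_i, and any other c_j
classified as c_j, respectively; fp (false positives) and fn (false
negatives) denote the numbers of incorrectly classified instances: any
other c_j classified as c_i, and c_i classified as any other c_j, respectively.</p>
      <disp-formula><label>(3)</label><tex-math>\mathrm{random\ baseline} = \sum_{j} P(c_j)^{2}</tex-math></disp-formula>
      <disp-formula><label>(4)</label><tex-math>\mathrm{majority\ baseline} = \max_{j} P(c_j)</tex-math></disp-formula>
      <p>Here P(c_j) denotes the proportion of instances belonging to class c_j.</p>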
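      <p>As a worked illustration of the evaluation measures, the sketch below assumes the usual definitions: accuracy as the share of correctly classified instances, f-score as the harmonic mean of precision and recall, the random baseline as the sum of squared class proportions and the majority baseline as the largest class proportion. The function names are hypothetical, and the class counts 32 and 38 are the numbers of texts by women and men given in Section III.</p>
      <preformat>
```python
def accuracy(tp, tn, fp, fn):
    """Share of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


def random_baseline(class_counts):
    """Sum of squared class proportions."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


def majority_baseline(class_counts):
    """Proportion of the largest class."""
    return max(class_counts) / sum(class_counts)


counts = [32, 38]  # texts written by women and by men
print(round(random_baseline(counts), 3))    # prints 0.504
print(round(majority_baseline(counts), 3))  # prints 0.543
```
      </preformat>
      <p>The values computed from the text counts are close to, but not identical with, the reported baselines of 0.502 and 0.540, so the sketch should be read as an approximation of how such baselines are obtained.</p>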
      <p>
        Our preliminary experiments involved the selection of the
most accurate classification technique when using token
unigrams (n=1), token tetra-grams (n=4) and character
tetra-grams (n=4) (the results are presented in Figure 4). The best
results were achieved with SVM and NBM and character
tetra-grams<sup>3</sup>. These methods also demonstrated the best
performance in gender detection tasks on the morphologically
complex Arabic language [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p><sup>2</sup> Download from: http://www.cs.waikato.ac.nz/ml/weka/downloading.html</p>
      <p><sup>3</sup> Since the f-score values demonstrate the same trend as the
accuracies, we do not present them in the following figures.</p>
      <p>Subsequently we used only SVM and NBM (because they
demonstrated the best performance in the preliminary
experiments) and tuned the parameter n of the character n-grams
(see the results in Figure 5).</p>
      <p>
        The overall best results (an accuracy of 0.836)
on the short non-normative Lithuanian texts for the gender
detection task were achieved with NBM and character
n-grams of n = [1, 5] as the feature type (see Figure 6); therefore
we would recommend them for other similar tasks and
languages.
      </p>
      <p>
        By the way, the best previously reported results
for the Lithuanian language in the gender detection task were
achieved with SVM and lemma bi-grams as the feature
type [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. This is not surprising, bearing in mind that
morphological tools (which handle normative texts) were
maximally helpful there. Moreover, the second-best feature type was
also based on character n-grams. Although our best method
achieved a slightly higher accuracy (by 0.089) than
previously reported, a direct comparison is hardly possible
due to the very different experimental conditions (datasets and
their sizes, language types, text lengths, etc.).
      </p>
      <p>
        The gender detection task has been solved for a fairly large group
of languages, reaching ~80% and ~56.53% accuracy on
normative English in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], respectively; 64.73% on the
Spanish blogs in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and ~82.6% on the Greek blogs [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. On
the non-normative tweet texts the obtained accuracies are
surprisingly high reaching, e.g., ~98% on Arabic in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and
~99% on English in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. As we can see, the reported results,
especially for the English language, vary widely
(~56.53% in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and even ~99% in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]). Because of the very
different experimental conditions (different datasets,
methods and language types), these results are hardly
comparable with each other. They are also hardly comparable with the results
obtained in our research.
      </p>
      <p>VI. CONCLUSION AND FUTURE WORK</p>
      <p>In this paper we report the first gender detection results
using short (avg. ~39 tokens) non-normative Lithuanian
texts taken from the Facebook social network.
During our research we investigated the most popular
supervised machine learning (Naïve Bayes, Naïve Bayes
Multinomial, Support Vector Machine) and similarity-based
(IBK, Kstar) techniques plus various lexical and character
feature types.</p>
      <p>
        The best results, reaching an accuracy of 83.6%, were achieved
with the Naïve Bayes Multinomial method and character
n-grams (of n = [1, 5]) as features.
      </p>
      <p>Since the majority of the research on the Lithuanian
language focuses on normative texts, in future
research we plan to pay special attention to non-normative
texts by enlarging the datasets and tackling other
author profiling tasks, such as age detection, social status detection,
etc.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Van Halteren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Baayen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tweedie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haverkort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neijt</surname>
          </string-name>
          ,
          <article-title>"New Machine Learning Methods Demonstrate the Existence of a Human Stylome,"</article-title>
          <source>Journal of Quantitative Linguistics</source>
          , vol.
          <volume>12</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>65</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Mendenhall</surname>
          </string-name>
          ,
          <article-title>"The Characteristic Curves of Composition,"</article-title>
          pp.
          <fpage>37</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>1887</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          ,
          <article-title>"Effects of Age and Gender on Blogging,"</article-title>
          <source>Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs</source>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>197</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Shimoni</surname>
          </string-name>
          ,
          <article-title>"Automatically Categorizing Written Texts by Author Gender,"</article-title>
          <source>Literary and Linguistic Computing</source>
          , vol.
          <volume>17</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhawle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          ,
          <article-title>"Lexical Predictors of Personality Type,"</article-title>
          <source>Proceedings of Classification Society of North America, St. Louis MI</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <article-title>"A Survey of Modern Authorship Attribution Methods,"</article-title>
          <source>Journal of the American Society for Information Science and Technology</source>
          , Wiley, pp.
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Meina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Brodzinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Celmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Czokow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pezacki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wilk</surname>
          </string-name>
          ,
          <article-title>"Ensemble-based classification for author profiling using various features,"</article-title>
          <source>Notebook for PAN at CLEF</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. Raghunadha</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Vishnu</given-names>
            <surname>Vardhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. Vijayapal</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>"Profile specific Document Weighted approach using a New Term Weighting Measure for Author Profiling,"</article-title>
          <source>International Journal of Intelligent Engineering &amp; Systems</source>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          ,
          <article-title>"Overview of the 3rd Author Profiling Task at PAN 2015,"</article-title>
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>AlSukhni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Alequr</surname>
          </string-name>
          ,
          <article-title>"Investigating the Use of Machine Learning Algorithms in Detecting Gender of the Arabic Tweet Author,"</article-title>
          <source>International Journal of Advanced Computer Science and Applications</source>
          , pp.
          <fpage>319</fpage>
          -
          <lpage>328</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapočiūtė-Dzikienė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Šarkutė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Utka</surname>
          </string-name>
          ,
          <article-title>"Automatic author profiling of Lithuanian parliamentary speeches: exploring the influence of features and dataset sizes,"</article-title>
          <source>Human language technologies - the Baltic perspective: proceedings of the 6th inter</source>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>"Support-Vector Networks,"</article-title>
          pp.
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>"Text Categorization with Support Vector Machines: Learning with Many Relevant Features,"</article-title>
          pp.
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <article-title>"A Sequential Algorithm for Training Text Classifiers,"</article-title>
          <source>17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval</source>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Aha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kibler</surname>
          </string-name>
          ,
          <article-title>"Instance-based learning algorithms,"</article-title>
          <source>Machine Learning</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Cleary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Trigg</surname>
          </string-name>
          ,
          <article-title>"K*: An Instance-based Learner Using an Entropic Distance Measure,"</article-title>
          <source>12th International Conference on Machine Learning</source>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>114</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>(2016) WEKA</article-title>
          . [Online]. http://www.cs.waikato.ac.nz/ml/weka/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapočiūtė-Dzikienė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Utka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Šarkutė</surname>
          </string-name>
          ,
          <article-title>"Authorship Attribution and Author Profiling of Lithuanian Literary Texts,"</article-title>
          <source>Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing</source>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Santosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>"Author Profiling: Predicting Age and Gender from Blogs,"</article-title>
          <source>Notebook for PAN at CLEF</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Mikros</surname>
          </string-name>
          ,
          <article-title>"Authorship Attribution and Gender Identification in Greek Blogs,"</article-title>
          <source>Methods and Applications of Quantitative Linguistics</source>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dickinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>"Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features,"</article-title>
          <source>International Journal of Intelligence Science</source>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>148</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>