<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gender and language-variety identification with MicroTC</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eric S. Tellez</string-name>
          <email>eric.tellez@infotec.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabino Miranda-Jiménez</string-name>
          <email>sabino.miranda@infotec.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Graff</string-name>
          <email>mario.graff@infotec.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Moctezuma</string-name>
          <email>dmoctezuma@centrogeo.edu.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CONACyT-CentroGEO Centro de Investigación en Geografía y Geomática “Ing. Jorge L. Tamayo” A.C.</institution>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CONACyT-INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación</institution>
          ,
          <country country="MX">México</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this notebook, we describe our approach to the Author Profiling task at PAN17, which consists of both gender and language-variety identification for Twitter users. We used our MicroTC (μTC) framework as the primary tool to create our classifiers. μTC follows a simple approach to text classification: it converts the problem of text classification into a model selection problem using several simple text transformations, a combination of tokenizers, and a term-weighting scheme, and it finally classifies using a Support Vector Machine. Our approach reaches accuracies of 0.7838, 0.8054, 0.7957, and 0.8538 for gender identification, and of 0.8275, 0.9004, 0.9554, and 0.9850 for language variety, for Arabic, English, Spanish, and Portuguese, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, forensic text analysis concerning originality, authorship, and reliability has
attracted a lot of attention from researchers and practitioners because of its practical
applications in security and marketing [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. In this context, author profiling is
an important task of the PAN@CLEF forum; it focuses on inferring characteristics
of an author (profiling aspects), such as gender, age, political preferences,
personality, and language variety, from the author’s written text [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Generally speaking, the author profiling task is tackled mainly with
machine-learning approaches, i.e., models for predicting profiling aspects are built
from a set of general features that represent different categories of authors,
e.g., gender, age range, and language variety, among others [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>The PAN 2017 forum (http://pan.webis.de/clef17/pan17-web/author-profiling.html) provides a dataset of tweets for training and testing the
performance of each participating system. In this edition, the profiling aspects to
be analyzed are the gender and language variety of Twitter users. The corpus is annotated
with each author’s gender and the particular variety of their mother tongue; the languages
included are Arabic, English, Spanish, and Portuguese.</p>
      <p>Our approach is language independent; that is, we deliberately avoid the
use of linguistic procedures such as part-of-speech tagging, lemmatization, or
stemming. In the same way, linguistic resources, such as lexicons and
WordNet-based resources, are not used. Instead, we take advantage of multiple tokenizers, an
entropy-based term-weighting scheme, and an SVM classifier; see Section 3 for
details.</p>
      <p>The rest of the paper is organized as follows. Section 2 presents some
related work on gender, age, language, and region identification, and Section 3
describes our system and the general approach to model the problem. Section 4
details the experimental methodology and the achieved results. Finally,
conclusions and future work are given in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Author profiling has been a recurring and important task at PAN since 2013
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Before the 2017 edition, only age and gender classification tasks were considered
[
        <xref ref-type="bibr" rid="ref17 ref18">17,18</xref>
        ]. This year, PAN considers the region aspect and removes the age
identification subtask from the competition [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Several works have been proposed to solve age and gender identification
subtasks. Agrawal &amp; Gonçalves [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] use a combination of classifiers along with a
model based on users’ activities to predict the profile of unknown users. The
TFIDF representation was employed, and dimensionality reduction was performed
on the resulting matrix. The authors use Naive Bayes and linear SVM classifiers.
      </p>
      <p>
        To find the differences between the writing styles of males
and females in different age groups, the use of several stylometric features
is considered in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Another stylometric approach was presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] where
two groups of features were considered: trigrams and complementary-weighted
Second Order Attributes. An SVM classifier is used in the classification step. A
combination of features based on word n-grams, sentences starting with capital
letters, sentences ending with a period, emoticons, word length, and sentence
length, along with grammatical aspects, is explored in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        Lopez-Monroy et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] propose a representation for documents that captures
discriminative and subprofile-specific information of terms. Under the proposed
representation, terms are represented in a vector space that captures
discriminative information. On the other hand, more traditional representations, like
TFIDF, are broadly employed in the author profiling literature; this is the
case of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Classification ensembles are also frequently used; for
instance, [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] generate several classifiers using sets of features such as word
n-gram, character n-gram, and part-of-speech n-gram features.
      </p>
      <p>
        Language variety identification is a new subtask introduced in PAN17 that
consists in determining the specific variation of the native language of authors’
text [
        <xref ref-type="bibr" rid="ref10 ref11">11,10</xref>
        ]. Another approach to region classification is presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where
Twitter geolocation and regional classification were conducted through sparse
coding and dictionary learning. Another region prediction approach, based on
Modified Adsorption, removing “celebrity” nodes, and analyzing a graph model
propagation, is proposed in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>System description</title>
      <p>
        MicroTC (μTC) is a generic framework for text classification tasks, i.e., it works
regardless of both domain and language particularities. μTC is an extension of
our previous work on sentiment analysis, see [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. A full description of μTC can
be found in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The core idea behind μTC is to tackle a text classification task
by selecting an appropriate configuration from a set of different text
transformation techniques, tokenizers, and weighting schemes, using a
Support Vector Machine (SVM) with a linear kernel as the classifier. In some sense, the text
classification problem is transformed into a hyper-parameter optimization problem, also known
as model selection.
      </p>
      <sec id="sec-3-1">
        <title>About μTC</title>
        <p>
          Briefly, μTC contains the following parts: i) a list of functions that normalize
and transform the input text into the input of the tokenizers (preprocessing); ii) a set
of tokenizer functions that transform the filtered text into a multiset of tokens;
iii) a function that generates weighted vectors from the multiset of tokens; and
finally, iv) a classifier that assigns a label to a given vector.
i. Preprocessing functions. We use trivalent and binary parameters. A
trivalent parameter takes a value in {remove, group, none}, which means that the
term matching the parameter is removed, grouped into a set of predefined classes,
or left untouched. For this kind of parameter, μTC contains handlers for
hashtags, numbers, URLs, users, and emoticons. The binary parameters are
Boolean and indicate whether the parameter is activated. In this
parameter set, we support diacritic removal, character duplication
removal, punctuation removal, and case normalization.
ii. Tokenizers. After all text normalization and transformation, a list of tokens
is extracted. We allow n-grams of words (n = 1, 2, 3), q-grams
of characters (q = 1, 3, 5, 7, 9), and skip-grams. For skip-grams, we allow
a few tokenizers: two words with a gap of one, (2, 1), as well as
(2, 2) and (3, 1). Instead of selecting a single tokenizer scheme, we
allow any combination of the available tokenizers and take the
union of the resulting multisets of tokens.
iii. Weighting schemes. After obtaining a multiset (bag of tokens) from
the tokenizers, we must create a vector space. μTC allows the use of the
raw frequency and the TFIDF scheme to weight the coordinates of the
vector. It also contains a number of frequency filters that were deactivated for this
contribution, see [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] for more details.
iv. Classifier. We use a singleton set populated with an SVM with a
linear kernel. It is well known that SVMs perform well for inputs of very high
dimension (which is our case), and the linear kernel also performs
well under these conditions. We do not optimize the parameters of the classifier
since we are mainly interested in the rest of the process. We use the SVM
classifier from liblinear, Fan et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
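        <p>The preprocessing options and tokenizers listed above can be sketched in a few lines of Python. This is a minimal illustration and not the μTC implementation: the function names, the placeholder tokens (_url, _usr, _num), the regular expressions, the rule that collapses repeated characters to a single one, and the "q:" prefix that keeps character q-grams apart from word tokens are all assumptions made for this example.</p>
        <preformat><![CDATA[
```python
import re
import unicodedata
from collections import Counter

# Hypothetical placeholders used by the "group" option (assumed names).
GROUP_TOKENS = {"url": "_url", "user": "_usr", "num": "_num"}
PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "user": re.compile(r"@\w+"),
    "num": re.compile(r"\d+"),
}


def preprocess(text, handlers, del_diacritics=True, del_dup=True, lowercase=True):
    """handlers maps 'url', 'user', 'num' to 'remove', 'group', or 'none'."""
    for name, pattern in PATTERNS.items():
        option = handlers.get(name, "none")
        if option == "remove":
            text = pattern.sub("", text)
        elif option == "group":
            text = pattern.sub(GROUP_TOKENS[name], text)
    if lowercase:
        text = text.lower()
    if del_diacritics:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if del_dup:
        # Collapse runs of the same character into one (assumed rule).
        text = re.sub(r"(.)\1+", r"\1", text)
    return " ".join(text.split())


def word_ngrams(words, n):
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_qgrams(text, q):
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def skip_grams(words, size, gap):
    """`size` words separated by `gap` skipped words, e.g. (2, 1)."""
    step = gap + 1
    span = (size - 1) * step
    return [" ".join(words[i:i + span + 1:step]) for i in range(len(words) - span)]


def tokenize(text, word_ns=(1,), char_qs=(3,), skips=((2, 1),)):
    """Combine the selected tokenizers into a single multiset of tokens."""
    words = text.split()
    tokens = Counter()
    for n in word_ns:
        tokens.update(word_ngrams(words, n))
    for q in char_qs:
        tokens.update("q:" + g for g in char_qgrams(text, q))
    for size, gap in skips:
        tokens.update(skip_grams(words, size, gap))
    return tokens


clean = preprocess("Holaaa @juan mira http://t.co/x 123 Camión",
                   {"url": "group", "user": "group", "num": "group"})
bag = tokenize(clean, word_ns=(1, 2), char_qs=(3,), skips=((2, 1),))
```
]]></preformat>
        <p>Each choice of handlers, flags, and tokenizer combination in this sketch corresponds to one configuration of the kind that the model selection step would explore.</p>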
      </sec>
      <sec id="sec-3-2">
        <title>Modeling users</title>
        <p>We model a user using all of their tweets; that is, a user <italic>u</italic> is a collection of
small texts <italic>u</italic> = {<italic>t</italic><sub>1</sub>, …, <italic>t</italic><sub><italic>n</italic></sub>}. For each text, we apply the preprocessing step and
tokenizers; then we create a multiset from the union of all multisets in <italic>u</italic>. After
this, a vector representing <italic>u</italic> is created using a term-weighting scheme. Thus, we model
each user as a high-dimensional sparse vector. Since we do not
remove any kind of terms, and in fact we promote the usage of combinations of
tokenizers, the users’ vectors can contain millions of coordinates and thousands of
non-zero entries.</p>
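        <p>As a small illustration of this user model, the sketch below merges the token multisets of all tweets of one user. It assumes that the union of multisets means adding token counts, and it uses a stand-in tokenizer (lowercased word unigrams); both choices, as well as the function names, are assumptions of this example rather than details taken from μTC.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def simple_tokenizer(text):
    # Stand-in tokenizer (lowercased word unigrams); purely illustrative.
    return Counter(text.lower().split())


def user_multiset(tweets, tokenizer):
    """Merge the token multisets of all tweets written by one user."""
    bag = Counter()
    for tweet in tweets:
        bag.update(tokenizer(tweet))  # Counter.update adds the counts
    return bag


user = ["Good morning", "good night", "morning run"]
bag = user_multiset(user, simple_tokenizer)
```
]]></preformat>
        <p>The resulting multiset is the sparse representation to which a term-weighting scheme is then applied.</p>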
        <p>The weighting schemes used in the experimental section are described in detail in the following
paragraphs. Among them is entropy+b, a new weighting scheme introduced in
this notebook and designed for classification tasks.</p>
        <p>The simplest scheme corresponds to freq, and it is defined as the
frequency of each term per user; we name it freq<sub>usr</sub> to avoid confusion with other
functions. TFIDF is the product of TF and IDF, where TF is the normalized
frequency of a user’s term, and IDF is the inverse document frequency, defined as
the logarithm of the inverse of the probability that a term occurs in the whole
collection of users; more precisely,</p>
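        <p>
          <disp-formula><tex-math><![CDATA[\mathrm{TF}(w, usr) = \frac{\mathrm{freq}_{usr}(w)}{\max_{w \in usr} \{\mathrm{freq}_{usr}(w)\}},]]></tex-math></disp-formula>
          and
          <disp-formula><tex-math><![CDATA[\mathrm{IDF}(w) = \log \frac{N}{|\{usr \mid \mathrm{freq}_{usr}(w) > 0\}|},]]></tex-math></disp-formula>
where N is the size of the training collection, i.e., the number of users. It is
common to add 1 to the denominator expression to avoid numerical problems.</p>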
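        <p>The TF and IDF definitions above can be written as a short Python sketch. It assumes natural logarithms, non-empty user multisets, and no additive correction in the IDF denominator; the function names and the toy data are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def tf(user_bag):
    """Term frequency normalized by the user's most frequent term."""
    max_freq = max(user_bag.values())
    return {w: f / max_freq for w, f in user_bag.items()}


def idf(user_bags):
    """log(N / number of users whose multiset contains the term)."""
    n_users = len(user_bags)
    doc_freq = Counter()
    for bag in user_bags:
        doc_freq.update(bag.keys())  # each user counts once per term
    return {w: math.log(n_users / df) for w, df in doc_freq.items()}


def tfidf(user_bag, idf_weights):
    """Product of TF and IDF; unseen terms get weight 0."""
    return {w: t * idf_weights.get(w, 0.0) for w, t in tf(user_bag).items()}


# Toy collection with two users (made-up counts).
users = [Counter({"hola": 4, "mundo": 2}), Counter({"hola": 1, "adios": 3})]
weights = idf(users)
vec = tfidf(users[0], weights)
```
]]></preformat>
        <p>In this toy collection, the term used by both users receives an IDF of zero, which shows how TFIDF discards terms that appear in every user regardless of how they are distributed over the classes.</p>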
        <p>In this notebook, we introduce the entropy+b term-weighting scheme, which considers
that each term is represented by a distribution over the available classes. Instead
of using the raw probabilities per class, we weight each term with the entropy+b
function, defined as follows:
          <disp-formula><tex-math><![CDATA[\mathrm{entropy}_b(w) = \log |C| - \sum_{c \in C} p_c(w, b) \log \frac{1}{p_c(w, b)},]]></tex-math></disp-formula>
where C is the set of classes, and p<sub>c</sub>(w, b) is the probability of term w in class c
parametrized with b. More precisely,
          <disp-formula><tex-math><![CDATA[p_c(w, b) = \frac{\mathrm{freq}_c(w)}{b \cdot |C| + \sum_{c \in C} \mathrm{freq}_c(w)}.]]></tex-math></disp-formula>
Here, freq<sub>c</sub> denotes the frequency of the given term in the class c. The idea
behind entropy<sub>b</sub>(w) is to weight each term using the entropy of the underlying
distribution in such a way that terms with large entropy values (terms uniformly distributed
across all classes) receive a low weight, while terms skewed towards some class receive a weight
close to log |C|. The parameter b is introduced to absorb the possible noise that
occurs in sparsely populated terms.</p>
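        <p>The following Python sketch illustrates the entropy+b weight of a single term. It reads the definition as log |C| minus the sum over classes of p<sub>c</sub>(w, b) log(1 / p<sub>c</sub>(w, b)), with p<sub>c</sub>(w, b) equal to freq<sub>c</sub>(w) divided by b |C| plus the total frequency of the term. Natural logarithms, the convention that a zero probability contributes nothing to the sum, and the function names are assumptions of this example.</p>
        <preformat><![CDATA[
```python
import math


def class_probabilities(freq_by_class, b):
    """p_c(w, b) = freq_c(w) / (b * |C| + sum over classes of freq_c(w))."""
    n_classes = len(freq_by_class)
    denom = b * n_classes + sum(freq_by_class.values())
    if denom == 0:
        return {c: 0.0 for c in freq_by_class}
    return {c: f / denom for c, f in freq_by_class.items()}


def entropy_b(freq_by_class, b):
    """log|C| minus the entropy-like sum; terms with p = 0 are skipped."""
    probs = class_probabilities(freq_by_class, b)
    h = sum(p * math.log(1.0 / p) for p in probs.values() if p > 0)
    return math.log(len(freq_by_class)) - h


# Made-up per-class frequencies of three terms in a two-class problem.
skewed = entropy_b({"female": 10, "male": 0}, b=0)    # close to log 2
uniform = entropy_b({"female": 5, "male": 5}, b=0)    # close to 0
rare = entropy_b({"female": 1, "male": 0}, b=3)
common = entropy_b({"female": 100, "male": 0}, b=3)
```
]]></preformat>
        <p>With b = 3, the term seen only once receives a lower weight than the term seen one hundred times in the same class, which is the damping effect on sparsely populated terms that the parameter b is meant to provide.</p>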
      </sec>
      <sec id="sec-3-3">
        <title>About the model selection</title>
        <p>
          The model selection is led by a performance function, score, that is maximized
by a meta-heuristic. The only assumption is that score varies slowly across
similar configurations, so that we can assume some degree of local
concavity, in the sense that a local maximum can be reached using greedy decisions
at some given point. Clearly, this is not true in general, and the solver algorithm
should be robust enough to get a good approximation even when the
assumption holds only to some degree. From a practical point of view,
a configuration is similar to another if they differ structurally in a single parameter.
We name the set of all configurations similar to a configuration m its neighborhood.
Therefore, the core idea is to start from a set of random configurations, evaluate their
neighborhoods, and greedily move to the most promising set of configurations.
The procedure is repeated until some condition is met, such as the
impossibility of improving the score function or reaching a maximum number of iterations.
There are several meta-heuristics to solve combinatorial optimization
problems; a proper survey of the area is beyond the scope of this notebook;
however, the interested reader is referred to [
          <xref ref-type="bibr" rid="ref2 ref20 ref21 ref6">20,21,6,2</xref>
          ].
        </p>
        <p>
          In particular, μTC uses two types of meta-heuristics, Random Search [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and
Hill Climbing [
          <xref ref-type="bibr" rid="ref2 ref6">6,2</xref>
          ] algorithms. The former consists of randomly sampling the configuration space C and
selecting the best configuration in that sample. Given a pivoting
configuration, the main idea behind Hill Climbing is to explore the configuration’s
neighborhood and greedily move to the best neighbor. The process is repeated
until no improvement is possible. We improve the whole optimization process by
applying a Hill Climbing procedure to the best configuration found by a Random
Search. We also add memory to avoid evaluating a configuration twice.
        </p>
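        <p>The combination of Random Search, Hill Climbing, and a memory of evaluated configurations can be sketched as follows. This is an illustrative toy and not the μTC solver: the reduced configuration space, the parameter names, and the toy score function (a stand-in for a cross-validated performance measure) are all hypothetical.</p>
        <preformat><![CDATA[
```python
import random

# Hypothetical, heavily reduced configuration space (illustrative names only).
SPACE = {
    "lowercase": [True, False],
    "url": ["remove", "group", "none"],
    "word_n": [1, 2, 3],
    "char_q": [1, 3, 5, 7, 9],
}


def random_config(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}


def neighborhood(config):
    """All configurations that differ from `config` in exactly one parameter."""
    for key, values in SPACE.items():
        for value in values:
            if value != config[key]:
                yield {**config, key: value}


def search(score, n_random=32, seed=0):
    rng = random.Random(seed)
    memory = {}  # evaluated configurations, so none is scored twice

    def evaluate(config):
        key = tuple(sorted(config.items()))
        if key not in memory:
            memory[key] = score(config)
        return memory[key]

    # Random search: keep the best of n_random sampled configurations.
    best = max((random_config(rng) for _ in range(n_random)), key=evaluate)
    # Hill climbing: move to the best neighbor until nothing improves.
    while True:
        candidate = max(neighborhood(best), key=evaluate)
        if evaluate(candidate) <= evaluate(best):
            return best, evaluate(best), len(memory)
        best = candidate


def toy_score(config):
    # Stand-in for a cross-validated score; NOT a real evaluation.
    return ((config["word_n"] == 2) + (config["char_q"] == 5)
            + (config["url"] == "group") + config["lowercase"])


best, best_score, n_evaluated = search(toy_score)
```
]]></preformat>
        <p>Because the toy score is separable, every non-optimal configuration has a better neighbor and the climb ends at the single best configuration; a real score function offers no such guarantee, which is why the random starting sample matters.</p>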
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and results</title>
      <p>The experiments with the training set were run on an Intel(R) Xeon(R) CPU
E5-2640 v3 @ 2.60GHz with 32 threads and 192 GiB of RAM running CentOS 7.1
Linux. The gold standard was evaluated on the TIRA platform using a virtual
machine with 4 GiB of RAM and one core. We implemented μTC in Python; it is available under the Apache 2 license at https://github.com/INGEOTEC/microTC.</p>
      <p>We partition the full training dataset into two smaller sets: a new training
set containing 30% of the users, and a validation set with the remaining 70%. The
partition was selected to ensure the generalization of our scheme. On the new
training set, from now on just the training set, we run μTC using random search
and hill climbing to perform the hyper-parameter optimization. Random search
was allowed to select 32 random configurations. Hill climbing then
starts with the best configuration found by random search; the procedure was
left to run until it finished its optimization process. The memory that prevents a configuration from being evaluated twice is, in principle, similar to Tabu search; however, our implementation is simpler
than a typical implementation of Tabu search. We use 3-fold cross-validation for the model
selection procedure. Once the model selection finished, we used the configuration
found to train a μTC classifier with the whole (small) training set and measured
the performance of that classifier on the validation set.</p>
      <p>[Table 1. Gender identification performance (macro-recall, macro-F1, and accuracy) of μTC-FREQ, μTC-TFIDF, and μTC-entropy+b for b = 0, 3, 10, 30, 100; the seven rows are listed once in each of four blocks. Only the row names survive in this copy.]</p>
      <p>Table 1 shows the performance of μTC for gender identification. In
particular, we show macro-recall, macro-F1, and accuracy scores. We show three
different term-weighting schemes, detailed in §3. We select the FREQ scheme as the
baseline against which the improvement of each scheme is described. The FREQ and TFIDF schemes are
implemented in μTC; for entropy+b, we show the performance for five different
values of b. Table 1 indicates that TFIDF performs poorly compared with
FREQ. Entropy+b shows a dependency on b, with better performance
for small values of b, except b = 0, which performs poorly for gender
identification. The table shows that b = 3 and b = 10 perform much better than the
rest of the classifiers. Between entropy+3 and entropy+10, the former performs
better; however, entropy+3 was evaluated after the deadline of the second run.
Therefore, entropy+10 was used to classify the gold standard, see Table 3.</p>
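      <p>For reference, the three reported scores can be computed as in the following sketch. It assumes the usual definitions (macro-recall and macro-F1 as unweighted averages of the per-class values, taken over the classes present in the true labels); the function name and the toy labels are hypothetical and unrelated to the reported results.</p>
      <preformat><![CDATA[
```python
def macro_scores(y_true, y_pred):
    """Return (macro-recall, macro-F1, accuracy) for two label sequences."""
    labels = sorted(set(y_true))
    pairs = list(zip(y_true, y_pred))
    recalls, f1s = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return sum(recalls) / len(labels), sum(f1s) / len(labels), accuracy


# Toy labels, made up for illustration.
y_true = ["f", "f", "m", "m"]
y_pred = ["f", "m", "m", "m"]
macro_recall, macro_f1, accuracy = macro_scores(y_true, y_pred)
```
]]></preformat>
      <p>Macro-averaged scores give every class the same importance, which matters for the language variety task, where a classifier could otherwise favor the most populated varieties.</p>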
      <p>[Table 2. Language variety identification performance of μTC-FREQ, μTC-TFIDF, and μTC-entropy+b for b = 0, 3, 10, 30, 100; the seven rows are listed once in each of four blocks. Only the row names survive in this copy.]</p>
      <p>Table 2 shows the performance of our systems in the language variety task. As
before, we use FREQ as the baseline method. In this task, FREQ also performs
better than TFIDF, except for English; both approaches are part of the μTC
tool. The entropy+b scheme is much better for almost all of the presented
values of b, even for b = 0. As in the gender identification task, smaller values
of b perform better than larger values, with the best performance achieved when
b = 3. Nonetheless, we used entropy+10 to classify the gold standard because
of the submission deadline.</p>
      <p>[Table 3. Rows μTC-FREQ and μTC-entropy+10, listed once in each of four blocks. Only the row names survive in this copy.]</p>
      <p>The official performances on the PAN17 gold standard are shown in Table 3.
We submitted our baseline based on the FREQ weighting scheme and the profiler
based on entropy+10. The table indicates the accuracy for the gender and variety
tasks, as well as the joint accuracy (the same example was correctly predicted in
both tasks). As anticipated by Tables 1 and 2, entropy+10 performs better
than FREQ, in some languages by a large margin, e.g., close to five percentage
points for Arabic and six percentage points for Portuguese.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>
        In this notebook, we described the INGEOTEC system used to solve the Author
Profiling task at PAN17. We used our MicroTC (μTC) framework [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] as the
primary tool to create our classifiers. μTC follows a simple approach to text
classification: it converts the problem of text classification into a model selection
problem using several simple text transformations, a combination of tokenizers,
a term-weighting scheme, and an SVM classifier. It is designed to tackle
text classification problems in an agnostic way, being both domain and language
independent.
      </p>
      <p>To effectively tackle the task, we introduced a new term-weighting scheme
based on the distributional representation of each term and the entropy over
that distribution. We call it entropy+b. More work is needed to characterize the
new weighting scheme, yet it proved to be superior to raw term frequency
and TFIDF, at least for the Author Profiling task and our μTC framework.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We would like to thank the PAN organizers, in particular Francisco Rangel and
Martin Potthast, for their kind and quick responses to our questions and requests.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Age and gender identification using stacking for classification</article-title>
          .
          <source>Notebook for PAN at CLEF 2016</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Battiti</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mascia</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Reactive search and intelligent optimization</article-title>
          , vol.
          <volume>45</volume>
          . Springer Science &amp; Business Media
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>Journal of Machine Learning Research 13(Feb)</source>
          ,
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bilan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhekova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Caps: A cross-genre author profiling system (</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bougiatiotis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Author profiling using complementary second order attributes and stylometric features (</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Burke</surname>
            ,
            <given-names>E.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kendall</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Search methodologies</article-title>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gwon</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kung</surname>
          </string-name>
          , H.:
          <article-title>Twitter geolocation and regional classification via sparse coding</article-title>
          .
          <source>In: ICWSM</source>
          . pp.
          <fpage>582</fpage>
          -
          <lpage>585</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dichiu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rancea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Using machine learning algorithms for author profiling in social media (</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Liblinear: A library for large linear classification</article-title>
          .
          <source>Journal of machine learning research 9(Aug)</source>
          ,
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
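          10.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In: Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016</source>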
          . Springer-Verlag (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taulé</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martí</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Language variety identification using distributed representations of words and documents</article-title>
          . In:
          <article-title>International Conference of the Cross-Language Evaluation Forum for European Languages</article-title>
          . pp.
          <fpage>28</fpage>
          -
          <lpage>40</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Discriminative subprofile-specific representations for author profiling in social media</article-title>
          .
          <source>Knowledge-Based Systems 89</source>
          ,
          <fpage>134</fpage>
          -
          <lpage>147</lpage>
          (
          <year>2015</year>
          ), http://www.sciencedirect.com/science/article/pii/S0950705115002427
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Adapting cross-genre author profiling to language and corpus</article-title>
          . In:
          <source>Proceedings of the CLEF</source>
          . pp.
          <fpage>947</fpage>
          -
          <lpage>955</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation</article-title>
          . In:
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.)
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17)</source>
          . Springer, Berlin Heidelberg New York (Sep
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rahimi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Twitter user geolocation using a unified text and network prediction model</article-title>
          .
          <source>arXiv preprint arXiv:1506.08259</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (eds.)
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd author profiling task at PAN 2015</article-title>
          . In: CLEF (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations</article-title>
          . In:
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN'16-New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation</article-title>
          . In:
          <string-name>
            <surname>Fuhr</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quaresma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larsen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.)
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 7th International Conference of the CLEF Initiative (CLEF 16)</source>
          . Springer, Berlin Heidelberg New York (Sep
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jiménez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suárez</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siordia</surname>
            ,
            <given-names>O.S.</given-names>
          </string-name>
          :
          <article-title>A simple approach to multilingual polarity classification in Twitter</article-title>
          .
          <source>Pattern Recognition Letters</source>
          (
          <year>2017</year>
          ), http://www.sciencedirect.com/science/article/pii/S0167865517301721
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jiménez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>An automated text categorization framework based on hyperparameter optimization</article-title>
          .
          <source>arXiv preprint arXiv:1704.01975</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Ucelay</surname>
            ,
            <given-names>M.J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funez</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cagnina</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Errecalde</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramírez-de-la-Rosa</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villatoro-Tello</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Profile-based approach for age and gender identification</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>op Vollenbroek</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlotto</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kreutz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medvedeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pool</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bjerva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haagsma</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>GronUP: Groningen user profiling</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zahid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sampath</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farnadi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Cross-genre age and gender identification in social media</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>