<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Subword-based Deep Averaging Networks for Author Profiling in Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Franco-Salvador</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliia Plotnikova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Neha Pawar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yassine Benajiba</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Symanto Research</institution>
          ,
          <addr-line>Nuremberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>Author profiling aims at identifying the authors' traits on the basis of their sociolect aspect, that is, how language is shared by them. This work describes the system submitted by Symanto Research for the PAN 2017 Author Profiling Shared Task. The current edition is focused on language variety and gender identification on Twitter. We address these tasks by exploiting the morphology and semantics of the words. For that purpose, we generate embeddings of the authors' text based on subword character n-grams. These representations are classified using deep averaging networks. Experimental results show competitive performance in the evaluated author profiling tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Author profiling aims at identifying the authors’ traits on the basis of their sociolect
aspect, that is, how language is shared by them. It is used to determine language
variety, gender, age, and personality type, among others. This task is especially attractive
to industry and particularly helpful for segmenting the authors of opinions in
social media. For instance, identifying the geographical distribution and gender of
opinion authors may help to improve marketing campaigns. The task is also important for
digital text forensics: given a threat, knowing the likely traits of its author may help to
identify them.</p>
      <p>The Uncovering Plagiarism, Authorship, and Social Software Misuse<sup>1</sup> (PAN)
evaluation lab at the Conference and Labs of the Evaluation Forum<sup>2</sup> (CLEF) promotes
research and innovation in digital text forensics. Its Author Profiling Shared Task sets the
objective of classifying authors’ traits in several subtasks. These have included the
identification of age, cross-genre age, personality traits, and gender in social media. The current
edition<sup>3</sup> focuses on language variety and gender identification on Twitter.
        <fn><label>1</label><p>http://pan.webis.de/</p></fn>
        <fn><label>2</label><p>http://www.clef-initiative.eu/</p></fn>
        <fn><label>3</label><p>http://pan.webis.de/clef17/pan17-web/author-profiling.html</p></fn>
      </p>
      <p>
        Both morphological [
        <xref ref-type="bibr" rid="ref1 ref6">1,6</xref>
        ] and semantic [
        <xref ref-type="bibr" rid="ref2 ref7">2,7</xref>
        ] features have proven to be highly
discriminant in author profiling. Building on this research, we exploit word
morphology and semantics to identify the authors’ language variety and gender. We present
an approach based on word embeddings, which in turn are generated using
subword information, i.e., by means of character n-gram embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We classify the
author traits using deep averaging networks, a recent technique which magnifies the
most discriminant dimensions contained within an embedding average. This has been
demonstrated to be a fast and competitive approach in several text classification tasks
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], rivalling the performance of recurrent or convolutional neural networks.
      </p>
      <p>The rest of the work is structured as follows: in Section 2 we provide an overview of
the state of the art in author profiling. In Section 3 we describe the system we employed
for the PAN 2017 Author Profiling Shared Task. Next, in Section 4 we conduct our
evaluation and discussion of the results. Finally, we draw our conclusions in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Authorship attribution [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the task of identifying authors’ stylistic discriminators, set
the stage for the author profiling task. The use of stylistic features such as character and
part-of-speech (PoS) n-grams, as well as spelling and grammatical errors, made it possible to
identify authors’ native language [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Similarly, [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] identified age and gender in blogs
using stylistic and content word features. The popularity of author profiling motivated
the organization of several workshops and shared tasks.
      </p>
      <p>
        The Native Language Identification Shared Task [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] asked participants to
classify English essays written by speakers of eleven native languages. The Shared Task on
Discriminating between Similar Languages (DSL) set the objective of classifying texts
representing several sets of closely related languages and language varieties [
        <xref ref-type="bibr" rid="ref17 ref29 ref30">29,30,17</xref>
        ].
Since 2013, the PAN evaluation lab has organized the Author Profiling Shared Task. The
first two editions focused on age and gender identification [
        <xref ref-type="bibr" rid="ref21 ref22">22,21</xref>
        ]. In addition to these
two tasks, personality traits recognition was included in 2015 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Finally, the focus of
the 2016 edition was cross-genre age and gender identification [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ][
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        This year, the PAN author profiling track is focused on the tasks of language variety
and gender identification. Regarding the latter, most of the recent work on gender
identification originated in the PAN evaluation lab. The winning system of the 2013-2015
editions is based on a representation for documents which captures discriminative and
subprofile-specific information [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Similar to the early work on the subject, the best
performing system in 2016 employed content words, emoticons, and stylistic features
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The language variety identification task has attracted much interest in the last few
years. Character n-grams and other features have been employed to identify varieties
of Portuguese in news texts [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], Arabic in blogs and forums [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and Spanish in
tweets [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Word embeddings were used to classify varieties of Spanish from blogs
and journalistic texts [
        <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
        ]. Also in the Spanish blogs domain, [
        <xref ref-type="bibr" rid="ref20">20</xref>
          ] employed a low-dimensional
model based on text statistics. The best performing system of DSL 2015
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] used an ensemble of models based on word and character n-grams.
      </p>
      <p>Unlike the majority of author profiling researchers, who employ stylistic and
lexical features, our approach is based on character n-gram word embeddings, which exploit
the morphology and semantics of words. This choice has also been driven by our
motivation to experiment with a pipeline that could be replicated fairly simply by researchers
who want to compare results and by practitioners in need of a simple, yet accurate, pipeline
to perform author profiling.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Proposed Approach</title>
      <p>In this section we describe the system we designed for language variety and gender
identification on Twitter. First, in Section 3.1 we describe our data preprocessing. Next,
in Section 3.2 the embedding representations are described. Finally, in Section 3.3 we
detail our classifier.</p>
      <sec id="sec-3-1">
        <title>3.1 Preprocessing</title>
        <p>We preprocess each text by tokenizing it, lowercasing the words, and removing URLs. We
use the Tweet NLP<sup>4</sup> tokenizer, which was designed for English tweets. We slightly modified
its regular expressions to handle Arabic, Portuguese, and Spanish punctuation, e.g. ’¿’
and ’¡’ were included for Spanish.</p>
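        <p>A minimal sketch of these three preprocessing steps, assuming simple regular expressions rather than the modified Tweet NLP tokenizer used in this work. The patterns URL_RE and TOKEN_RE and the function preprocess are hypothetical names introduced only for illustration.</p>
        <preformat>
```python
import re

# Hypothetical URL pattern; the exact expression used in this work is not specified.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A run of word characters (Unicode-aware), or any single non-space symbol,
# so that marks such as the Spanish inverted punctuation become separate tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text):
    """Remove URLs, lowercase the text, and split it into tokens."""
    text = URL_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())
```
        </preformat>
        <p>Under these assumptions, preprocess("¡Hola! https://t.co/x") yields the tokens ¡, hola, and !. A tweet-specific tokenizer would additionally handle emoticons, hashtags, and mentions, which this sketch splits naively.</p>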
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Subword Character n-gram Embeddings</title>
        <p>
          In recent years, word embeddings have replaced the bag-of-words (BOW) representation as
the standard for text feature extraction.<sup>5</sup> These representations are low-dimensional
(d-dimensional) real-valued vectors which capture semantic and syntactic aspects of text. The
continuous skip-gram model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] of the word2vec toolkit is the preferred alternative to generate
the embeddings.
        </p>
        <p>We should note the importance of morphology in author profiling. For instance, the
derivation of words is a discriminant feature in English language variety identification,
e.g. regularized vs. regularised. As an additional example, morphological inflection
is indicative of gender in Romance languages, e.g. profesor vs. profesora in Spanish (the
masculine and feminine forms of the word for professor, respectively).</p>
        <p>
          In this work we use a recent variant of the continuous skip-gram model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] which
generates word embeddings exploiting the words’ morphology by means of character
n-gram embeddings. In addition to better capturing the morphological nuances
mentioned above, a character-based embedding model also helps to create
classification models that are robust to typos and abbreviations, which are common
in social media data.
        </p>
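        <p>When it comes to learning these embeddings, the main difference of this subword model
lies in the scoring function used to estimate the probability of observing a context word
<inline-formula><tex-math>w_c</tex-math></inline-formula> given a target word <inline-formula><tex-math>w_t</tex-math></inline-formula>. The original model used the scalar product of the word
vectors as its score: <inline-formula><tex-math>s(w_t, w_c) = \mathbf{u}_{w_t}^{\top} \mathbf{v}_{w_c}</tex-math></inline-formula>, where <inline-formula><tex-math>\mathbf{u}_{w_t}</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{v}_{w_c}</tex-math></inline-formula> are vectors in <inline-formula><tex-math>\mathbb{R}^d</tex-math></inline-formula>. The
subword model instead uses a scoring function which represents the target word as the
sum of its character n-gram vectors:
          <fn><label>4</label><p>http://www.cs.cmu.edu/~ark/TweetNLP/</p></fn>
          <fn><label>5</label><p>We note the increasing number of papers published at the ACL conference with "word
embeddings" or "distributed representations" as part of the title: 0 (2013), 3 (2014), 15 (2015), and
29 (2016).</p></fn>
        </p>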
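        <p>
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math>s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c</tex-math>
          </disp-formula>
where <inline-formula><tex-math>\mathcal{G}_w \subset \{1, \ldots, G\}</tex-math></inline-formula> is the set of n-grams of the word <inline-formula><tex-math>w</tex-math></inline-formula>, and <inline-formula><tex-math>\mathbf{z}_g</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{v}_c</tex-math></inline-formula> are vectors in <inline-formula><tex-math>\mathbb{R}^d</tex-math></inline-formula>. Key
to the model’s design is the use of a hashing function to map n-grams to integers that
serve as vector indices. This makes the model memory efficient and provides an
additional benefit: it does not produce out-of-vocabulary words. The embedding of an
unknown word is created by extracting its n-grams and averaging the vectors
at the indices returned by the hash function. For more details about the model please
refer to its original work.</p>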
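        <p>The scoring function in Eq. (1) can be sketched as follows. This is an illustration under stated assumptions, not the FastText implementation: the n-grams are extracted without the word-boundary markers that FastText adds, the hash is an FNV-1a style function chosen for illustration, and the toy table has 50 buckets of 4 dimensions instead of the 2M buckets and 300 dimensions used in this work. All function names are hypothetical.</p>
        <preformat>
```python
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the word with n_min to n_max characters."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


def bucket_index(ngram, num_buckets):
    """Map an n-gram to a vector index with an FNV-1a style hash (an assumption)."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32
    return h % num_buckets


def word_vector(word, table):
    """Sum the bucket vectors z_g over all n-grams g of the word."""
    total = [0.0] * len(table[0])
    for g in char_ngrams(word):
        z = table[bucket_index(g, len(table))]
        total = [t + x for t, x in zip(total, z)]
    return total


def score(word, context_vector, table):
    """s(w, c): dot product of the summed n-gram vectors with the context vector v_c."""
    return sum(a * b for a, b in zip(word_vector(word, table), context_vector))


# Toy n-gram table with random entries, for illustration only.
rng = random.Random(0)
table = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(50)]
```
        </preformat>
        <p>Because every n-gram is hashed into a fixed number of buckets, a word never seen during training, such as a misspelled form, still receives a vector built from the buckets of its n-grams.</p>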
        <p>We generate a word embedding inventory for the training partition (see Section 4.1)
of each language using the FastText library.<sup>6</sup> We use 300-dimensional vectors, context
windows of size 10, 20 negative samples per training example, 15 epochs, and 2M hashed
character n-gram vectors. We extract n-grams with lengths from 3 to 6 characters. We post-process
and enrich the embeddings with a proprietary model © Symanto Research.</p>
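        <p>As an illustration, the hyperparameters listed above map onto the options of the fastText command-line tool roughly as shown below. This is a sketch: the paper does not state how the library was invoked, the helper name fasttext_skipgram_command and the file paths are hypothetical, and the proprietary post-processing step is not covered.</p>
        <preformat>
```python
def fasttext_skipgram_command(corpus_path, output_prefix):
    """Build the argument list for training subword skip-gram vectors with fastText."""
    return [
        "fasttext", "skipgram",
        "-input", corpus_path,      # one preprocessed text per line
        "-output", output_prefix,   # fastText writes the .bin and .vec files here
        "-dim", "300",              # embedding dimension
        "-ws", "10",                # context window size
        "-neg", "20",               # negative samples
        "-epoch", "15",             # training epochs
        "-bucket", "2000000",       # hashed character n-gram vectors
        "-minn", "3",               # shortest character n-gram
        "-maxn", "6",               # longest character n-gram
    ]
```
        </preformat>
        <p>The resulting list can be passed to subprocess.run, e.g. with the hypothetical paths "train.es.txt" and "vectors.es", provided that the fasttext binary is installed and on the search path.</p>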
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Deep Averaging Networks</title>
        <p>
          A standard method to obtain vector representations of text consists of computing the
average of the word embeddings [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This embedding composition method obtained
good results in language variety identification [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. However, the longer the text, the
more abstract the resulting embedding is.
        </p>
        <p>
          In this work we classify using Deep Averaging Networks (DAN) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. As
illustrated in Figure 1, this model receives as input the word embeddings of the text. First,
a composition layer averages those embeddings. The model then
uses one or more non-linear hidden layers to transform the computed average. Finally,
a softmax layer is used for prediction. The rationale behind DAN is that the non-linear
transformations applied to the average make it possible to magnify and capture subtle variations
more precisely. As reported in the original paper, this approach can outperform
syntactically informed approaches despite its simplicity.
        </p>
        <p>
          Our hidden layers have the same size as the embeddings and use rectified linear
units (ReLU) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] as the activation function. We use the cross-entropy loss function. The
number of hidden layers is determined in Section 4.2. We optimize the neural network
weights with Adam [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for 100 epochs with a learning rate of 0.001, otherwise using the parameters
indicated in its original work. We should note that our word embeddings are static, so
we do not allow the model to modify them.
        </p>
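        <p>The forward pass of such a network can be sketched in a few lines. The example below is an illustration in plain Python with toy, hand-picked weights; the function names are hypothetical, and training (cross-entropy loss optimized with Adam) is omitted.</p>
        <preformat>
```python
import math


def average(vectors):
    """Composition layer: element-wise mean of the word embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dense(x, weights, bias):
    """Affine map W x + b, with the weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, xi) for xi in x]


def softmax(x):
    """Turn scores into class probabilities; the maximum is subtracted for stability."""
    shift = max(x)
    exps = [math.exp(xi - shift) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(word_vectors, hidden_layers, output_layer):
    """Average the embeddings, apply the ReLU hidden layers, then the softmax layer."""
    h = average(word_vectors)
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    weights, bias = output_layer
    return softmax(dense(h, weights, bias))


# Toy example: two 2-dimensional word vectors, one hidden layer, two classes.
words = [[1.0, 2.0], [3.0, 0.0]]
hidden = [([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])]
output = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
probs = dan_forward(words, hidden, output)
```
        </preformat>
        <p>Since the word embeddings are kept static, only the weights and biases of the hidden and output layers would be updated during training in this setting.</p>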
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation</title>
      <p>In this section we evaluate our approach in the PAN 2017 Author Profiling Shared Task.
        <fn><label>6</label><p>https://github.com/facebookresearch/fastText</p></fn>
      </p>
      <p><bold>Dataset</bold> The objective of the PAN 2017 author profiling shared task is to identify the
language variety and gender of Twitter users. Its corpus contains four languages and
nineteen language varieties:</p>
      <list list-type="bullet">
        <list-item><p>Arabic (Egypt, Gulf, Levantine, and Maghrebi).</p></list-item>
        <list-item><p>English (Australia, Canada, Great Britain, Ireland, New Zealand, and United States).</p></list-item>
        <list-item><p>Portuguese (Brazil and Portugal).</p></list-item>
        <list-item><p>Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, and Venezuela).</p></list-item>
      </list>
      <p>Some key remarks about the dataset follow. The language of each user is
known, so the dataset is composed of four partitions. In Table 1 we show the statistics.
The labels are balanced at language variety and gender level. Finally, each Twitter user
is represented by a set of approximately 100 tweets.<sup>7</sup>
        <fn><label>7</label><p>Each tweet contains up to 140 characters.</p></fn>
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th/><th>Arabic</th><th>English</th></tr>
          </thead>
          <tbody>
            <tr><td>Training users</td><td>2,400</td><td>3,600</td></tr>
            <tr><td>Test users</td><td>800</td><td>1,200</td></tr>
            <tr><td>Language varieties</td><td>4</td><td>6</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        In this work, we concatenate the tweets of each user into a single instance. We explored
other alternatives, such as classifying the tweets independently and then summing
the class probabilities [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, with this dataset, we obtained better results by
concatenating the tweets.
      </p>
      <p><bold>Methodology</bold> We compare our results with those obtained by a random baseline, a
BOW model classified with random forest, a model based on continuous skip-gram
embedding averages classified with logistic regression, and a model based on subword
embedding (see Section 3.2) averages classified with logistic regression. In the rest of
the evaluation we refer to these models as Random, BOW, skip-gram emb., and
subword emb., respectively. The prototype of our model (henceforth simply referred to as
DAN) was designed using 10-fold cross-validation over the training sets. The parameter
selection uses the same setting. The official measure of the competition is accuracy.
The ranking of the shared task participants is computed as follows: i) for each language,
the PAN organizers calculate individual accuracies for gender and variety identification;
ii) they calculate the joint accuracy, i.e., the accuracy when both variety and gender are
correctly predicted together; and iii) the final ranking is obtained by averaging the accuracy values obtained
per language.</p>
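      <p>The three ranking steps can be sketched as follows. This is an illustration only, not the organizers' evaluation code: the function names and the (variety, gender) pair representation are hypothetical, and the last step averages each of the three measures over languages without assuming which of them determines the final ranking.</p>
      <preformat>
```python
def accuracy(gold, predicted):
    """Fraction of users whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def language_scores(gold, predicted):
    """Score one language; both arguments hold one (variety, gender) pair per user."""
    return {
        "variety": accuracy([g[0] for g in gold], [p[0] for p in predicted]),
        "gender": accuracy([g[1] for g in gold], [p[1] for p in predicted]),
        # Joint accuracy: a user counts only if both labels are correct.
        "joint": accuracy(gold, predicted),
    }


def average_over_languages(per_language):
    """Average each measure over the languages (step iii)."""
    n = len(per_language)
    return {measure: sum(scores[measure] for scores in per_language.values()) / n
            for measure in ("variety", "gender", "joint")}
```
      </preformat>
      <p>By construction, the joint accuracy of a language can never exceed its individual variety or gender accuracy, since a user contributes to it only when both predictions are correct.</p>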
      <sec id="sec-4-1">
        <title>4.2 Parameter Selection</title>
        <p>We noticed during our experimentation phase that the performance of DAN is very
sensitive to the number of hidden layers, whose optimal value depends on the task and dataset.
In Figure 2 we show the accuracy depending on the number of hidden layers, task,
and language. As can be seen, both tasks benefit from adding hidden layers after the
composition (averaging) layer. The best performance for language variety identification is achieved
using two layers. In contrast, the optimal number of hidden layers for gender
identification differs depending on the language. We use the best parameters determined in this
section for the rest of the evaluation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3 Results and Discussion</title>
        <p>In this section we compare and discuss the results of our system. In Table 2 we show
the development experiments and the comparison with the baseline models (see
Section 4.1) using 10-fold cross-validation over the training set. As we can see, the three
embedding-based models outperform BOW, the only purely lexical approach. The
continuous skip-gram embedding averages classified with logistic regression obtain better
results than the subword embedding averages in tasks such as language variety
identification in Arabic or gender in Portuguese. However, the latter model achieves, on average,
better results than the skip-gram one. Finally, DAN, using the same subword
embeddings, obtains the highest results, which indicates that deep averaging networks are useful
in author profiling to magnify the most discriminant values contained in an embedding
average.</p>
        <p>In Table 3 we show the results using the official test set of the shared task. This table
also includes the joint accuracy, i.e., the accuracy when both variety and gender are correctly predicted
together, which the organizers use to determine the best system. As we can
see, DAN’s results are in line with those obtained using the 10-fold cross-validation
setting. We also observe how the joint accuracy falls compared to the isolated language
variety and gender results. This illustrates the difficulty of this joint classification task,
which remains an open problem.</p>
        <p>Finally, we analyse how the difficulty of this shared task varies
with the task and language. Identifying gender is clearly more difficult than identifying
language variety. Although gender identification has only two possible labels, gender differences
are generally more subtle and require more context and topic understanding. In contrast,
the peculiarities of language varieties can be distinguished using both lexical and semantic
aspects of text. These lexical and semantic aspects are also the cause of the differences
across languages. English and Arabic varieties are more similar at the lexical level
than Portuguese or Spanish ones. However, the low number of Portuguese varieties
employed in this work also plays a role. Finally, considering the high number of Spanish
varieties and the high results obtained for them, we also consider that some languages have tweets with topics
more indicative of the variety, e.g. topics about politics or events.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>In this work we presented the system designed by Symanto Research for the PAN 2017
author profiling shared task. The pipeline we present in this paper is easily replicable
and yields a good performance while promising to be robust and flexible in the presence
of noisy data.</p>
      <p>We described an approach based on subword character n-gram embeddings and deep
averaging networks. We explained the rationale behind using these components in
author profiling. We compared our approach with several well-known baseline models.</p>
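      <p>Experimental results in the tasks of language variety and gender identification show that
our approach outperforms these baselines and is a competitive alternative.</p>
      <p>Future work will investigate further how to employ semantic representations and
deep learning techniques in the task of author profiling.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Task</th><th>Model</th><th>Accuracy (%)</th></tr>
          </thead>
          <tbody>
            <tr><td>Language variety</td><td>Random</td><td>25.0</td></tr>
            <tr><td>Language variety</td><td>BOW</td><td>71.2</td></tr>
            <tr><td>Language variety</td><td>Skip-gram emb.</td><td>73.0</td></tr>
            <tr><td>Language variety</td><td>Subword emb.</td><td>70.7</td></tr>
            <tr><td>Language variety</td><td>DAN</td><td>80.6</td></tr>
            <tr><td>Gender</td><td>Random</td><td>50.0</td></tr>
            <tr><td>Gender</td><td>BOW</td><td>66.4</td></tr>
            <tr><td>Gender</td><td>Skip-gram emb.</td><td>71.2</td></tr>
            <tr><td>Gender</td><td>Subword emb.</td><td>73.7</td></tr>
            <tr><td>Gender</td><td>DAN</td><td>74.5</td></tr>
          </tbody>
        </table>
      </table-wrap>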
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
          </string-name>
          , J.:
          <article-title>Automatically profiling the author of an anonymous text</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          -
          <lpage>123</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bayot</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Author Profiling using SVMs and Word Embedding Averages - Notebook for PAN at CLEF 2016</article-title>
          . In:
          <string-name><surname>Balog</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Cappellato</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Ferro</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Macdonald</surname>, <given-names>C.</given-names></string-name>
          (eds.)
          <source>CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 5-8 September, Évora, Portugal. CEUR-WS.org (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Busger op Vollenbroek</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Carlotto</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Kreutz</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Medvedeva</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Pool</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Bjerva</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Haagsma</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Nissim</surname>, <given-names>M.</given-names></string-name>:
          <article-title>GronUP: Groningen User Profiling - Notebook for PAN at CLEF 2016</article-title>
          . In:
          <string-name><surname>Balog</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Cappellato</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Ferro</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Macdonald</surname>, <given-names>C.</given-names></string-name>
          (eds.)
          <source>CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 5-8 September, Évora, Portugal. CEUR-WS.org (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuksa</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research 12(Aug)</source>
          ,
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Estival</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaustad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TAT: an author profiling tool with application to Arabic emails</article-title>
          .
          <source>In: Proceedings of the Australasian Language Technology Workshop</source>
          . pp.
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taulé</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martí</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Language variety identification using distributed representations of words and documents</article-title>
          .
          In:
          <source>Proceedings of the 6th International Conference of CLEF on Experimental IR meets Multilinguality, Multimodality, and Interaction (CLEF 2015)</source>
          . LNCS, vol.
          <volume>9283</volume>
          . Springer-Verlag (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and documents for discriminating similar languages</article-title>
          .
          <source>In: Proceeding of the RANLP Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial)</source>
          . Hissar, Bulgaria
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hahnloser</surname>
            ,
            <given-names>R.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarpeshkar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahowald</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douglas</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seung</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          :
          <article-title>Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit</article-title>
          .
          <source>Nature</source>
          <volume>405</volume>
          (
          <issue>6789</issue>
          ),
          <fpage>947</fpage>
          -
          <lpage>951</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjunatha</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd-Graber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daumé III</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Deep unordered composition rivals syntactic methods for text classification</article-title>
          .
          <source>In: Association for Computational Linguistics</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Exploiting stylistic idiosyncrasies for authorship attribution</article-title>
          .
          <source>In: Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis</source>
          . vol.
          <volume>69</volume>
          , p.
          <fpage>72</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zigdon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Automatically determining an anonymous author's native language</article-title>
          .
          <source>In: Intelligence and Security Informatics</source>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>217</lpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>y Gómez</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Discriminative subprofile-specific representations for author profiling in social media</article-title>
          .
          <source>Knowledge-Based Systems</source>
          <volume>89</volume>
          ,
          <fpage>134</fpage>
          -
          <lpage>147</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Maier</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Rodríguez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Language variety identification in Spanish tweets</article-title>
          .
          <source>In: Proceedings of the EMNLP'2014 Workshop on Language Technology for Closely Related Languages and Language Variants</source>
          . pp.
          <fpage>25</fpage>
          -
          <lpage>35</lpage>
          . Doha, Qatar (October
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Language identification using classifier ensembles</article-title>
          .
          <source>In: Proceeding of the RANLP Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial)</source>
          . Hissar, Bulgaria
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task</article-title>
          .
          <source>In: Proceedings of the 3rd Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (VarDial)</source>
          . Osaka, Japan
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Proceedings of the Annual Neural Information Processing (NIPS'13) Conference - Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>San Juan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 8-11 September, Toulouse, France. CEUR-WS.org (Sep
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In: Proceedings of the 17th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2016)</source>
          . Springer-Verlag (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trenkmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 2nd Author Profiling Task at PAN 2014</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halvey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kraaij</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 15-18 September, Sheffield, UK. CEUR-WS.org (Sep
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inches</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Profiling Task at PAN 2013</article-title>
          . In:
          <string-name>
            <surname>Forner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navigli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tufis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 23-26 September, Valencia, Spain (Sep
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          : In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (eds.)
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2016 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Sadat</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazemi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farzindar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatic identification of Arabic language varieties and dialects in social media</article-title>
          .
          <source>In: Proceeding of the 1st International Workshop on Social Media Retrieval and Analysis (SoMeRa)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Effects of age and gender on blogging</article-title>
          .
          <source>In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs</source>
          . vol.
          <volume>6</volume>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A report on the first native language identification shared task</article-title>
          .
          <source>In: Proceedings of the eighth workshop on innovative use of NLP for building educational applications</source>
          . pp.
          <fpage>48</fpage>
          -
          <lpage>57</lpage>
          . Citeseer
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gebre</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          :
          <article-title>Automatic identification of language varieties: The case of Portuguese</article-title>
          .
          <source>In: KONVENS2012-The 11th Conference on Natural Language Processing</source>
          . pp.
          <fpage>233</fpage>
          -
          <lpage>237</lpage>
          .
          <publisher-name>Österreichischen Gesellschaft für Artificial Intelligence (ÖGAI)</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A report on the DSL Shared Task 2014</article-title>
          .
          <source>In: Proceedings of the COLING First Joint Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial)</source>
          . pp.
          <fpage>58</fpage>
          -
          <lpage>67</lpage>
          . Dublin, Ireland (August
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the DSL Shared Task 2015</article-title>
          .
          <source>In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial)</source>
          . Hissar, Bulgaria
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>