<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Including Dialects and Language Varieties in Author Profiling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alina Maria Ciobanu</string-name>
          <email>alina.ciobanu@my.fmi.unibuc.ro</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcos Zampieri</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shervin Malmasi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liviu P. Dinu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Harvard Medical School</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Cologne</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>This paper presents a computational approach to author profiling that takes gender and language variety into account. We apply an ensemble system that combines the output of multiple linear SVM classifiers trained on character and word n-grams. We evaluate the system on the dataset provided by the organizers of the 2017 PAN lab on author profiling. Our approach achieved 75% average accuracy on gender identification for tweets written in four languages and 97% accuracy on language variety identification for Portuguese.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        With vast amounts of text available on social media, author (or authorship) profiling
has become a popular research area in NLP. A number of characteristics such as age
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], gender [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and native language [
        <xref ref-type="bibr" rid="ref12 ref7">7,12</xref>
        ] can be predicted based on the topics and
the linguistic properties present in a person’s writings.
      </p>
      <p>
        The PAN labs<sup>1</sup> at CLEF have been providing a forum for scholars to evaluate
authorship profiling approaches on user-generated content. Author profiling tasks organized
in past PAN labs included age, gender, and personality trait prediction [
        <xref ref-type="bibr" rid="ref25 ref26">25,26</xref>
        ].
This year, for the first time, PAN includes language varieties and dialects from four
languages (Arabic, English, Portuguese, and Spanish) along with gender identification.<sup>2</sup>
      </p>
      <p>
        This paper describes computational methods for gender and language variety
identification on social media. Our approach builds on the experience acquired in the previous
gender identification tasks of the PAN labs and the four editions of the Discriminating
between Similar Languages (DSL)<sup>3</sup> shared task organized at the workshop on Similar
Languages, Varieties and Dialects (VarDial) [
        <xref ref-type="bibr" rid="ref18 ref34 ref36 ref37">36,37,18,34</xref>
        ]. The DSL shared tasks
included all languages<sup>4</sup> and most of the dialects and language varieties included in the
PAN lab 2017, thus establishing benchmarks for language variety and dialect
identification.
      </p>
      <p><sup>1</sup> http://pan.webis.de/</p>
      <p><sup>2</sup> In this paper we make a terminological distinction between (standard national) language
varieties and dialects. We consider English, Spanish, and Portuguese to be pluricentric languages,
each of them including its own standard national language varieties. The situation of Arabic
is, however, different, as Modern Standard Arabic (MSA) co-exists with several Arabic dialects
in a diglossic situation. Nevertheless, the challenges faced by systems trained to discriminate
between similar languages, language varieties, and dialects are identical.</p>
      <p><sup>3</sup> http://ttg.uni-saarland.de/vardial2017/sharedtask2017.html</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        The inclusion of language varieties at PAN is motivated by the growing interest in
dialect and language variety identification evidenced by several research papers and
the aforementioned DSL and ADI shared tasks. Examples of such studies include
Portuguese varieties [
        <xref ref-type="bibr" rid="ref33 ref35 ref4">33,35,4</xref>
        ], English varieties [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Romanian dialects [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Chinese
varieties [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], and a number of studies on Arabic dialect identification [
        <xref ref-type="bibr" rid="ref15 ref27 ref29 ref32">29,32,27,15</xref>
        ].
      </p>
      <p>
        The DSL and ADI shared task reports and their respective system description papers
provide valuable information about successful approaches for dialect, language variety,
and similar language identification. Successful approaches such as those by Goutte et
al. (2014) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Malmasi and Dras (2015) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Malmasi and Zampieri (2016) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and
Bestgen (2017) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] rely on combinations of higher-order character n-grams (n of 4 and
above), word n-grams, POS tags (in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), and multiple linear classifiers such as SVMs and
Naive Bayes, arranged in ensembles and/or trained in a two-stage approach in which
the language is identified first and individual classifiers are then trained to
discriminate between language varieties or dialects of the same language.<sup>5</sup> An exception is
the approach proposed by Ionescu and Butnaru (2017) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which achieved strong results
for Arabic dialect identification by relying on kernel learning.
      </p>
      <p>
        The main difference between the language variety sub-task at PAN and the DSL
and ADI shared tasks is the kind of data provided by the organizers. The PAN
challenge provides data collected from social media, whereas the data used in the DSL task
comes from newspapers [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and the ADI shared tasks used transcripts from broadcast
speeches along with audio features [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With respect to the data, the most similar task
to the PAN challenge is the 2014 TweetLID shared task [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] which included microblog
posts from the languages spoken in the Iberian Peninsula and English.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Methods</title>
      <sec id="sec-3-1">
        <title>3.1 Task and Data</title>
        <p>
          The organizers of the PAN challenge on author profiling provided participants with
a training set containing approximately 1,140,000 microblog posts from Twitter. Each post in the
training set was annotated with the user’s metadata, including the language, language
variety or dialect, and gender. A test set of unlabeled posts was released a month
later.
        </p>
        <p><sup>4</sup> Arabic dialect identification (ADI) was a sub-task of the DSL 2016 and an individual task in
the more comprehensive VarDial evaluation campaign 2017.</p>
        <p>
          <sup>5</sup> Goutte et al. (2016) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] provide a comprehensive evaluation of the first two editions of the
DSL shared task.
        </p>
        <p>The PAN 2017 dataset covers four languages and their respective varieties and dialects:
Arabic (four dialects), English (six varieties), Portuguese (two varieties, from Brazil and
Portugal), and Spanish (seven varieties).</p>
        <p>The training set was annotated in XML format. The following example shows the
metadata for a male English speaker from the United States:</p>
        <p>&lt;author id="author-id" lang="en" variety="united states" gender="male"&gt;</p>
        <p>With the data provided by the PAN 2017 organizers in hand, we trained SVM classifiers
to identify both the gender and the language variety or dialect of users. Participants
could take part in either or both sub-tasks, and we decided to participate in
both.</p>
        <p>
          Finally, it is worth noting that, unlike most NLP shared tasks, PAN requires
participants to run their scripts in a virtual machine provided by the organizers. This ensures
that all teams have the same computing power available and allows
full reproducibility [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].<sup>6</sup>
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Approach</title>
        <p>
          We use a single-label multi-class classification approach based on SVM ensembles,
following the methodology proposed by Malmasi and Dras [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Classification ensembles are systems that combine the results of multiple
classifiers, with the purpose of improving the overall performance. Ensembles have been
successfully used in various research areas, such as complex word identification [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
or grammatical error diagnosis [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. The individual classifiers can differ in various
regards, such as training data, features or classification methods.
        </p>
        <p>
          In our system, the classifiers differ in terms of features. We use character n-grams
(with n ∈ {1, …, 6}) and word n-grams (with n ∈ {1, 2}) and build one classifier for each
type of feature. Thus, our ensemble consists of eight individual classifiers. To combine
the classifiers, we employ a fusion method based on the probability estimates provided
by the individual classifiers: the predicted probabilities for each class are summed across classifiers, and
the prediction of the ensemble is the class with the highest sum. We use the SVM
implementation provided by scikit-learn [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], based on the LIBSVM library [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
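        <p>The feature setup and fusion rule described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (char_ngrams, word_ngrams, fuse_by_sum), the whitespace tokenization, the tie-breaking rule, and the probability values are all assumptions made for the example, and the SVM training step itself is omitted.</p>
        <preformat>
```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, using whitespace tokenization."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# The eight feature types named in the text: character 1- to 6-grams
# and word 1- and 2-grams. One classifier would be trained per entry.
FEATURE_TYPES = [("char", n) for n in range(1, 7)] + [("word", n) for n in (1, 2)]


def fuse_by_sum(per_classifier_probs):
    """Sum the class probabilities over classifiers and return the best class.

    per_classifier_probs: list of dicts mapping class label to probability,
    one dict per individual classifier.
    """
    totals = defaultdict(float)
    for probs in per_classifier_probs:
        for label, p in probs.items():
            totals[label] += p
    # Sorting the labels makes tie-breaking deterministic (an assumption).
    return max(sorted(totals), key=lambda label: totals[label])


# Hypothetical probability estimates from three of the eight classifiers.
example = [
    {"brazil": 0.6, "portugal": 0.4},
    {"brazil": 0.3, "portugal": 0.7},
    {"brazil": 0.8, "portugal": 0.2},
]
prediction = fuse_by_sum(example)
```
        </preformat>
        <p>In this toy example the summed scores are 1.7 for brazil and 1.3 for portugal, so the ensemble predicts brazil even though one classifier disagrees.</p>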
        <p>We train the ensembles separately for predicting gender and language variety.
For each classifier, we perform 3-fold cross-validation on the training dataset for hyperparameter tuning,
searching for the optimal value of C in {10<sup>-5</sup>, …, 10<sup>5</sup>}.</p>
        <p><sup>6</sup> The PAN labs use TIRA (http://www.tira.io/) for reproducibility.</p>
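        <p>The search over C with 3-fold cross-validation might look like the following sketch. It assumes that the grid consists of the eleven powers of ten from 10<sup>-5</sup> to 10<sup>5</sup> (the text gives only the endpoints) and that folds are contiguous and unshuffled. The names c_grid, k_fold_indices, and select_c are hypothetical, as is the score_fn callback, which stands in for training one SVM and measuring its validation accuracy.</p>
        <preformat>
```python
def c_grid(low_exp=-5, high_exp=5):
    """Candidate values of C as powers of ten between the two exponents."""
    return [10.0 ** k for k in range(low_exp, high_exp + 1)]


def k_fold_indices(n_samples, k=3):
    """Split range(n_samples) into k contiguous (train, validation) pairs."""
    folds = [list(range(i * n_samples // k, (i + 1) * n_samples // k))
             for i in range(k)]
    pairs = []
    for i, validation in enumerate(folds):
        # Training indices are all indices outside the validation fold.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        pairs.append((train, validation))
    return pairs


def select_c(n_samples, score_fn, k=3):
    """Return the C with the highest mean validation score over k folds.

    score_fn(c, train_idx, val_idx) is a hypothetical callback that trains
    one classifier with the given C and returns its validation accuracy.
    """
    splits = k_fold_indices(n_samples, k)

    def mean_score(c):
        return sum(score_fn(c, tr, va) for tr, va in splits) / k

    # max() keeps the first (smallest) C when several values tie.
    return max(c_grid(), key=mean_score)
```
        </preformat>
        <p>Under these assumptions, select_c would be called once per individual classifier, so each of the eight classifiers may end up with a different value of C.</p>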
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <p>In the next two sections we present the results obtained by our method. Section 4.1
presents the results obtained using cross-validation on the training set. Section 4.2
presents the official results obtained on the PAN author profiling test set, released
by the shared task organizers over a month after the training set.</p>
      <sec id="sec-4-1">
        <title>4.1 Cross-Validation</title>
        <p>The cross-validation results are reported in Table 1, with the best results presented in
bold. The highest joint accuracy (when both the gender and the language
variety are predicted correctly) is obtained for Portuguese, where the system
obtains 0.75 accuracy. For gender identification, the highest accuracy of 0.79 is obtained
for English, while language variety is best predicted for Portuguese, with 0.97
accuracy. Portuguese also obtains the highest average accuracy of 0.83 (the average of gender,
language variety, and joint accuracy).</p>
        <p>The high results obtained for Portuguese were not surprising, as there were only two
Portuguese varieties in the dataset, from Brazil and from Portugal. The dataset included
more varieties and dialects for the other three languages, namely six English varieties,
seven Spanish varieties, and four Arabic dialects.</p>
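        <p>The three accuracy figures and their average can be computed as in the following sketch. The function names and the toy labels are illustrative assumptions and are not taken from the shared task's evaluation scripts.</p>
        <preformat>
```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def profile_scores(gold_gender, pred_gender, gold_variety, pred_variety):
    """Return gender, variety, joint, and average accuracy as a dict."""
    gender = accuracy(gold_gender, pred_gender)
    variety = accuracy(gold_variety, pred_variety)
    # A joint prediction counts as correct only if both labels are correct.
    joint = accuracy(list(zip(gold_gender, gold_variety)),
                     list(zip(pred_gender, pred_variety)))
    average = (gender + variety + joint) / 3
    return {"gender": gender, "variety": variety,
            "joint": joint, "average": average}


# Toy labels for four hypothetical authors (not taken from the dataset).
scores = profile_scores(
    ["male", "female", "male", "female"],
    ["male", "female", "female", "female"],
    ["brazil", "portugal", "brazil", "portugal"],
    ["brazil", "portugal", "brazil", "brazil"],
)
```
        </preformat>
        <p>In this toy example gender and variety accuracy are both 0.75, but joint accuracy drops to 0.5 because the two errors fall on different authors.</p>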
        <p>
          In no case do the individual classifiers outperform the ensembles. Portuguese
is the only language for which the best individual performance equals the performance
of the ensembles. For the others, the improvement obtained with ensembles reaches a maximum of 0.08 in
accuracy (for the English joint prediction). For three languages
out of four (English, Spanish, and Portuguese), word unigrams obtain the highest joint
accuracy among the individual classifiers. For Arabic, character 4-grams obtain the
highest joint accuracy. As far as the language variety and gender labels are concerned,
character 4-grams, character 5-grams, and word unigrams obtain better results than the
other types of features. For both gender and language variety identification, the best
results are obtained for Portuguese, using character 4-grams for gender identification
and word unigrams for language variety identification.
        </p>
        <p>
          In the official evaluation carried out on the test set by the PAN organizers, our system
was ranked 13th among 22 participating teams in both sub-tasks. The system achieved
0.7842 average accuracy for language variety and gender identification. The
results and ranks are described in more detail in the PAN labs report [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and in the
author profiling task report [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <p>In Table 2 we present the results obtained for language variety identification. For
reference, we include two baselines provided by the organizers: the BOW-baseline, a
bag-of-words model with the 1,000 most frequent items, and the STAT-baseline, a simple
majority class baseline. As in the cross-validation experiments, the best results
on the test set were obtained when discriminating between the two Portuguese
varieties, with 0.9788 accuracy. On language variety identification, our system achieved
an average accuracy of 0.8524, ranking 11th among 22 shared task entries.</p>
        <p>In Table 3 we present the results obtained for gender identification on tweets in the
different languages, along with the two aforementioned baselines. This is a binary
classification setting in which the systems are trained to discriminate between tweets written
by male and female writers. The number of gender classes was the same for all languages,
whereas the number of varieties and dialects per language ranged from 2 for
Portuguese to 7 for Spanish. For this reason, the results for gender identification varied
much less across languages than the results for language
variety/dialect identification.</p>
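        <p>For illustration, a majority class baseline such as the STAT-baseline can be sketched as follows. The organizers' actual baseline code is not described here, so the function name and the toy labels are assumptions.</p>
        <preformat>
```python
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test item.

    Returns the predicted label and the resulting test accuracy.
    """
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(label == majority_label for label in test_labels)
    return majority_label, correct / len(test_labels)


# Toy labels (hypothetical, not from the PAN data).
label, acc = majority_class_baseline(
    ["brazil", "brazil", "portugal"],
    ["brazil", "portugal", "portugal", "brazil"],
)
```
        </preformat>
        <p>With these toy labels the baseline always predicts brazil and is correct for two of the four test items. Such a baseline ignores the text entirely, which is why it serves as a lower bound for comparison.</p>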
        <p>Our method obtained its best results for Portuguese tweets, with 0.7713 accuracy, and
its lowest results for Arabic, with 0.7131 accuracy. The average accuracy of
our method on gender identification was 0.7504, ranking 12th among 22 shared
task entries.</p>
        <p>The results presented in this section indicate that our approach performed
substantially better than the two baselines provided and that it was consistently ranked in the middle
of the table for both language variety and gender identification. Even though the
results obtained by our method were not low, given the experience gained in past
PAN labs and DSL shared tasks, we expected our system to rank higher
in the official scores table. Possible factors that may have influenced the performance
of our method are: 1) the type of dataset used at PAN, which contains very short and
non-standard texts, 2) the large size of the dataset, which might have made it possible for
the other teams to use innovative approaches (e.g. deep learning), and 3) our
implementation of the classifier, which might not have been optimal. A thorough analysis of the
misclassified instances is being carried out to determine the reasons for this outcome
and possible ways to improve our system’s performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>This paper presented an SVM ensemble system trained on character and word
n-grams for author profiling. The system, tested on the PAN 2017 dataset, addresses both
gender and language variety/dialect identification. The approach described
in our submission was inspired by successful submissions to past editions of the PAN
task on gender identification and to the Discriminating between Similar Languages (DSL)
and Arabic Dialect Identification (ADI) shared tasks, the last two organized at the
VarDial workshop.</p>
      <p>In the cross-validation stage on the training set, our best results for gender identification
were obtained on English data (0.79 accuracy), and the best results for language
variety identification were obtained for Portuguese (0.97 accuracy). In the official evaluation
carried out on the test set, our system was ranked 11th on language variety
identification and 12th on gender identification out of 22 submissions, achieving 0.85 and 0.75
accuracy, respectively.</p>
      <p>
        To the best of our knowledge, PAN 2017 was the first shared task to
include language varieties and dialects in author profiling, opening avenues for future
research. Regarding our system’s performance, there is still room for improvement.
We are currently investigating ways to improve it by testing a
meta-classifier that achieved very good results on German dialect identification [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>We would like to thank the organizers of the PAN lab for proposing this
interesting shared task. Special thanks to Martin Potthast and Francisco Rangel for replying
promptly to all our inquiries and to Paolo Rosso for fruitful discussions and interesting
insights about author profiling during the last VarDial workshop at EACL 2017.</p>
      <p>Liviu P. Dinu is supported by UEFISCDI, project number 53BG/2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dehak</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardinal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khurana</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yella</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glass</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Renals</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatic Dialect Detection in Arabic Broadcast Speech</article-title>
          .
          <source>In: Proceedings of INTERSPEECH</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitório</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Smoothed N-gram based Models for Tweet Language Identification: A Case Study of the Brazilian and European Portuguese National Varieties</article-title>
          . Applied Soft Computing (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBSVM: A Library for Support Vector Machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <fpage>27:1</fpage>
          -
          <lpage>27:27</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ciobanu</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dinu</surname>
            ,
            <given-names>L.P.:</given-names>
          </string-name>
          <article-title>A Computational Perspective on Romanian Dialects</article-title>
          .
          <source>In: Proceedings of LREC</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gebre</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wittenburg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heskes</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improving Native Language Identification with TF-IDF Weighting</article-title>
          .
          <source>In: Proceedings of the BEA workshop</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Goutte</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Léger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpuat</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The NRC System for Discriminating Similar Languages</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Goutte</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Léger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Discriminating similar languages: Evaluations and explorations</article-title>
          .
          <source>In: Proceedings of LREC</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butnaru</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning to identify Arabic and German dialects using multiple kernels</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lui</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cook</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Classifying English Documents by National Dialect</article-title>
          .
          <source>In: Proceedings of ALTA</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Measuring Feature Diversity in Native Language Identification</article-title>
          .
          <source>In: Proceedings of the BEA Workshop</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Language identification using classifier ensembles</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles</article-title>
          .
          <source>In: Proceedings of SemEval</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Refaee</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Arabic Dialect Identification using a Parallel Multidialectal Corpus</article-title>
          .
          <source>In: Proceedings of PACLING</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Arabic Dialect Identification in Speech Transcripts</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>German dialect identification in interview transcriptions</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gravel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trieschnigg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meder</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>"How Old Do You Think I Am?" A Study of Language and Age in Twitter</article-title>
          .
          <source>In: Proceedings of ICWSM</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trieschnigg</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
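          ,
          <string-name>
            <surname>Doğruöz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gravel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theune</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meder</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Jong</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :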
          <article-title>Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment</article-title>
          .
          <source>In: Proceedings of COLING</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          .
          <source>In: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Sadat</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazemi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farzindar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatic Identification of Arabic Language Varieties and Dialects in Social Media</article-title>
          .
          <source>In: Proceedings of the SocialNLP Workshop</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection</article-title>
          .
          <source>In: Proceedings of the BUCC Workshop</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Tillmann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mansour</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Onaizan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Improved Sentence-Level Arabic Dialect Classification</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Chinese Grammatical Error Diagnosis Using Ensemble Learning</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications</source>
          . pp.
          <fpage>99</fpage>
          -
          <lpage>104</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Sentence-level dialects identification in the Greater China region</article-title>
          .
          <source>International Journal on Natural Language Computing (IJNLC)</source>
          <volume>5</volume>
          (
          <issue>6</issue>
          ) (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zaidan</surname>
            ,
            <given-names>O.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Arabic Dialect Identification</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>40</volume>
          (
          <issue>1</issue>
          ),
          <fpage>171</fpage>
          -
          <lpage>202</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gebre</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          :
          <article-title>Automatic Identification of Language Varieties: The Case of Portuguese</article-title>
          .
          <source>In: Proceedings of KONVENS</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherrer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aepli</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Findings of the VarDial Evaluation Campaign 2017</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sulea</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dinu</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          :
          <article-title>A Computational Approach to the Study of Portuguese Newspapers Published in Macau</article-title>
          .
          <source>In: Proceedings of the NLP Meets Journalism Workshop</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A report on the DSL shared task 2014</article-title>
          .
          <source>In: Proceedings of the VarDial Workshop</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the DSL shared task 2015</article-title>
          .
          <source>In: Proceedings of the LT4VarDial Workshop</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Zubiaga</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>San Vicente</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gamallo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pichel</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alegria</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aranberri</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ezeiza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fresno</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Overview of TweetLID: Tweet language identification at SEPLN 2014</article-title>
          .
          <source>In: Proceedings of the TweetLID Workshop</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>