<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gender prediction using lexical, morphological, syntactic and character-based features in Dutch</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bibliography</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work is the result of participation in a shared task on gender detection in Dutch. The task was to predict gender within and across different genres. We apply existing ideas about using lexical and more abstract text representations (morphological and syntactic labels, text bleaching). We compare different features across genres in two types of tasks and present two pipelines. Using three types of features, we found that lexical features are the most significant, although the other features also perform well and make the model more robust. Final scores were in the range 0.61-0.64 for in-genre and 0.53-0.56 for cross-genre prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Task and data</title>
      <p>The task was to detect the gender of the author in three genres: news, Twitter posts
and YouTube comments, using a training set of the same genre as the test set (in-genre
prediction) or of different genre(s) (cross-genre prediction).</p>
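      <p>Training set size (genders are balanced within each sample):
        <list list-type="bullet">
          <list-item><p>1832 news texts (340000 tokens)</p></list-item>
          <list-item><p>20000 Twitter posts (380000 tokens)</p></list-item>
          <list-item><p>14744 YouTube comments (280000 tokens)</p></list-item>
        </list>
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>Example of different text representations</p></caption>
        <table>
          <thead>
            <tr><th>Type of text</th><th>Example</th></tr>
          </thead>
          <tbody>
            <tr><td>Original text</td><td>Deze video bekijken terwijl je een vrouw bent!</td></tr>
            <tr><td>Lemmatized text</td><td>deze video bekijken terwijl je een vrouw bent!</td></tr>
            <tr><td>POS</td><td>DET NOUN VERB SCONJ PRON DET NOUN NOUN PUNCT</td></tr>
            <tr><td>Syntactic relations</td><td>det obj root mark nsubj det advcl nmod punct</td></tr>
            <tr><td>CV-mask</td><td>cvcv cvcvv cvcvccvc cvccvcc cv vvc ccvvc cvcc!</td></tr>
            <tr><td>UL-mask</td><td>ULLL LLLLL LLLLLLLL LLLLLLL LL LLL LLLLL LLLL!</td></tr>
            <tr><td>Word length</td><td>4 5 8 7 2 3 5 4 !</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>To find which features characterize the language of men and women, we need to explore
the different levels of the text, from the phonological or graphical level to the syntactic and
semantic levels. Preprocessing allows us to create text representations (modified
versions of the text that preserve certain features and neutralize others).
These representations correspond to different text levels and are used later
for feature extraction.</p>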
      <p>
        Some works (e.g.
        <xref ref-type="bibr" rid="ref3">van der Goot et al., 2018</xref>
        ) have shown that lexical features
achieve the highest scores. There is also a hypothesis that cross-genre
prediction calls for more abstract features (not text-specific but
text-independent, e.g. character-based). Being so general, they may help us achieve better
results in cross-genre models, because text-specific lexical features are intuitively
more genre-specific. We check this hypothesis in our experiments.
      </p>
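      <p>We use three groups of features, in ascending order of abstractness:
        <list list-type="bullet">
          <list-item><p>Lexical</p></list-item>
          <list-item><p>Morphological and syntactic</p></list-item>
          <list-item><p>Character-based</p></list-item>
        </list>
      </p>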
      <p>
        Lexical features can be either tokens or lemmas. In our case, these are lemmas,
since the relatively small size of the training set (in both tokens and documents)
does not allow the use of tokens. The second group consists of part-of-speech
(POS) tags and labels of syntactic relations. The third group comprises text
bleaching
        <xref ref-type="bibr" rid="ref3">(van der Goot et al., 2018)</xref>
        features. These features are very abstract
and, in this work, character-based. This makes them applicable to cross-genre
prediction as well as to in-genre prediction. They are the consonant/vowel mask (the text becomes a
sequence of marks showing whether each character denotes a vowel or a consonant),
the upper/lower case mask and word lengths (in characters). An example of the different
text representations is given in Table 1.
      </p>
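      <p>The three character-based representations can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function names are hypothetical, and the vowel inventory (a, e, i, o, u only) is an assumption chosen so that the output matches the example sentence in Table 1.</p>
      <preformat><![CDATA[
```python
import re

# Assumption: only a, e, i, o, u count as vowels; every other letter
# (including "j" and "y") is treated as a consonant.
VOWELS = set("aeiou")


def cv_mask(text):
    """Replace each letter with 'v' (vowel) or 'c' (consonant)."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append("v" if ch.lower() in VOWELS else "c")
        else:
            out.append(ch)  # spaces and punctuation are kept unchanged
    return "".join(out)


def ul_mask(text):
    """Replace each letter with 'U' (upper case) or 'L' (lower case)."""
    return "".join(
        ("U" if ch.isupper() else "L") if ch.isalpha() else ch for ch in text
    )


def word_lengths(text):
    """Replace each word with its length; punctuation stays a separate token."""
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return " ".join(str(len(t)) if t[0].isalnum() else t for t in tokens)


sentence = "Deze video bekijken terwijl je een vrouw bent!"
print(cv_mask(sentence))
print(ul_mask(sentence))
print(word_lengths(sentence))
```
]]></preformat>
      <p>Each function returns a string, so the bleached text can be passed to the same vectorizer as the original text.</p>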
      <p>
        Each language has its own characteristics, which are taken into account in
specialized tools for working with it, but it is difficult for an external researcher
(who has no experience with the language and does not speak it)
to find and use them properly. On the one hand, some lexical features can be
selected and checked for credibility only by working with semantics, which requires
language knowledge. On the other hand, many NLP tasks (e.g. translation) can
be solved without any knowledge of the processed language(s). Therefore, in this
paper we use tools that are available for a wide range of languages and can be used
without specific knowledge of Dutch: available pre-trained Word2Vec (W2V)
        <xref ref-type="bibr" rid="ref1">(Mikolov et al., 2013)</xref>
        and UDPipe
        <xref ref-type="bibr" rid="ref2">(Straka and Straková, 2017)</xref>
        models. The
former is used for working on lexical features and the latter for lemmatization,
part-of-speech tagging and extracting syntactic information.
      </p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption><p>Accuracy scores for different text representations</p></caption>
        <table>
          <thead>
            <tr><th>Representation</th><th>News</th><th>Twitter</th><th>YouTube</th></tr>
          </thead>
          <tbody>
            <tr><td>Lemmatized</td><td>0.665</td><td>0.622</td><td>0.598</td></tr>
            <tr><td>POS</td><td>0.574</td><td>0.568</td><td>0.548</td></tr>
            <tr><td>Syntax</td><td>0.579</td><td>0.553</td><td>0.538</td></tr>
            <tr><td>CV-mask</td><td>0.625</td><td>0.587</td><td>0.606</td></tr>
            <tr><td>UL-mask</td><td>0.598</td><td>0.590</td><td>0.583</td></tr>
            <tr><td>Word length</td><td>0.578</td><td>0.58</td><td>0.584</td></tr>
          </tbody>
        </table>
      </table-wrap>
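      <p>UDPipe writes its analyses in the CoNLL-U format, so the lemma, POS and syntactic-relation representations can be read from the corresponding columns. The sketch below assumes the parser output is already available as a string; the sample rows are hand-written from the example in Table 1, with unused columns left as "_", and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
# CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
# Hand-written illustration based on Table 1; unused columns are "_".
CONLLU = "\n".join([
    "# text = Deze video bekijken terwijl je een vrouw bent!",
    "1\tDeze\tdeze\tDET\t_\t_\t_\tdet\t_\t_",
    "2\tvideo\tvideo\tNOUN\t_\t_\t_\tobj\t_\t_",
    "3\tbekijken\tbekijken\tVERB\t_\t_\t_\troot\t_\t_",
    "4\tterwijl\tterwijl\tSCONJ\t_\t_\t_\tmark\t_\t_",
    "5\tje\tje\tPRON\t_\t_\t_\tnsubj\t_\t_",
    "6\teen\teen\tDET\t_\t_\t_\tdet\t_\t_",
    "7\tvrouw\tvrouw\tNOUN\t_\t_\t_\tadvcl\t_\t_",
    "8\tbent\tbent\tNOUN\t_\t_\t_\tnmod\t_\t_",
    "9\t!\t!\tPUNCT\t_\t_\t_\tpunct\t_\t_",
])


def representations(conllu):
    """Collect lemma, POS and dependency-label sequences from CoNLL-U text."""
    lemmas, pos, deprel = [], [], []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        lemmas.append(cols[2])
        pos.append(cols[3])
        deprel.append(cols[7])
    return {
        "lemma": " ".join(lemmas),
        "pos": " ".join(pos),
        "syntax": " ".join(deprel),
    }


reps = representations(CONLLU)
print(reps["pos"])
print(reps["syntax"])
```
]]></preformat>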
    </sec>
    <sec id="sec-2">
      <title>Feature extraction</title>
      <p>In the first experiment we compare the usefulness of each type of
representation. We use the TF-IDF vectorizer from the sklearn Python library as the feature
extraction instrument and logistic regression as the classifier. This step is necessary
because we need to determine the potential of each type of text representation
and to examine the range of accuracy scores that we can expect from our model
at the final stage. We can use the accuracy metric because the sample is balanced
(50/50 texts by men and women). Table 2 shows the scores obtained using different
text representations.</p>
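      <p>A baseline of this kind can be sketched with scikit-learn as follows. The toy texts and labels are invented for illustration, and the default hyperparameters and the two-fold cross-validation are assumptions, since the exact settings are not stated here.</p>
      <preformat><![CDATA[
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy POS-tag "texts" and gender labels, invented for illustration only.
texts = [
    "DET NOUN VERB PUNCT", "PRON VERB DET NOUN PUNCT",
    "DET ADJ NOUN VERB PUNCT", "PRON VERB ADV PUNCT",
    "NOUN VERB DET NOUN PUNCT", "PRON AUX VERB PUNCT",
    "DET NOUN AUX VERB PUNCT", "PRON VERB PRON PUNCT",
]
labels = ["M", "F", "M", "F", "M", "F", "M", "F"]

# One representation in, one accuracy estimate out; default settings assumed.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")
print(scores.mean())
```
]]></preformat>
      <p>Running the same pipeline once per representation (lemmas, POS tags, masks and so on) gives one row of a comparison like Table 2.</p>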
      <p>As Table 2 shows, the most useful features are the lemmatized text
and the abstract character-based masks. However, we cannot exclude the other features,
because their scores are also good on this scale and they may perform
better in later experiments.</p>
      <p>The next step in working with vocabulary is to combine TF-IDF and
Word2Vec features. This can be done by multiplying the matrix of TF-IDF weights
by the matrix of W2V vectors in order to get one vector for each text. The resulting vector
corresponds to the TF-IDF-weighted sum of the semantic vectors of the
words in each text.</p>
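      <p>The multiplication can be illustrated on a tiny example. In this sketch the two-dimensional word vectors are hypothetical stand-ins for real W2V vectors, and the TF-IDF weights use a common smoothed formula (raw count times ln((1 + n) / (1 + df)) + 1, without normalization), which is an assumption and may differ from the weighting used in the experiments.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

# Hypothetical 2-dimensional "word vectors"; real W2V vectors are much larger.
W2V = {"video": [1.0, 0.0], "vrouw": [0.0, 1.0], "bekijken": [1.0, 1.0]}

docs = [["video", "bekijken"], ["vrouw", "video", "video"]]
vocab = sorted(W2V)  # fixed column order: bekijken, video, vrouw


def tfidf_row(doc, docs, vocab):
    """Raw term count times a smoothed inverse document frequency."""
    counts = Counter(doc)
    n = len(docs)
    row = []
    for term in vocab:
        df = sum(term in d for d in docs)  # number of texts containing the term
        idf = math.log((1 + n) / (1 + df)) + 1
        row.append(counts[term] * idf)
    return row


def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


tfidf = [tfidf_row(d, docs, vocab) for d in docs]  # texts x vocabulary
embeddings = [W2V[t] for t in vocab]               # vocabulary x dimensions
doc_vectors = matmul(tfidf, embeddings)            # texts x dimensions
print(doc_vectors)
```
]]></preformat>
      <p>Each row of the product is the sum of the word vectors in one text, with every word scaled by its TF-IDF weight.</p>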
    </sec>
    <sec id="sec-3">
      <title>The experiment with combined features</title>
      <p>The next step is to combine all features in order to stabilize our model and make
it more robust. Since we could submit two predictions for each task, we decided
to try two combinations of features.</p>
      <p>The first pipeline consists of the following groups of features:
        <list list-type="bullet">
          <list-item><p>TF-IDF (using words) with n-gram range from 1 to 4: W2V vectors with TF-IDF weights, POS tags, syntactic labels, CV-mask, UL-mask and word lengths</p></list-item>
          <list-item><p>TF-IDF (using characters) with n-gram range from 1 to 4: tokenized text, CV-mask, UL-mask</p></list-item>
          <list-item><p>TF-IDF + W2V vectors of lemmatized text</p></list-item>
        </list>
      </p>
      <p>For POS tags and other non-lexical features, the TF-IDF vectorizer
treated tags or labels as if they were the words of a metalanguage.</p>
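      <p>To make the n-gram range concrete, the sketch below lists the word-level and character-level n-grams of lengths 1 to 4 for a tag sequence and a mask. It only illustrates which units are counted; the helper names are hypothetical, and the tokenization performed by a real vectorizer may differ.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=4):
    """All contiguous token n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(text, n_min=1, n_max=4):
    """All contiguous character n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


# POS tags are treated like words: "DET NOUN" is simply a bigram.
pos_tags = "DET NOUN VERB SCONJ PRON DET NOUN NOUN PUNCT".split()
pos_counts = Counter(word_ngrams(pos_tags))
print(pos_counts["DET NOUN"])

# Character n-grams over the CV-mask of a single word.
print(char_ngrams("cvcv"))
```
]]></preformat>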
      <p>The second pipeline follows the first one but excludes the lexical features (W2V).
Table 3 presents a comparison of these two pipelines. We can see that the gap
between the pipelines is very small, so using only abstract features gives a result
comparable to that obtained with complex features such as W2V.</p>
      <p>As we saw in Table 2, using only lexical features gives a very good result,
and on news texts an even better one. There are two reasons why we did not go back to
using this representation exclusively. The first is the instability of vocabulary:
we cannot guarantee that new data will follow the vocabulary we have, so the
results are very unpredictable. The second is the homogeneity of our pipeline:
our aim was to create a uniform pipeline for all genres, not separate models that
show good results in cross-validation.</p>
    </sec>
    <sec id="sec-4">
      <title>Final results</title>
      <p>Each task allowed two submissions (predictions from the models of both pipelines),
so we could test two combinations of features and check the hypothesis about the
robustness and quality of a model based on more abstract features. For in-genre
tasks, models were trained on a sample of the target genre, while for cross-genre
tasks we used the two other genres but not the target one (e.g. news + Twitter for
YouTube comments).</p>
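      <p>The choice of training data for the two settings can be written down compactly. The sketch below is only an illustration of the protocol described above; the genre identifiers and the function name are hypothetical.</p>
      <preformat><![CDATA[
```python
GENRES = ("news", "twitter", "youtube")


def training_genres(target, setting):
    """Return the genres used for training, given the target genre and setting."""
    if target not in GENRES:
        raise ValueError(f"unknown genre: {target}")
    if setting == "in":
        return [target]  # train and test on the same genre
    if setting == "cross":
        return [g for g in GENRES if g != target]  # the two other genres
    raise ValueError(f"unknown setting: {setting}")


print(training_genres("youtube", "cross"))
print(training_genres("news", "in"))
```
]]></preformat>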
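      <table-wrap id="tab-4">
        <label>Table 4</label>
        <caption><p>Final accuracy score</p></caption>
        <table>
          <thead>
            <tr><th rowspan="2"/><th colspan="2">News</th><th colspan="2">Twitter</th><th colspan="2">YouTube</th></tr>
            <tr><th>P1</th><th>P2</th><th>P1</th><th>P2</th><th>P1</th><th>P2</th></tr>
          </thead>
          <tbody>
            <tr><td>In</td><td>0.637</td><td>0.619</td><td>0.624</td><td>0.612</td><td>0.633</td><td>0.623</td></tr>
            <tr><td>Cross</td><td>0.534</td><td>0.554</td><td>0.558</td><td>0.547</td><td>0.541</td><td>0.522</td></tr>
          </tbody>
        </table>
      </table-wrap>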
      <p>The hypothesis that more abstract features (compared with lexical ones)
would be better in cross-genre tasks was refuted. We can see in Table 4 that
the second pipeline achieved better results only in the news genre. This probably
happened because short personal texts differ from more formal ones in terms
of vocabulary. News texts represent a more literary language, while tweets and
comments are closer to spoken language. Moreover, the themes of these
texts are different.</p>
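      <p>The comparison can be checked directly from the scores in Table 4. The following sketch copies those scores and reports where the second pipeline (P2) beats the first (P1), together with the average gap between in-genre and cross-genre accuracy; the variable and function names are illustrative.</p>
      <preformat><![CDATA[
```python
from statistics import mean

# Accuracy scores copied from Table 4, stored as (P1, P2) per target genre.
IN_GENRE = {
    "news": (0.637, 0.619),
    "twitter": (0.624, 0.612),
    "youtube": (0.633, 0.623),
}
CROSS_GENRE = {
    "news": (0.534, 0.554),
    "twitter": (0.558, 0.547),
    "youtube": (0.541, 0.522),
}


def p2_wins(scores):
    """Genres in which the second pipeline scores higher than the first."""
    return sorted(genre for genre, (p1, p2) in scores.items() if p2 > p1)


def overall_mean(scores):
    """Mean accuracy over all genres and both pipelines."""
    return mean(value for pair in scores.values() for value in pair)


gap = overall_mean(IN_GENRE) - overall_mean(CROSS_GENRE)
print(p2_wins(IN_GENRE), p2_wins(CROSS_GENRE))
print(round(gap, 3))
```
]]></preformat>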
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In general, the results were lower than expected, although this may be because of
the size of the sample or the weak gender differences in Dutch. As we can see in
Table 4, a large gap exists between in-genre and cross-genre scores. Consequently,
we can conclude that the features learned by the models are genre-specific
rather than general. We consider the model successful in terms of in-genre detection,
because its score is above 0.6, while the score for cross-genre detection
only slightly exceeds random choice.</p>
    </sec>
    <sec id="sec-6">
      <title>Further work</title>
      <p>This pipeline provides good results, but further work may include better feature
selection. In our case there are thousands of numeric features obtained from
different text representations, and we can expect that many of them are very noisy
and should be excluded from the resulting feature matrix. This could improve the
existing model. Moreover, experiments with more sophisticated classifiers and
parameter selection may improve our score as well.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank Malvina Nissim for a very inspiring talk about text
bleaching at the conference in Nizhny Novgorod (Russia, Autumn 2018).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
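          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Greg</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR</source>
          abs/1301.3781.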
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Milan</given-names>
            <surname>Straka</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jana</given-names>
            <surname>Straková</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe</article-title>
          .
          In
          <source>Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <publisher-loc>Vancouver, Canada</publisher-loc>
          , pages
          <fpage>88</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
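          <string-name>
            <given-names>Rob</given-names>
            <surname>van der Goot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nikola</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian</given-names>
            <surname>Matroos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Plank</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bleaching text: Abstract features for cross-lingual gender prediction</article-title>
          .
          <source>CoRR</source>
          abs/1805.03122.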
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>