<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Embedding and Clustering for Cross-Genre Gender Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bibliography</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLCG, Information Science University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>For CLIN 2019 a shared task on binary gender prediction within and across different genres in Dutch was issued. This paper reports the findings of team "Rob's Angels" for this shared task. A multitude of linear SVM models were created to predict gender in different genres (Twitter, YouTube and news), and cross-genre. Our best models used Twitter word embeddings, in combination with removal of stopwords and tokenization of the text. We also introduced a novelty in classifying the news corpus: the large instances of news data are split into smaller parts, each part is classified individually, and the text as a whole is then assigned a label based on majority voting. We eventually finished eighth in the in-genre category with an average accuracy of 0.617 and fourth in the cross-genre category with an average accuracy of 0.547.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Picture a news article. It starts with a title, then a short intro, followed by
the name of the author. In many cases (unless the name is unambiguous) you do
not know whether the author is a man or a woman. In the case of a tweet, you can often
determine gender from the profile picture. But what if you remove this kind of
information and are left with only the actual text? If the text is written on paper,
you could try to infer gender from handwriting, as Hamid and Loewenthal (1996)
did. But nowadays, many texts are written in an electronic environment. Does
this mean that there is no way of telling whether a text was written by a man or a woman?
This is the task of gender identification: can you, based on the text someone
has written, predict whether this person is a man or a woman? In this paper, we
examine three kinds of text from different genres: tweets, YouTube comments
and news articles. We make both in-genre and cross-genre predictions, i.e. we train
on one genre and predict for the same genre, and we train on everything but one
genre and test on the left-out genre. The main focus is on the cross-genre
task, since the "official" winner of the shared task is determined by the
cross-genre ranking. The best cross-genre model is also used for the in-genre
predictions.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Gender prediction is a subtask of author profiling. Much
previous work has been done on author profiling, but most of it was conducted on
a single domain. The focus of this shared task, however, is on cross-genre modelling.
A similar shared task has been held for Italian texts
        <xref ref-type="bibr" rid="ref2">(Dell'Orletta and Nissim,
2018)</xref>
        . The participants were provided with texts from different genres: Twitter, YouTube,
Children writing, Journalism and Personal diaries. Given texts from these genres,
the task was to predict gender in-domain and cross-domain. The datasets for both
training and test were balanced in terms of gender label distribution (50:50). For
in-genre prediction, the highest score was obtained for the Personal diaries genre,
namely an accuracy of 0.676. In the cross-genre setting, the highest accuracy was
0.640 for the Children writing genre. The hardest genre seemed to be Journalism,
since this genre had the lowest average scores over all teams, both in-genre and
cross-genre. Participants used a range of models, such as SVMs,
logistic regression, random forests and Bi-LSTMs. Overall, neural models performed
slightly better than a classic SVM, but the difference was minimal.
      </p>
      <p>
        Another similar shared task has been held for Russian
        <xref ref-type="bibr" rid="ref5">(Litvinova et al.,
2017)</xref>
        . The setup, however, was slightly different, since participants were only given
a training set of Twitter data. Their models were evaluated on five genres: essays,
Facebook, Twitter, reviews and texts in which the authors imitated the other gender.
Again, multiple machine learning techniques were used, such as SVMs and neural
networks. Deep learning techniques seemed to work best for the essay and gender
imitation genres, while an SVM with combinations of n-grams performed better on
the Facebook, Twitter and reviews genres.
      </p>
      <p>Judging from the results of these similar shared tasks, both SVMs and
neural networks perform well, each on different genres. Because of the simplicity
and efficiency of the SVM as a machine learning technique, we mainly focus on
SVM models.</p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>The data used for this shared task is in Dutch and is retrieved from the following
three genres: Twitter, YouTube and news. For Twitter, a set of tweets is provided,
for YouTube a set of video comments and for news a set of articles. Every instance
of text in the dataset has an ID number, a genre label and a gender label. See
below for a sample instance from the Twitter data:</p>
      <preformat>&lt;doc id="36" genre="twitter" gender="F"&gt;
Hou van mijn vrienden
&lt;/doc&gt;</preformat>
      <p>In all genres, the gender distribution was balanced (50:50). Table 1 shows
the distribution of data between the three genres.</p>
      <p>The number of instances differs a lot between the three genres, but the numbers
of tokens are of more comparable size. The number of instances in the news
genre is relatively low compared to the other two genres, but the average number
of tokens per instance reveals that these texts are considerably longer.</p>
    </sec>
    <sec id="sec-4">
      <title>Method</title>
      <p>The main focus of this paper is on the cross-genre task. For this task we first
decided to train on the Twitter data to predict on the YouTube data, to train on
the YouTube data to predict on the Twitter data, and to train on the Twitter
and YouTube data together to predict on the news data. We paired Twitter and
YouTube because their messages are of roughly the same length and contain less
standard text, which probably leads to the use of different kinds of words than
in the news texts. We used both Twitter and YouTube to train for the prediction
of news because we wanted as much training data as possible, given that the style
of the news genre is completely different from that of Twitter and YouTube.</p>
      <sec id="sec-4-1">
        <title>Preprocessing</title>
        <p>We experimented with a variety of preprocessing steps:</p>
        <list list-type="bullet">
          <list-item><p>removal/replacement of emojis: use regular expressions to find emoticons,
symbols &amp; pictographs, transport &amp; map symbols and flags.</p></list-item>
          <list-item><p>removal/replacement of usernames: use regular expressions to find usernames
(starting with @).</p></list-item>
          <list-item><p>removal of stopwords: when using embeddings we removed stopwords with
the NLTK stopwords list, to get a more accurate average embedding.</p></list-item>
          <list-item><p>tokenization: use the NLTK word tokenizer to separate the words.</p></list-item>
          <list-item><p>POS-tagging: use POS-tags from the spaCy model nl_core_news_sm.</p></list-item>
          <list-item><p>removal of links: use regular expressions to find the links (starting with
http(s)).</p></list-item>
        </list>
        <p>In case of replacement, emojis were replaced with "&lt;EMOJI&gt;" and usernames
were replaced with "&lt;USERNAME&gt;".</p>
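        <p>The regular-expression steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the shared task: the exact patterns are not given in the paper, so the three patterns, the Unicode ranges and the function name preprocess are assumptions. Stopword removal, tokenization and POS-tagging are left out because they depend on NLTK and spaCy resources.</p>
        <preformat><![CDATA[
```python
import re

# Assumed patterns: the paper does not list its exact regular expressions.
URL_RE = re.compile(r"https?://\S+")   # links starting with http(s)
USER_RE = re.compile(r"@\w+")          # usernames starting with @
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "]"
)


def preprocess(text: str) -> str:
    """Remove links and replace usernames and emojis with placeholders."""
    text = URL_RE.sub("", text)
    text = USER_RE.sub("<USERNAME>", text)
    # Pad the placeholder so that adjacent emojis stay separate tokens.
    text = EMOJI_RE.sub(" <EMOJI> ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess("@jan kijk https://t.co/abc \U0001F600 leuk"))
```
]]></preformat>
        <p>Links are removed before usernames are replaced so that an @ inside a URL is not mistaken for a username.</p>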
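        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption><p>Data distribution of the genres</p></caption>
          <table>
            <thead>
              <tr><th /><th>Twitter</th><th>YouTube</th><th>News</th></tr>
            </thead>
            <tbody>
              <tr><td>Instances (I)</td><td>20,000</td><td>14,744</td><td>1,832</td></tr>
              <tr><td>Tokens (T)</td><td>469,105</td><td>300,360</td><td>382,146</td></tr>
              <tr><td>Avg. T/I</td><td>23.5</td><td>20.4</td><td>208.6</td></tr>
            </tbody>
          </table>
        </table-wrap>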
      </sec>
      <sec id="sec-4-2">
        <title>Classifier</title>
        <p>We experimented with a support vector machine (SVM), specifically a LinearSVC.
For the tokenized words and the POS-tags we experimented with a Tfidf-vectorizer
with different n-gram ranges (unigrams, bigrams and trigrams, and combinations
of these three).</p>
      </sec>
      <sec id="sec-4-3">
        <title>Split news</title>
        <p>In order to make the news data more similar to the short Twitter and YouTube
texts, we split up the news articles. We experimented with a split every two, three
and four sentences. Only the last part may consist of fewer sentences: for example,
when splitting every three sentences, an article of 20 sentences yields a last part
of 2 sentences instead of 3. For every part the system predicts whether
the gender is male or female, and in the end majority voting is used to get the
final gender. In case of a tie, the gender is set to female. This choice is purely
based on the fact that female gave higher scores than male for the given data.</p>
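        <p>The splitting and voting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the shared-task code: the article is assumed to be already split into sentences, classify is a hypothetical stand-in for the trained classifier that returns "M" or "F" for a piece of text, and the function names are ours.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Callable, List


def split_into_chunks(sentences: List[str], size: int = 3) -> List[str]:
    """Join every `size` consecutive sentences; the last chunk may be shorter."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def predict_article(
    sentences: List[str],
    classify: Callable[[str], str],  # hypothetical classifier: text -> "M" or "F"
    size: int = 3,
    tie_label: str = "F",  # ties go to female, as described in the text
) -> str:
    """Classify each chunk separately and return the majority label."""
    votes = Counter(classify(chunk) for chunk in split_into_chunks(sentences, size))
    if votes["M"] > votes["F"]:
        return "M"
    if votes["F"] > votes["M"]:
        return "F"
    return tie_label


if __name__ == "__main__":
    article = [f"Zin {i}." for i in range(20)]
    chunks = split_into_chunks(article, 3)
    print(len(chunks), "chunks; last chunk:", chunks[-1])
```
]]></preformat>
        <p>With 20 sentences and a chunk size of three, the sketch yields seven chunks, the last of which holds two sentences, matching the example in the text.</p>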
      </sec>
      <sec id="sec-4-4">
        <title>Brown clusters</title>
        <p>We also experimented with the use of Brown clusters. The Brown clusters that
we used were created by Bouma (2015) on the basis of 594 million Dutch tweets
(5.8 billion tokens). The tweets were collected between 2011 and 2014, using the method
described by Sang and van den Bosch (2013). We used the Brown
clusters to replace all tokens with the corresponding cluster. When a token
did not have a corresponding cluster, we assigned it the value "unk". For each
sentence, the clusters were then put together in one string.</p>
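        <p>The token-to-cluster replacement described above amounts to a dictionary lookup with a fallback. The sketch below illustrates this under assumptions: the mapping toy_clusters and its bit-string identifiers are invented for the example and do not come from the clusters of Bouma (2015), and the function name is ours.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List


def to_cluster_string(tokens: List[str], clusters: Dict[str, str], unk: str = "unk") -> str:
    """Replace each token by its cluster id (or `unk`) and join the ids into one string."""
    return " ".join(clusters.get(token, unk) for token in tokens)


if __name__ == "__main__":
    # Hypothetical cluster ids, for illustration only.
    toy_clusters = {"hou": "0110", "van": "0111", "mijn": "0010"}
    print(to_cluster_string(["hou", "van", "mijn", "vrienden"], toy_clusters))
```
]]></preformat>
        <p>The resulting string of cluster identifiers can then be vectorized in the same way as a string of words.</p>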
      </sec>
      <sec id="sec-4-5">
        <title>Word embeddings</title>
        <p>Our best model used Twitter embeddings created as described in the paper
by van der Goot and van Noord (2017). For these embeddings, the
default word2vec settings were used, but with skip-grams. We translated
each token of each instance into its 100-dimensional embedding. When a token
had no corresponding embedding, we skipped this token. Once we had a list of
100-dimensional embeddings for each instance of the dataset, we took the
average 100-dimensional embedding of each instance and fed this to the classifier
as features.</p>
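        <p>The averaging step described above can be sketched as follows. This is a minimal illustration, not the implementation used for the shared task: vectors is a hypothetical token-to-vector dictionary standing in for the Twitter word2vec model, and returning a zero vector when no token of an instance has an embedding is an assumption, since the paper does not say how that case was handled.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Sequence


def average_embedding(
    tokens: Sequence[str],
    vectors: Dict[str, List[float]],  # hypothetical stand-in for the word2vec model
    dim: int = 100,
) -> List[float]:
    """Average the embeddings of all tokens that have one; unknown tokens are skipped."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        # Assumption: fall back to a zero vector when no token is known.
        return [0.0] * dim
    # zip(*found) iterates over the dimensions, so each column is averaged separately.
    return [sum(column) / len(found) for column in zip(*found)]


if __name__ == "__main__":
    toy_vectors = {"hou": [1.0, 2.0, 3.0], "van": [3.0, 4.0, 5.0]}
    print(average_embedding(["hou", "van", "vrienden"], toy_vectors, dim=3))
```
]]></preformat>
        <p>Each instance is thus reduced to a single fixed-length vector regardless of its number of tokens, which is what allows short tweets and longer news fragments to be fed to the same classifier.</p>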
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
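      <p>The baseline for this shared task was a majority baseline with an accuracy of 50%.
Our best results on the development and test set for in-genre classification are
reported in Table 2. The classifier used to get these results is a LinearSVC with
the C parameter set to 1. The results of the different features and preprocessing
steps for cross-genre classification on the development data are reported in Tables
3, 4 and 5.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption><p>In-genre results (accuracy) on development and test data</p></caption>
        <table>
          <thead>
            <tr><th>Genre</th><th>Development</th><th>Test</th></tr>
          </thead>
          <tbody>
            <tr><td>Twitter</td><td>0.648</td><td>0.648</td></tr>
            <tr><td>YouTube</td><td>0.604</td><td>0.594</td></tr>
            <tr><td>News</td><td>0.633</td><td>0.609</td></tr>
            <tr><td>Avg.</td><td>0.628</td><td>0.617</td></tr>
          </tbody>
        </table>
      </table-wrap>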
      <p>All our best models, both cross- and in-genre, used the following preprocessing
steps: tokenization, link removal, stopword removal, replacement of usernames
and replacement of emojis. Furthermore, all systems used a 100-dimensional
word embedding model which was trained on Twitter data. The models used
unigrams only, since the embedding model was trained on unigrams only.</p>
      <p>For in-genre classification, stopword removal was not
applied in the YouTube model. The YouTube model performed worse on the
development data when stopwords were removed, while the Twitter and news
models performed better.</p>
      <p>The use of Brown clusters improved the accuracy scores on Twitter and
YouTube slightly compared to using n-grams, but on news the accuracy scores
dropped. When we started working with the Twitter embeddings, we found
that training on news and testing on Twitter gave better results than
training on YouTube. For testing on YouTube, training on news with embeddings
did not improve the results, probably because the embeddings are trained on
Twitter data. We also discovered that, using embeddings, training on only
Twitter data to predict on news improved the results over training on Twitter
and YouTube together.</p>
      <p>Note 1: F1 = tokenization, F2 = POS-tagging, F3 = replace usernames + emojis, remove
links, F4 = Brown clusters, F5 = word embeddings, F6 = remove stopwords.</p>
      <p>Only the news model used a slightly different approach. As described in Section
4, news articles were split up every three sentences and classified according to
majority voting. In case of a tie, the female label was given, since this resulted in
higher scores on the development data.</p>
      <p>The results on the test set for cross-genre classification are shown in Table 6.
The main focus of our research was to find improvements for cross-genre gender
prediction models. After initial low results using word n-grams and POS-tags,
we experimented with Twitter embeddings. Ideally we would also have had embeddings
trained on YouTube and news data, but we did not have another dataset
for these genres and were limited by time. We considered training on the test
corpora of these genres, but they were too small for training embeddings. Also,
training embeddings on a corpus that we had to test on seemed like cheating.</p>
      <p>The use of the Twitter embeddings boosted our accuracy scores.
Unsurprisingly, the cross-genre predictions on the Twitter data got the biggest
boost, but the other two genres also saw an increase in accuracy. The
most surprising finding in our results was the high accuracy score we got on the
Twitter corpus with a model trained on news data. At first we did not even try
this option, since the Twitter and news corpora are fundamentally different, and
we thought we would get better results by training on the YouTube data. However,
when training on the YouTube corpus we got an accuracy of 0.57 on Twitter,
while when training on news we got 0.584.</p>
      <p>Since the instances in the news corpus have on average many more tokens than
instances in the other two corpora, our results suggest that for cross-genre
training it is better to compute the average embedding of each training instance over a larger
amount of text. This extra text seems to increase the quality of the average embedding.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Based on our results, we can conclude that using embeddings as features for an
SVM classifier works better in cross-genre gender prediction than using Brown
clusters as features. We also found that when training a classifier for cross-genre
predictions with word embeddings, it is better to train on a corpus that has
longer instances (news), even if they are less similar to the predicted
genre (Twitter) than other corpora (YouTube). We believe that the extra length
of the training instances contributes to higher-quality average embeddings, which
makes the classifier more effective.</p>
      <p>Of course we must be careful with the last conclusion, since it could also be
that the YouTube corpus was of low quality or less similar to the Twitter corpus
than we thought. One way to test the suggestion above is to test on news
differently. Our current model splits news into tweet-size snippets, classifies them
and then takes the most frequently predicted label. However, we might have had better
results with the opposite approach: combine a certain number of Twitter messages
with the same label, train the model on these larger Twitter texts (preferably with
news embeddings), and test on the news corpus. This might be an interesting
approach for future research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Gosse</given-names>
            <surname>Bouma</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>N-gram frequencies for Dutch twitter data</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          ,
          <volume>5</volume>
          :
          <fpage>25</fpage>
          –
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 cross-genre gender prediction (GxG) task</article-title>
          .
          <source>In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy</source>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
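          <string-name>
            <given-names>Rob</given-names>
            <surname>van der Goot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gertjan</given-names>
            <surname>van Noord</surname>
          </string-name>
          .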
          <year>2017</year>
          .
          <article-title>MoNoise: Modeling noise using a modular normalization system</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          ,
          <volume>7</volume>
          :
          <fpage>129</fpage>
          –
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Sarah</given-names>
            <surname>Hamid</surname>
          </string-name>
          and Kate Miriam Loewenthal.
          <year>1996</year>
          .
          <article-title>Inferring gender from handwriting in Urdu and English</article-title>
          .
          <source>The Journal of social psychology</source>
          ,
          <volume>136</volume>
          (
          <issue>6</issue>
          ):
          <fpage>778</fpage>
          –
          <lpage>782</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Francisco M Rangel Pardo, Paolo Rosso, Pavel Seredin, and
          <string-name>
            <given-names>Olga</given-names>
            <surname>Litvinova</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the RUSProfiling PAN at FIRE track on cross-genre gender identification in Russian</article-title>
          .
          <source>In FIRE (Working Notes)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Erik</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          and Antal van den Bosch.
          <year>2013</year>
          .
          <article-title>Dealing with big data: The case of Twitter</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          ,
          <volume>3</volume>
          :
          <fpage>121</fpage>
          –
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>