<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identification and Classification of Misogynous Tweets Using Multi-classifier Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Han Liu</string-name>
          <email>LiuH48@cardiff.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatima Chiroma</string-name>
          <email>fatima.chiroma@port.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihaela Cocea</string-name>
          <email>mihaela.cocea@port.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science and Informatics, Cardiff University</institution>
          ,
          <addr-line>Cardiff</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing, University of Portsmouth</institution>
          ,
          <addr-line>Portsmouth</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>268</fpage>
      <lpage>273</lpage>
      <abstract>
        <p>For this study, we used the Doc2Vec embedding approach for feature extraction, with a context window size of 2, a minimum word frequency of 2, a sampling rate of 0.001, a learning rate of 0.025, a minimum learning rate of 1.0E-4, a layer size of 200, a batch size of 10000 and 40 epochs. Distributed Memory (DM) is used as the embedding learning algorithm with a negative sampling rate of 5.0. Before feature extraction, all the tweets were pre-processed by converting the characters to lower case, removing stop words, numbers, punctuation and words that contain no more than 3 characters, as well as stemming all the kept words with the Snowball Stemmer. Additionally, three classifiers are trained by using SVM with a linear kernel, random forests (RF) and gradient boosted trees (GBT). In the testing stage, the same text pre-processing and feature extraction is applied to the test instances, and each pair of two out of the three trained classifiers (SVM+RF, SVM+GBT and RF+GBT) is fused by averaging the probabilities for each class.</p>
      </abstract>
      <kwd-group>
        <kwd>Misogynous</kwd>
        <kwd>Multi-classifier Fusion</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Social media platforms have provided users with the ability to freely express
themselves; however, this freedom has also resulted in an increase in cyberhate such as bullying,
threats and abuse. A study has shown that 67% of teenagers aged 15 to 18
have been exposed to hate materials on social media, with
21% becoming victims of such materials [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another type of cyberhate that is
increasing and worrying is the use of hateful language, specifically misogyny, on
social media platforms like Twitter [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Misogyny is defined as a particular type of hate speech that is targeted
towards women [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], it is stated that online misogyny or abuse is linked
to domestic violence against women offline. For example, 48% of women in the
UK that have been victims of domestic violence have also been victims of online
abuse. Likewise, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] stated that misogynist abuse, as well as threats that are
targeted towards many women, is amplified by other social media users joining
in for entertainment or to drive out the targeted user.
      </p>
      <p>Therefore, given the increasing evidence that cyberhate is
becoming a threat to society, it has become necessary to implement
automated techniques to classify cyberhate so as to reduce the burden
on those responsible for public safety. Hence, the aims of this study are to: 1.
identify and distinguish between misogynous and non-misogynous contents; 2.
classify misogynistic behaviour into several behavioural types; and 3. identify whether
the target of the misogynistic behaviour is active or passive.</p>
    </sec>
    <sec id="sec-2">
      <title>Experiment Description</title>
      <p>
        The experiment was carried out using misogynous text collected from Twitter [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
which was made available as a training set (with labels) and a separate test set
(without the labels).
      </p>
      <p>
        Due to the noisy nature of social media data [
        <xref ref-type="bibr" rid="ref2 ref5 ref6 ref8">2, 5, 6, 8</xref>
        ], it is necessary to
rigorously pre-process the data to improve its quality as well as the performance
of the classifiers [
        <xref ref-type="bibr" rid="ref2 ref8">2, 8</xref>
        ]. Therefore, the data sets were pre-processed and classified
using different machine learning classifiers.
      </p>
      <p>Figure 1 shows the experimental processes for this study, while subsequent
sections provide a detailed description of the experiment and the results of the
classification. Both the training set and the test set contain only English-language
text, with a total of 3,977 instances as shown
in Table 1, which also contains a brief description of the features. Table 2 shows
a detailed description of the training set labels with the number of instances for
each category.</p>
      <p>
        The training and test datasets were separately pre-processed and filtered using
standard text pre-processing features. User names, URLs and non-ASCII
characters were removed. The tweets were converted to lower case, and the stop words,
numbers and punctuation characters were filtered out. Words that contain fewer
than 3 characters were removed and all the remaining words were stemmed using
the Snowball Stemmer in Knime [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
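      <p>As an illustration, the pre-processing steps above could be implemented along the following lines. This is a minimal sketch in Python using NLTK's Snowball stemmer rather than the KNIME nodes the study used; the stop-word set here is a small illustrative subset, not the actual list applied in the experiment.</p>

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Illustrative stop-word subset; the study relied on a standard stop-word list.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "to", "of", "in", "on"}
stemmer = SnowballStemmer("english")

def preprocess(tweet):
    tweet = re.sub(r"@\w+|https?://\S+", " ", tweet)   # remove user names and URLs
    tweet = tweet.encode("ascii", "ignore").decode()   # remove non-ASCII characters
    tokens = re.findall(r"[a-z]+", tweet.lower())      # lower case; drops numbers and punctuation
    kept = [w for w in tokens if w not in STOPWORDS and len(w) >= 3]
    return [stemmer.stem(w) for w in kept]             # Snowball stemming of the kept words
```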
      <p>In addition to the standard pre-processing, features for each of the four
labels, Id, Misogynous, Misogynous-category and Target, were extracted
individually using the Doc2Vec Learner. The learner used a context window size of 2, a
minimum word frequency of 2, a sampling rate of 0.001, a learning rate
of 0.025, a minimum learning rate of 1.0E-4, a layer size of 200, a batch size of 10000 and
40 epochs. Also, Distributed Memory (DM) is used as the embedding
learning algorithm with a negative sampling rate of 5.0. The extracted features were
exported as tables for classification.</p>
      <p>The pre-processed training set was used to train five classifiers using the following
machine learning algorithms: SVM with a linear kernel, random forests (RF),
gradient boosted trees (GBT), decision tree (DT) and Naive Bayes (NB). The
10-fold cross-validation approach was used for evaluation, and the results of this
experiment were used to determine the algorithms with the best performance
among the five machine learning algorithms used.</p>
      <p>To obtain the labels for the test set, the three highest-performing trained
classifiers (determined as described in the previous paragraph) were used: SVM with
a linear kernel, random forests (RF) and gradient boosted trees (GBT). These
classifiers were paired, and each pair of two out of the three trained classifiers,
i.e. SVM+RF, SVM+GBT and RF+GBT, was fused using algebraic fusion to
combine the probabilities for each class by averaging them. This experiment was
executed three times.</p>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>
        In this section, the results of the experiment for the three runs are discussed
and compared to the results published in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The three machine learning
classifiers used were selected based on the results achieved when training the
classifiers. Table 3 shows the results achieved for misogyny identification,
misogynistic behaviour classification and misogynistic target classification in
this study. Table 4 shows the best results on the test set published
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Table 5 shows the results of our approach on the test set.
The results show that accuracies of up to 0.627, 0.247 and 0.623 were
achieved for misogyny identification, misogynistic behaviour classification
and misogynistic target classification, respectively. It can also be observed
that the results in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] show better accuracy for misogyny identification and
misogynistic behaviour classification. We assume that the incompatibility of the
features in the training set with the features in the test set had an effect on the
performance in this experimental study.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this study, text containing both misogynous and non-misogynous contents
was extracted from Twitter. The extracted training set was pre-processed and
used to train three machine learning classifiers. With this text, we were able to achieve
an accuracy of 0.624 (misogyny identification), 0.247 (misogynistic behaviour
classification) and 0.623 (misogynistic target classification). On the test set, the
performance was lower, which we believe is due to the incompatibility between
the features extracted from the training set and the ones extracted from the test
set; we will investigate this issue further. Additionally, we strongly believe it is
imperative to further improve the identification and classification performance for
misogynous contents on social media, specifically Twitter, as this can potentially
save lives.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anzovino</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fersini</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic identification and classification of misogynistic language on twitter</article-title>
          .
          <source>In: International Conference on Applications of Natural Language to Information Systems</source>
          . pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Barbosa</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
          </string-name>
          , J.:
          <article-title>Robust sentiment detection on twitter from biased and noisy data</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on computational linguistics: posters</source>
          . pp.
          <fpage>36</fpage>
          -
          <lpage>44</lpage>
          . Association for Computational Linguistics (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norrie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rumpel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wibberley</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>Misogyny on twitter</article-title>
          .
          <source>Demos</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Berthold</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cebron</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gabriel</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          , Kotter, T.,
          <string-name>
            <surname>Meinl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sieb</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thiel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiswedel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>KNIME: The Konstanz Information Miner</article-title>
          . In:
          <article-title>Studies in Classification, Data Analysis, and Knowledge Organization (GfKL</article-title>
          <year>2007</year>
          ). Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Burnap</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colombo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , Scourfield, J.:
          <article-title>Machine classification and analysis of suicide-related communication on twitter</article-title>
          .
          <source>In: Proceedings of the 26th ACM conference on hypertext &amp; social media</source>
          . pp.
          <fpage>75</fpage>
          -
          <lpage>84</lpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Colombo</surname>
            ,
            <given-names>G.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burnap</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hodorog</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scourfield</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Analysing the connectivity and communication of suicidal users on twitter</article-title>
          .
          <source>Computer Communications</source>
          <volume>73</volume>
          ,
          <fpage>291</fpage>
          -
          <lpage>300</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fersini</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anzovino</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the task on automatic misogyny identification at IberEval</article-title>
          .
          <source>In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval</source>
          <year>2018</year>
          ),
          <article-title>co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN</article-title>
          <year>2018</year>
          ). pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          . CEUR-WS.org (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Haddi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>The role of text pre-processing in sentiment analysis</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>17</volume>
          ,
          <fpage>26</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hewitt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiropanis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bokhove</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The problem of identifying misogynist language on twitter (and other online social spaces)</article-title>
          .
          <source>In: Proceedings of the 8th ACM Conference on Web Science</source>
          . pp.
          <fpage>333</fpage>
          -
          <lpage>335</lpage>
          . ACM (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jane</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Online misogyny and feminist digilantism</article-title>
          .
          <source>Continuum</source>
          <volume>30</volume>
          (
          <issue>3</issue>
          ),
          <fpage>284</fpage>
          -
          <lpage>297</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Perry</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olsson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Cyberhate: the globalization of hate</article-title>
          .
          <source>Information &amp; Communications Technology Law</source>
          <volume>18</volume>
          (
          <issue>2</issue>
          ),
          <fpage>185</fpage>
          -
          <lpage>199</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>