<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection of Social Network Toxic Comments with Usage of Syntactic Dependencies in the Sentences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>rhiy Shtov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shtov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vinnytsia National Technical University</institution>
          ,
          <addr-line>Khmelnytske Shose, 95, Vinnytsia, 21021</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Social networks sometimes become a medium for threats, insults, and other components of cyberbullying. A huge number of people are involved in online social networks; hence, protecting network users from anti-social behavior is an important activity. One of the major tasks of such activity is the automated detection of toxic comments containing threats, insults, obscenity, etc. Bag-of-words statistics and bag-of-symbols statistics are typical features for toxic comment detection. The effect of syntactic dependencies in sentences on the quality of detection of social network toxic comments is studied in the article for the first time. Syntactic dependencies are relationships with proper nouns, personal pronouns, possessive pronouns, etc. Twenty syntactic features of sentences have been verified in total. The paper shows that 3 additional specific features significantly improve the quality of toxic comment detection. These three features are: the number of dependencies with proper nouns in the singular, the number of dependencies that contain bad words, and the number of dependencies between personal pronouns and bad words. The experiments are based on data from the kaggle competition "Toxic Comment Classification Challenge". For our experiments, the original data set of 159751 comments was reduced to 106590 comments due to problems with human-free extraction of the syntactic features. Due to the unbalanced data set, we use the mean of the error rates of each type of misclassification as the quality metric. A decision tree is used as a classifier. The decision trees were synthesized with two splitting rules: the Gini index and the deviance criterion.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>syntactic dependencies</kwd>
        <kwd>toxic comments</kwd>
        <kwd>social network</kwd>
        <kwd>machine learning</kwd>
        <kwd>features selection</kwd>
        <kwd>balanced accuracy</kwd>
        <kwd>decision tree</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Toxic comments are textual comments with threats, insults, obscenity, racism, etc.</p>
      <p>
        Various techniques are used for automated detection of toxic comments. Bag-of-words statistics and bag-of-symbols statistics are the typical source information for toxic comment detection. The following statistics-based features are usually used: length of the comment, number of capital letters, number of exclamation marks, number of question marks, number of spelling errors, number of tokens with non-alphabet symbols, number of abusive, aggressive, and threatening words in the comment, etc. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A high count of bad words in a comment increases the chance of classifying it as toxic. However, there are some difficulties with using bad-word statistics. Some out-of-vocabulary words are produced by typos and spelling errors. Authors of toxic comments often distort their bad words on purpose. They convert bad words to phonetically identical forms by replacing letter combinations: oo with u, for with 4, too with 2, etc. Another variant is distortion into visually similar forms, for example, 5h1t, b!tch, b1tch. Researchers develop special techniques for detecting masked bad words [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], but vandals keep an advantage in time and in numbers. In addition to analyzing separate keywords, some methods take into account the order of the words in sentences. For example, the authors of [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] used an n-gram-based approach, but such modeling does not capture all the relations in sentences.
      </p>
      <p>
        The aim of the paper is to study the effect of syntactic dependencies in sentences on the quality of detecting toxic comments in social networks. Syntactic dependencies are relationships with proper nouns, personal pronouns, possessive pronouns, etc. In contrast to the n-gram method and the naive Bayesian approach, a model based on syntactic dependencies is not directly tied to the training-set vocabulary. All the various proper names, personal pronouns, and possessive pronouns are allocated into separate groups. This allows the model to use vocabulary-free generalized features: a new instance from such a group in the test set will not affect the simulation negatively. We use the information technology from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for extracting the syntactic features from the data set. We compare the results of toxic comment detection on two sets of features. The first set consists of typical features based on bag-of-words statistics and bag-of-symbols statistics. The second one is an extended set that contains the typical features together with the syntactic features. The experiments are performed on the “Toxic Comment Classification Challenge” data set.
      </p>
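      <p>As an illustration of the kind of features studied below, the following minimal sketch counts three dependency-based features over an already-parsed comment. It is not the extraction technology of [6]: the edge representation, the Penn Treebank tags (NNP, PRP), and the tiny bad-word list are all assumptions of this example.</p>

```python
# Sketch: counting dependency-based features over a parsed comment.
# Each dependency edge is a pair of (token, POS-tag) tuples; Penn
# Treebank tags are assumed (NNP = singular proper noun, PRP =
# personal pronoun). BAD_WORDS is a toy stand-in for real blacklists.

BAD_WORDS = {"stupid", "idiot", "moron"}

def syntactic_features(edges):
    n_propn = n_bad = n_pron_bad = 0
    for (w1, t1), (w2, t2) in edges:
        words = {w1.lower(), w2.lower()}
        tags = {t1, t2}
        if "NNP" in tags:                    # dependency with a singular proper noun
            n_propn += 1
        if words.intersection(BAD_WORDS):    # dependency containing a bad word
            n_bad += 1
        if "PRP" in tags and words.intersection(BAD_WORDS):
            n_pron_bad += 1                  # personal pronoun tied to a bad word
    return {"x26": n_propn, "x39": n_bad, "x43": n_pron_bad}

# Edges from a dependency parse of "you are a stupid kid"
edges = [
    (("you", "PRP"), ("are", "VBP")),
    (("stupid", "JJ"), ("kid", "NN")),
    (("you", "PRP"), ("stupid", "JJ")),
]
print(syntactic_features(edges))  # {'x26': 0, 'x39': 2, 'x43': 1}
```

      <p>Because such counts are taken over tag groups rather than concrete words, they generalize to vocabulary unseen in the training set.</p>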
    </sec>
    <sec id="sec-2">
      <title>Data sets and preprocessing</title>
      <p>
        The data set “Toxic Comment Classification Challenge” was collected by the Conversation AI team, a research initiative founded by Jigsaw and Google, both a part of Alphabet. The data set was used in a kaggle competition [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It consists of 159751 Wikipedia comments which have been labeled by human raters for toxic behavior. Most of the comments are in English [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Each comment is manually categorized with 6 binary labels: toxic, severe toxic, obscene, threat, insult, and identity hate. Some comments have toxic multiplicity: such a comment belongs to 2, 3, or even all 6 toxic categories simultaneously (Figure 1). A comment may also be neutral, i.e. belong to no toxic category. For example, the comment “Your vandalism to the Matt Shirvington article has been reverted. Please don't do it again, or you will be banned.” is neutral. The comment “Hi! I am back again! Last warning! Stop undoing my edits or die!” is toxic and threat, and the comment “Would you both shut up, you don't run Wikipedia, especially a stupid kid.” is toxic and insult.</p>
      <p>16225 comments have toxic labels; the rest of the comments are neutral. The distribution of the comments over toxic multiplicities is presented in Figure 2. It shows that only comments with high toxicity multiplicity are rarely encountered. Most of the toxic comments (60.8%) belong to several toxic categories (m&gt;1). All the obscene comments belong to the toxic category – the blue square is completely inside the red square (Figure 3). Also, almost all the severe toxic comments are obscene and insult. There are 3 very weakly intersecting categories: severe toxic, threat, and identity hate. Few comments belong simultaneously to two out of these three categories. Figure 3 also shows the degree of similarity of two finite sets in the form of the Jaccard index (kj). It is calculated as the cardinality of the intersection of the sets divided by the cardinality of the union of the sets. In our case the Jaccard index corresponds to the ratio of the area of the intersection of two squares to the area of their union.</p>
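      <p>The similarity measure used in Figure 3 can be sketched as follows; the comment ids in the example are illustrative, not taken from the data set.</p>

```python
# Jaccard index of two label sets: |intersection| / |union|,
# computed over the sets of comment ids carrying each label.
# The ids below are illustrative, not from the data set.

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)

toxic = {1, 2, 3, 4, 5}
obscene = {2, 3, 4}  # a subset of toxic, as in Figure 3
print(jaccard(toxic, obscene))  # 0.6
```

      <p>When one label set is a subset of the other, as with obscene and toxic, the index reduces to the ratio of their cardinalities.</p>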
      <p>
        We propose to add several specific features to the typical feature set based on bag-of-words statistics and bag-of-symbols statistics. The specific features take into account some syntactic dependencies between words in a comment. The specific feature extraction was done using the technology from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The specific features were extracted automatically for 106590 comments. Feature extraction for some comments was unsuccessful due to non-English text and out-of-vocabulary words. As a result, the modified data set contains 66.8% of the source data set. Neutral comments compose 87.2% of the modified data set, slightly less than in the source data set, where the neutral ratio is 89.8%. The distributions of the comments over toxic categories are almost equal for the two data sets (Table 1).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Features and quality metric</title>
      <p>x22 is the number of the comment’s words included in the Facebook blacklist at https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/;</p>
      <p>x23 is the number of the comment’s words included in the Google blacklist at https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/;
x24 is the number of the comment’s words included in the naughty word list at https://gist.github.com/ryanlewis/a37739d710ccdb4b406d;
x25 is the number of the comment’s words included in the 5 mentioned lists;
x26 is the number of dependencies with proper nouns in the singular;
x27 is the number of dependencies with proper nouns in the plural;
x28 is the number of dependencies with personal pronouns;
x29 is the number of dependencies with possessive pronouns;
x30 is the number of dependencies with denial (with the words never or not);
x31 is the number of dependencies with denial that contain proper nouns in the singular;</p>
      <p>x32 is the number of dependencies with denial that contain proper nouns in the plural;
x33 is the number of dependencies with denial that contain personal pronouns;
x34 is the number of dependencies with denial that contain possessive pronouns;
x35 is the number of dependencies between proper nouns in the singular and the words from dependencies with denial;</p>
      <p>x36 is the number of dependencies between proper nouns in the plural and the words from dependencies with denial;</p>
      <p>x37 is the number of dependencies between personal pronouns and the words from dependencies with denial;</p>
      <p>x38 is the number of dependencies between possessive pronouns and the words from dependencies with denial;
x39 is the number of dependencies that contain the bad words;
x40 is the number of dependencies with denial that contain the bad words;
x41 is the number of dependencies between proper nouns in the singular and the bad words;</p>
      <p>x42 is the number of dependencies between proper nouns in the plural and the bad words;
x43 is the number of dependencies between personal pronouns and the bad words;
x44 is the number of dependencies between possessive pronouns and the bad words;
x45 is the number of dependencies between pronouns and the bad words.</p>
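      <p>A minimal sketch of the blacklist-count features x22–x25 is given below; the word lists are tiny toy stand-ins for the real blacklists linked above, and the tokenization is an assumption of this illustration.</p>

```python
import re

# Sketch of the blacklist-count features x22-x25: how many of the
# comment's words occur in each list. The lists here are tiny toy
# stand-ins for the Facebook/Google/naughty-word lists linked above.

FACEBOOK = {"moron"}
GOOGLE = {"idiot", "moron"}
NAUGHTY = {"stupid"}
ALL_LISTS = FACEBOOK.union(GOOGLE, NAUGHTY)  # stand-in for the union of the 5 lists

def blacklist_counts(comment):
    words = re.findall(r"[a-z']+", comment.lower())
    return {
        "x22": sum(w in FACEBOOK for w in words),
        "x23": sum(w in GOOGLE for w in words),
        "x24": sum(w in NAUGHTY for w in words),
        "x25": sum(w in ALL_LISTS for w in words),
    }

print(blacklist_counts("You stupid moron, stop it"))
# {'x22': 1, 'x23': 1, 'x24': 1, 'x25': 2}
```
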
      <p>The twenty specific features x26–x45 are examined for toxic comment detection for the first time. Let us modify the original kaggle task of categorizing the toxic comments into a classification task with two alternatives: a neutral comment and a general toxic comment. This makes it easy to check the informative levels of the proposed syntactic features.</p>
      <p>
        The data set is unbalanced with a class proportion of about 9 to 1. Hence, the misclassification rate is not a suitable quality metric for the classifier. Following [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we use the balanced accuracy approach. The quality metric of the classifier is as follows: Qaver = (Pnt + Ptn) / 2, where Pnt denotes the probability of n→t classification errors, when a neutral comment is recognized as a general toxic comment, and Ptn denotes the probability of t→n classification errors, when a general toxic comment is recognized as a neutral comment. Qaver is the mean of the probabilities of each type of misclassification. It is a simple and interpretable metric for examining a classifier on an unbalanced data set.
      </p>
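      <p>The metric can be sketched directly from its definition; the toy labels below are illustrative.</p>

```python
# Q_aver as defined above: the mean of the two per-class error rates
# P_nt (neutral taken for toxic) and P_tn (toxic taken for neutral).
# For two classes this equals 1 minus the balanced accuracy.

def q_aver(y_true, y_pred):
    # labels: 0 = neutral, 1 = toxic
    neutral = [p for t, p in zip(y_true, y_pred) if t == 0]
    toxic = [p for t, p in zip(y_true, y_pred) if t == 1]
    p_nt = sum(p == 1 for p in neutral) / len(neutral)
    p_tn = sum(p == 0 for p in toxic) / len(toxic)
    return (p_nt + p_tn) / 2

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(q_aver(y_true, y_pred))  # (0.25 + 0.5) / 2 = 0.375
```

      <p>Unlike the raw misclassification rate, this value cannot be driven down by simply predicting the majority (neutral) class for every comment.</p>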
    </sec>
    <sec id="sec-4">
      <title>Computational experiments</title>
      <p>A decision tree is used as a classifier. We chose this kind of classifier for the following reasons: 1) synthesis of a decision tree is a fast procedure even for a large training set, hence it is possible to carry out several experiments; 2) feature selection is carried out during the decision tree synthesis, so it is easy to check the informative levels of the proposed syntactic features. We divide the data set into training data and test data. The test set consists of every sixth comment; the remaining comments are in the training set. Thus, the test set contains 17765 comments and the training set contains 88825 comments. We use the training data for decision tree synthesis. After this, the decision tree is pruned to minimize Qaver on the test set. We check two sets of features: the typical set x1–x25 and the extended set x1–x45.</p>
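      <p>The deterministic split described above can be sketched as follows; on 106590 comments it reproduces the training and test sizes reported in the text.</p>

```python
# The split described above: every sixth comment goes to the test
# set, the rest to training. On 106590 comments this reproduces the
# sizes reported in the text (88825 training, 17765 test).

def split_every_sixth(items):
    test = [x for i, x in enumerate(items, start=1) if i % 6 == 0]
    train = [x for i, x in enumerate(items, start=1) if i % 6 != 0]
    return train, test

train, test = split_every_sixth(list(range(106590)))
print(len(train), len(test))  # 88825 17765
```
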
      <p>Rebalancing of the class distribution is achieved by sampling that increases the weights of the minority-class objects. We suppose that correct classification of a comment with high toxic multiplicity is more important than of a comment with low toxic multiplicity. The weight w of a toxic comment C is defined by the following heuristic formula: w(C) = b + m(C), where b denotes the bias of the toxic comment weight and m(C) ∈ {1, 2, ..., 6} denotes the toxic multiplicity of comment C.</p>
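      <p>A sketch of the weighting rule follows. Note that the additive form w(C) = b + m(C) is reconstructed from a garbled formula in the source, and the handling of neutral comments (weight 1) is an assumption of this illustration.</p>

```python
# The weighting rule reconstructed from the text as w(C) = b + m(C);
# the additive form and the treatment of neutral comments (weight 1)
# are assumptions of this sketch.

def comment_weight(multiplicity, b=5.2):
    if multiplicity == 0:        # neutral comment
        return 1.0
    return b + multiplicity      # toxic comment, m(C) in {1, ..., 6}

print([comment_weight(m) for m in range(0, 4)])  # [1.0, 6.2, 7.2, 8.2]
```
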
      <p>Figure 4 shows the dependence of the classifier quality on the bias of the toxic comment weight. The decision trees were synthesized with two splitting rules: a Gini index-based rule and a deviance criterion-based rule. The experiments show that the Gini index-based rule provides better decision trees. Qaver is low when the bias of the toxic comment weight belongs to [4.5, 5.8]. The minimal value Qaver = 0.118 is obtained for b ∈ [5.2, 5.5]. Figure 4 shows that the extended set of features significantly improves the classifier quality.</p>
      <p>The best model is the decision tree with the minimal value of Qaver. The best decision tree is presented in Figure 5. The misclassification rate of the best decision tree is Q = 0.0987. The other metrics of the best decision tree are as follows: Qaver = 0.118, Pnt = 0.0919, and Ptn = 0.1442. The best tree correctly detects almost all the comments with high and average toxic multiplicities (Figure 6). The best tree correctly detects almost all the toxic comments with labels severe toxic, obscene, and identity hate (Figure 7).</p>
      <p>Let us analyze the 5 best trees. All the trees use the following features: x3–x9, x15, x17–x19, x22, x24–x26, x39, and x43. 4 out of 5 trees additionally use feature x1. Among their most important features are 3 new syntactic ones: the number of dependencies with proper nouns in the singular (x26), the number of dependencies that contain the bad words (x39), and the number of dependencies between personal pronouns and the bad words (x43).</p>
      <p>We also point out the following 4 slightly less important features. Typical features x2, x10, and x12 are in 2 out of 5 best trees. Syntactic feature x28 is selected in 1 out of 5 best trees. These 4 extra features may be used in more complicated models for toxic comment detection.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The problem of detecting toxic comments in social networks was considered. For our experiments we used the kaggle data set "Toxic Comment Classification Challenge". Bag-of-words statistics and bag-of-symbols statistics are the typical features for detecting toxic comments. The effect of syntactic dependencies in sentences on the quality of toxic comment detection was studied in the article. Syntactic dependencies are relationships with proper nouns, personal pronouns, possessive pronouns, etc. In total, 20 syntactic features of sentences were checked.</p>
      <p>The novelty of the research is the experimental confirmation that 3 additional specific features significantly improve the quality of toxic comment detection. These three features are: the number of dependencies with proper nouns in the singular, the number of dependencies that contain bad words, and the number of dependencies between personal pronouns and bad words. The selection of 3 specific features significantly reduces the computational complexity of text comment preprocessing, since the calculation of all 20 specific features requires a lot of resources. Accordingly, with 3 specific features added to the typical set, the identification of toxic comments can be done in real time with good quality.</p>
      <p>Acknowledgements. The authors thank Olexandr Yahimovych for extracting the syntactic features from the data set of toxic comments. This research is supported by the government scientific project 46–G–388 «Fuzzy logic and computational linguistics based identification of hidden dependencies in online social networks».</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Salminen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Anatomy of online hate: developing a taxonomy and machine learning models for identifying and classifying hate in online news media</article-title>
          .
          <source>In: Proceeding of the Twelfth International AAAI Conference on Web and Social Media</source>
          , pp.
          <fpage>330</fpage>
          -
          <lpage>339</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khurana</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tewari</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Identifying Aggression and Toxicity in Comments using Capsule Network</article-title>
          .
          <source>In: Proceeding of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)</source>
          , pp.
          <fpage>98</fpage>
          -
          <lpage>105</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sood</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Churchill</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          :
          <article-title>Using Crowdsourcing to Improve Profanity Detection</article-title>
          .
          <source>In: Proceeding of Association for the Advancement of Artificial Intelligence. Spring Symposium: Wisdom of the Crowd</source>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>74</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Is preprocessing of text really worth your time for toxic comment classification?</article-title>
          <source>In: Proceeding of International Conference on Artificial Intelligence</source>
          , pp.
          <fpage>447</fpage>
          -
          <lpage>453</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Warner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirschberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Detecting hate speech on the world wide web</article-title>
          .
          <source>In Proceedings of the Second Workshop on Language in Social Media. Association for Computational Linguistics</source>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bisikalo</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yahimovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yahimovich</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Development of the method for filtering verbal noise while search keywords for the English text</article-title>
          .
          <source>Technology Audit and Production Reserves</source>
          .
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>33</fpage>
          -
          <lpage>41</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <source>Toxic Comment Classification Challenge</source>
          . Available: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Elnaggar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.:
          <article-title>Stop Illegal Comments: A Multi-Task Deep Learning Approach</article-title>
          . arXiv preprint arXiv:1810.06665 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Brodersen</surname>
            ,
            <given-names>K.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ong</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stephan</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buhmann</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>The balanced accuracy and its posterior distribution</article-title>
          .
          <source>In Proceedings of the 20th IEEE International Conference on Pattern Recognition</source>
          , pp.
          <fpage>3121</fpage>
          -
          <lpage>3124</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>