<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Ptaszynski</string-name>
          <email>ptaszynski@cs.kitami-it.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fumito Masui</string-name>
          <email>f-masui@cs.kitami-it.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yasutomo Kimura</string-name>
          <email>kimura@res.otaru-uc.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafal Rzepka</string-name>
          <email>kabura@media.eng.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenji Araki</string-name>
          <email>araki@media.eng.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Kitami Institute of Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information and Management Science, Otaru University of Commerce</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Graduate School of Information Science and Technology, Hokkaido University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>[Ptaszynski et al. 2010] performed affect analysis of a small
dataset of cyberbullying entries and found that vulgar words were
their distinctive feature. They applied a lexicon of
such words to train an SVM classifier. With a number of
optimizations the system was able to detect cyberbullying with
an F-score of 88.2%. However, increasing the amount of data caused a
decrease in results, which led them to conclude that SVMs are not
ideal for dealing with the frequent language ambiguities typical
of cyberbullying. Next, [Matsuba et al. 2011] proposed a
method to automatically detect harmful entries by extending
the SO-PMI-IR score to calculate the relevance of a document
to harmful content. Using a small number of
seed words, they were able to detect large numbers of
candidates for harmful documents with an accuracy of 83%.
Finally, [Nitta et al. 2013] proposed an improvement to
the method of Matsuba et al. They calculated the SO-PMI-IR score for
three categories of seed words (abusive, violent, obscene)
and selected the category with the highest relevance. Their method
achieved 90% Precision at 10% Recall.</p>
      <p>Most of the previous research assumed that using vulgar
words as seeds would help detect cyberbullying. However, all of
these studies note that vulgar words are only one kind of distinctive
vocabulary and do not cover all cases. We assumed that such
vocabulary can be extracted automatically. Moreover, we did
not restrict the scope to words, but extended the search to
sophisticated patterns with disjoint elements. To achieve this,
we applied a pattern extraction method based on the idea of
a brute-force search algorithm.</p>
    </sec>
    <sec id="sec-2">
      <title>Method Description</title>
      <p>We assumed that applying sophisticated patterns with disjoint
elements should provide deeper insight than the usual
bag-of-words or n-gram approach. Such patterns, if defined as
ordered combinations of sentence elements, could be extracted
automatically. Algorithms using a combinatorial approach
usually generate a massive number of combinations, each a potential
answer to a given problem. Thus they are often called
brute-force search algorithms. We assumed that optimizing the
combinatorial algorithm to the requirements of the problem should
make it advantageous in language processing tasks.</p>
      <p>In the proposed method, ordered non-repeated
combinations are first generated from all elements of a sentence. In
every n-element sentence there are k-element combination
clusters for each k such that 1 ≤ k ≤ n, where a cluster contains all
k-element combinations drawn from the n elements. In this procedure
all combinations for all values of k are generated. The
number of all combinations is equal to the sum of the sizes of all k-element
combination clusters (see eq. 1).</p>
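      <p>The generation step described above can be sketched as follows. This is a minimal illustration, not our actual implementation: the function name ordered_combinations and the example sentence are hypothetical, and the standard-library itertools.combinations is used because it preserves the original order of elements and never repeats an element.</p>
      <preformat>
```python
from itertools import combinations


def ordered_combinations(tokens):
    """Return every ordered, non-repeated k-element combination, k = 1..n.

    Each combination is a tuple of (position, token) pairs, so the
    original sentence positions remain available for later steps.
    """
    indexed = list(enumerate(tokens))
    result = []
    for k in range(1, len(indexed) + 1):
        # one "combination cluster" per value of k
        result.extend(combinations(indexed, k))
    return result


# Hypothetical tokenized sentence (not taken from the dataset).
tokens = ["you", "are", "so", "stupid"]
combos = ordered_combinations(tokens)
print(len(combos))  # 15, which equals 2**4 - 1
```
      </preformat>
      <p>Because the count grows exponentially with sentence length, a practical system would likely need to cap k or the sentence length; the cap is left out of this sketch.</p>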
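      <p><disp-formula><label>(1)</label><tex-math>\sum_{k=1}^{n}\binom{n}{k} = \frac{n!}{1!(n-1)!} + \frac{n!}{2!(n-2)!} + \dots + \frac{n!}{n!(n-n)!} = 2^{n}-1</tex-math></disp-formula>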
Next, all non-subsequent elements are separated with an
asterisk (“*”). The occurrences O of each pattern on each side of
the dataset are used to calculate its normalized weight
w<sub>j</sub> (eq. 2). The score of a sentence is calculated as the
sum of the weights of the patterns found in that sentence (eq. 3).
The weight can be further modified by either
awarding pattern length k (LA), or
awarding both length k and occurrence O (LO).</p>
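      <p>A minimal sketch of pattern extraction, weighting and sentence scoring is given here for illustration. The weight formula used, (O_harmful - O_non_harmful) / (O_harmful + O_non_harmful), is an assumption chosen so that weights fall between -1 and 1 and patterns occurring equally often on both sides get weight 0; it is not a reproduction of eq. 2. Counting each pattern once per sentence is also an assumption. All function names and the toy sentences are hypothetical, and the LA and LO modifications are omitted.</p>
      <preformat>
```python
from collections import Counter
from itertools import combinations


def extract_patterns(tokens):
    """Return the set of patterns found in one tokenized sentence.

    Elements that were not adjacent in the sentence are joined with "*".
    """
    indexed = list(enumerate(tokens))
    patterns = set()
    for k in range(1, len(indexed) + 1):
        for combo in combinations(indexed, k):
            parts = [combo[0][1]]
            for (prev_pos, _), (pos, tok) in zip(combo, combo[1:]):
                if pos - prev_pos != 1:
                    parts.append("*")  # gap between disjoint elements
                parts.append(tok)
            patterns.add(" ".join(parts))
    return patterns


def pattern_weights(harmful, non_harmful):
    """Assumed normalization: (O_h - O_n) / (O_h + O_n), in [-1, 1]."""
    occ_h = Counter(p for s in harmful for p in extract_patterns(s))
    occ_n = Counter(p for s in non_harmful for p in extract_patterns(s))
    weights = {}
    for p in set(occ_h) | set(occ_n):
        weights[p] = (occ_h[p] - occ_n[p]) / (occ_h[p] + occ_n[p])
    return weights


def score(tokens, weights):
    """Sum the weights of all known patterns found in the sentence."""
    return sum(weights.get(p, 0.0) for p in extract_patterns(tokens))


# Hypothetical toy data (not taken from the dataset).
harmful = [["you", "are", "stupid"], ["you", "stupid", "idiot"]]
non_harmful = [["you", "are", "kind"]]
weights = pattern_weights(harmful, non_harmful)
print(weights["stupid"], weights["are"], weights["kind"])  # 1.0 0.0 -1.0
```
      </preformat>
      <p>Under these assumptions a sentence with a positive score leans toward the harmful side, and the classification threshold can then be varied over the score range.</p>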
      <p>The list of frequent patterns can also be further modified by:
(1) discarding ambiguous patterns which appear the same
number of times on both sides (harmful and non-harmful), later called
“zero patterns” (0P), as their weight is equal to 0; or
(2) discarding ambiguous patterns of any ratio on both sides.</p>
      <p>We also compared the performance of sophisticated patterns
(PAT) to more common n-grams (NGR).</p>
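      <p>The two ways of pruning the pattern list can be sketched as a single filter over occurrence counts. This is an illustrative sketch only: the function name discard_ambiguous, the drop_all flag and the toy counts are hypothetical.</p>
      <preformat>
```python
def discard_ambiguous(occ_harmful, occ_non_harmful, drop_all=False):
    """Return the patterns kept after pruning.

    By default only "zero patterns" (equal counts on both sides) are
    discarded. With drop_all=True every pattern seen on both sides is
    discarded, whatever the ratio of its counts.
    """
    kept = set()
    for p in set(occ_harmful) | set(occ_non_harmful):
        h = occ_harmful.get(p, 0)
        n = occ_non_harmful.get(p, 0)
        if h == n:
            continue  # zero pattern: its weight would be 0
        if drop_all and h and n:
            continue  # ambiguous pattern of any ratio
        kept.add(p)
    return kept


# Hypothetical occurrence counts (not taken from the dataset).
occ_h = {"a": 2, "b": 1, "c": 3}
occ_n = {"b": 1, "c": 1, "d": 2}
print(sorted(discard_ambiguous(occ_h, occ_n)))                 # ['a', 'c', 'd']
print(sorted(discard_ambiguous(occ_h, occ_n, drop_all=True)))  # ['a', 'd']
```
      </preformat>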
    </sec>
    <sec id="sec-3">
      <title>Evaluation Experiment</title>
      <sec id="sec-3-1">
        <title>Experiment Setup</title>
        <p>In the evaluation we used the dataset created by [Matsuba et
al. 2011], which was also used by [Ptaszynski et al.
2010] and more recently by [Nitta et al. 2013]. It contains 1,490
harmful and 1,508 non-harmful entries collected from
unofficial school Web sites and manually labeled by Internet Patrol
members according to the instructions included in the manual for
dealing with cyberbullying [MEXT 2008].</p>
        <p>The dataset was further preprocessed in three ways:
(1) Tokenization (TOK): all words, punctuation marks, etc. are
separated by spaces;
(2) Parts of speech (POS): words are replaced with their
representative parts of speech;
(3) Tokens with POS (TOK+POS): both the word and its POS information are
included in one element.</p>
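        <p>To make the three representations concrete, the following sketch builds them from a sentence that has already been tagged. It assumes the tagger output is available as (word, part-of-speech) pairs; the function name, the "|" separator inside the combined element, the tag names and the example sentence are all hypothetical and are not taken from the dataset.</p>
        <preformat>
```python
def build_representations(tagged):
    """Build the TOK, POS and TOK+POS views of one tagged sentence.

    `tagged` is a list of (word, part_of_speech) pairs.
    """
    tok = [word for word, _ in tagged]
    pos = [tag for _, tag in tagged]
    # word and POS fused into a single sentence element
    tok_pos = [f"{word}|{tag}" for word, tag in tagged]
    return {"TOK": tok, "POS": pos, "TOK+POS": tok_pos}


# Hypothetical tagged sentence; words and tags are illustrative only.
tagged = [("you", "PRON"), ("are", "VERB"), ("stupid", "ADJ")]
views = build_representations(tagged)
print(" ".join(views["TOK"]))      # you are stupid
print(" ".join(views["POS"]))      # PRON VERB ADJ
print(" ".join(views["TOK+POS"]))  # you|PRON are|VERB stupid|ADJ
```
        </preformat>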
        <p>We compared the performance for each kind of dataset
preprocessing using 10-fold cross-validation and calculated the
results using standard Precision, Recall and balanced F-score.
There were several evaluation criteria. We checked which
version of the algorithm achieves the top scores within the
threshold span. We also looked at the break-even points (BEP) of
Precision and Recall and checked the statistical significance of the
results. Finally, we compared the performance to the baselines
[Matsuba et al. 2011; Nitta et al. 2013].</p>
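        <p>The evaluation measures can be sketched as follows, assuming each entry has a numeric score and is classified as harmful when its score reaches a given threshold. The function names, scores and labels are hypothetical; the break-even point is approximated as the threshold at which Precision and Recall are closest, ignoring thresholds at which nothing harmful is retrieved.</p>
        <preformat>
```python
def precision_recall_f(scores, labels, threshold):
    """Precision, Recall and balanced F-score at one threshold.

    labels: True for harmful entries. An entry is predicted harmful
    when its score is at least `threshold`.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def break_even_point(scores, labels, thresholds):
    """Return (threshold, P, R) where P and R are closest to each other."""
    candidates = []
    for t in thresholds:
        p, r, _ = precision_recall_f(scores, labels, t)
        if r:  # skip thresholds that retrieve no harmful entry
            candidates.append((abs(p - r), t, p, r))
    _, t, p, r = min(candidates)
    return t, p, r


# Hypothetical scores and gold labels (not results from the experiment).
scores = [0.9, 0.6, 0.2, -0.1, -0.5, -0.8]
labels = [True, True, False, True, False, False]
print(precision_recall_f(scores, labels, 0.5))
print(break_even_point(scores, labels, [-0.3, 0.0, 0.5]))
```
        </preformat>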
      </sec>
      <sec id="sec-3-2">
        <title>Results and Discussion</title>
        <p>Although the highest Precision observed at any threshold (P=.93) was achieved
by the POS feature set based on n-grams (NGR), its Recall and
F-score were the lowest (R=.02, F=.78). A similarly high P with
a much higher R (P=.89, R=.34) was achieved by tokens
with parts of speech based on either patterns or n-grams
(TOK+POS/PAT|NGR). This feature set also achieved the
highest general F-score (F=.8). Tokenization with POS
tagging also achieved the highest break-even point (BEP)
(P=.79, R=.79). In most cases deleting ambiguous patterns
yielded worse results, which suggests that such patterns,
despite being ambiguous (appearing in both cyberbullying and
non-cyberbullying entries), are in fact useful in practice.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Comparison with Previous Methods</title>
        <p>For the comparison with previous methods we used those
by [Matsuba et al. 2011] and [Nitta et al. 2013]. Moreover,
since the latter extracts cyberbullying relevance values from
the Web, we also repeated their experiment to find out how
the performance of the Web-based method had changed in
the two years since it was proposed. To make the
comparison fair, we used both our best and our worst settings. As the
evaluation metric we used the area under the curve (AUC) of
Precision and Recall (Fig. 1).</p>
        <p>[Fig. 1: Precision-Recall curves of the compared methods; horizontal axis: Recall (%), 0 to 100.]</p>
        <p>The highest overall results
were obtained by the best settings of the proposed method
(TOK+POS/PAT). Although the highest single score still belonged to
[Nitta et al. 2013], the performance of their method decreases quickly due
to a rapid drop in Precision at higher thresholds. Moreover,
when we repeated their experiment in January 2015, the
results dropped greatly. This could have happened due to: (1)
fluctuation in page rankings, which pushed the information lower
and made it no longer extractable; (2) frequent deletion
requests for harmful content by Internet Patrol members; (3)
tightening of usage and privacy policies by most Web service
providers. This argues for more focus on corpus-based
methods such as the one proposed in this paper.</p>
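        <p>The area under a Precision-Recall curve can be approximated with the trapezoidal rule over the measured points. The sketch below assumes the curve is given as (recall, precision) pairs on a 0 to 1 scale; the function name pr_auc and the example points are hypothetical and are not results from our experiment.</p>
        <preformat>
```python
def pr_auc(points):
    """Trapezoidal area under a Precision-Recall curve.

    points: iterable of (recall, precision) pairs, each in [0, 1].
    Only the recall range covered by the points is integrated.
    """
    pts = sorted(points)  # order by increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Hypothetical curve points (not taken from Fig. 1).
curve = [(0.1, 0.9), (0.5, 0.8), (1.0, 0.5)]
print(round(pr_auc(curve), 3))  # 0.665
```
        </preformat>
        <p>A method whose Precision collapses as Recall grows contributes little area beyond its first few points, which is why AUC rewards methods that stay precise across the whole threshold span.</p>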
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper we proposed a method for automatic detection of
cyberbullying (CB), a recently noticed social problem affecting
the mental health of Internet users.</p>
      <p>We applied a combinatorial algorithm to the automatic
extraction of sentence patterns, and used those patterns in text
classification of CB entries. The evaluation experiment
performed on actual CB data showed that our method outperformed
previous methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Matsuba et al. 2011]
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Masui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kawai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Isu</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A study on the polarity classification model for the purpose of detecting harmful information on informal school sites (in Japanese)</article-title>
          ,
          <source>In Proceedings of NLP2011</source>
          , pp.
          <fpage>388</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [MEXT 2008]
          <collab>Ministry of Education, Culture, Sports, Science and Technology (MEXT)</collab>
          .
          <year>2008</year>
          .
          <article-title>“Bullying on the Net” Manual for handling and collection of cases (for schools and teachers) (in Japanese)</article-title>
          .
          Published by <publisher-name>MEXT</publisher-name>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Nitta et al. 2013]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nitta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Masui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ptaszynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kimura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rzepka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Araki</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization</article-title>
          .
          <source>In Proceedings of IJCNLP</source>
          <year>2013</year>
          , pp.
          <fpage>579</fpage>
          -
          <lpage>586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Ptaszynski et al. 2010]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ptaszynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dybala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Masui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rzepka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Araki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Momouchi</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis</article-title>
          .
          <source>IJCLR</source>
          , Vol.
          <volume>1</volume>
          , Issue 3, pp.
          <fpage>135</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>