Brute Force Works Best Against Bullying

Michal Ptaszynski, Fumito Masui
Department of Computer Science, Kitami Institute of Technology
{ptaszynski,f-masui}@cs.kitami-it.ac.jp

Yasutomo Kimura
Department of Information and Management Science, Otaru University of Commerce
kimura@res.otaru-uc.ac.jp

Rafal Rzepka, Kenji Araki
Graduate School of Information Science and Technology, Hokkaido University
{kabura,araki}@media.eng.hokudai.ac.jp

1 Introduction

The problem of harmful and offensive messages on the Internet has existed for many years. One of the reasons such activities evolved is the anonymity of communication on the Internet, which gives users the feeling that anything can go unpunished. Recently the problem has been officially defined and labeled as cyberbullying (CB)¹.

In Japan the problem has become serious enough to be noticed by the Ministry of Education [MEXT 2008]. In 2007 Japanese school personnel and members of Parent-Teacher Associations (PTA) started monitoring activities under the general name Internet Patrol (later: net-patrol) to spot Web sites containing such inappropriate content. However, the net-patrol is performed manually as volunteer work, and the countless amount of data on the Internet makes this an uphill task. This situation motivated us to ease the burden on the net-patrol members and create a net-patrol crawler that automatically spots cyberbullying entries on the Web and reports them to the appropriate organizations.

In the following sections we first present some of the previous research related to ours. Next, we describe our method and the dataset used in this research. In the method we apply a combinatorial approach to language modeling, resembling brute-force search algorithms, to extract sophisticated patterns from sentences. Next, we use these patterns in a text classification task. Finally, we explain the evaluation settings and analyze and discuss the results. Evaluation on actual cyberbullying data showed that our method outperformed previous ones while minimizing human effort.

¹ http://www.ncpc.org/cyberbullying

2 Previous Research

[Ptaszynski et al. 2010] performed affect analysis of a small dataset of cyberbullying entries and found that their distinctive features were vulgar words. They applied a lexicon of such words to train an SVM classifier. With a number of optimizations the system was able to detect cyberbullying with an 88.2% F-score. However, increasing the amount of data caused a decrease in results, which led them to conclude that SVMs are not ideal for dealing with the frequent language ambiguities typical of cyberbullying. Next, [Matsuba et al. 2011] proposed a method to automatically detect harmful entries by extending the SO-PMI-IR score to calculate the relevance of a document to harmful contents. With the use of a small number of seed words they were able to detect large numbers of candidates for harmful documents with an accuracy of 83%. Finally, [Nitta et al. 2013] proposed an improvement to Matsuba et al.'s method. They calculated the SO-PMI-IR score for three categories of seed words (abusive, violent, obscene) and selected the one with the highest relevance. Their method achieved 90% Precision at 10% Recall.

Most of the previous research assumed that using vulgar words as seeds will help detect cyberbullying. However, all of these studies note that vulgar words are only one kind of distinctive vocabulary and do not cover all cases. We assumed that such vocabulary can be extracted automatically. Moreover, we did not restrict the scope to single words, but extended the search to sophisticated patterns with disjoint elements. To achieve this we applied a pattern extraction method based on the idea of a brute-force search algorithm.

3 Method Description

We assumed that applying sophisticated patterns with disjoint elements should provide deeper insight than the usual bag-of-words or n-gram approach. Such patterns, if defined as ordered combinations of sentence elements, can be extracted automatically. Algorithms using a combinatorial approach usually generate a massive number of combinations, i.e., potential answers to a given problem, and are therefore often called brute-force search algorithms. We assumed that optimizing the combinatorial algorithm to the requirements of the problem should make it advantageous in language processing tasks.

In the proposed method, firstly, ordered non-repeated combinations are generated from all elements of a sentence. For every n-element sentence there are combination clusters of size k, such that 1 ≤ k ≤ n, where each cluster contains all k-element combinations of the n elements. In this procedure all combinations for all values of k are generated, and the number of all combinations is equal to the sum of all k-element combination clusters (eq. 1).

    \sum_{k=1}^{n} \binom{n}{k} = \frac{n!}{1!(n-1)!} + \frac{n!}{2!(n-2)!} + \dots + \frac{n!}{n!(n-n)!} = 2^{n} - 1    (1)

Next, all non-subsequent elements within a combination are separated with an asterisk ("*"). The occurrences O_{pos} and O_{neg} of each pattern on the harmful and non-harmful sides of the dataset are used to calculate its normalized weight w_j (eq. 2). The score of a sentence is calculated as the sum of the weights of all patterns found in the sentence (eq. 3).

    w_j = \left( \frac{O_{pos}}{O_{pos} + O_{neg}} - 0.5 \right) \cdot 2    (2)

    score = \sum_{j} w_j, \quad (1 \geq w_j \geq -1)    (3)
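To make the procedure concrete, the following is a minimal Python sketch of the pattern generation (eq. 1), weighting (eq. 2) and scoring (eq. 3) steps described above. The function names (extract_patterns, pattern_weights, score_sentence) and the representation of sentences as lists of elements are our own illustrative assumptions, not the original implementation.

```python
from itertools import combinations
from collections import Counter

def extract_patterns(elements):
    """Generate all ordered, non-repeated combinations of sentence elements
    (2^n - 1 patterns for an n-element sentence, cf. eq. 1), inserting an
    asterisk ("*") between elements that are not adjacent in the sentence."""
    patterns = []
    n = len(elements)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            parts = [elements[idx[0]]]
            for prev, cur in zip(idx, idx[1:]):
                if cur - prev > 1:  # non-subsequent elements are separated by "*"
                    parts.append("*")
                parts.append(elements[cur])
            patterns.append(" ".join(parts))
    return patterns

def pattern_weights(harmful_sentences, non_harmful_sentences):
    """Normalized weight w_j of each pattern (eq. 2), computed from its
    occurrences on the harmful (pos) and non-harmful (neg) sides of the data."""
    o_pos = Counter(p for s in harmful_sentences for p in extract_patterns(s))
    o_neg = Counter(p for s in non_harmful_sentences for p in extract_patterns(s))
    return {p: (o_pos[p] / (o_pos[p] + o_neg[p]) - 0.5) * 2
            for p in set(o_pos) | set(o_neg)}

def score_sentence(elements, weights):
    """Score of a sentence as the sum of weights of patterns found in it (eq. 3)."""
    return sum(weights[p] for p in set(extract_patterns(elements)) if p in weights)
```

Since the number of generated patterns grows as 2^n − 1, such a sketch would in practice be limited to reasonably short sentences or to the most frequent patterns; a sentence can then be labeled as harmful when its score exceeds a chosen cut-off, which corresponds to the threshold span examined in the evaluation below.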
The weight can be further modified by:
• awarding pattern length k (LA), or
• awarding both pattern length and occurrence O (LO).

The list of frequent patterns can also be further modified by:
• discarding ambiguous patterns which appear the same number of times on both sides (harmful and non-harmful), later called "zero patterns" (0P), since their weight is equal to 0;
• discarding all ambiguous patterns, i.e., patterns appearing on both sides in any ratio.

We also compared the performance of sophisticated patterns (PAT) to more common n-grams (NGR).

4 Evaluation Experiment

Experiment Setup

In the evaluation we used the dataset created by [Matsuba et al. 2011], which was also used by [Ptaszynski et al. 2010] and recently by [Nitta et al. 2013]. It contains 1,490 harmful and 1,508 non-harmful entries collected from unofficial school Web sites and manually labeled by Internet Patrol members according to instructions included in the manual for dealing with cyberbullying [MEXT 2008].

The dataset was further preprocessed in three ways:
• Tokenization: all words, punctuation marks, etc. are separated by spaces (TOK).
• Parts of speech: words are replaced with their representative parts of speech (POS).
• Tokens with POS: both words and POS information are included in one element (TOK+POS).

We compared the performance for each kind of dataset preprocessing using 10-fold cross-validation and calculated the results using standard Precision, Recall and balanced F-score. There were several evaluation criteria. We checked which version of the algorithm achieves top scores within the threshold span. We also looked at the break-even point (BEP) of Precision and Recall, and checked the statistical significance of the results. Finally, we compared the performance to the baselines [Matsuba et al. 2011; Nitta et al. 2013].
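As a rough illustration of how the threshold span and the break-even point can be examined, the sketch below sweeps a cut-off over the sentence scores of one cross-validation fold; the helper names (precision_recall, break_even_point) are hypothetical and do not come from the original experimental code.

```python
def precision_recall(scores, labels, threshold):
    """Precision and Recall when entries with score >= threshold are
    classified as harmful; labels: 1 = harmful, 0 = non-harmful."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def break_even_point(scores, labels):
    """Sweep all observed scores as thresholds and return the
    (Precision, Recall, threshold) triple where the two are closest (BEP)."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if best is None or abs(p - r) < abs(best[0] - best[1]):
            best = (p, r, t)
    return best
```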
Results and Discussion

Although the highest occasional Precision (P=.93) was achieved by the POS feature set based on n-grams (NGR), its Recall and F-score were the lowest (R=.02, F=.78). High Precision with much higher Recall (P=.89, R=.34) was achieved by tokens with parts of speech based on either patterns or n-grams (TOK+POS/PAT|NGR). This feature set also achieved the highest overall F-score (F=.8). Tokenization with POS tagging also achieved the highest break-even point (BEP) (P=.79, R=.79). In most cases deleting ambiguous patterns yielded worse results, which suggests that such patterns, despite being ambiguous (appearing in both cyberbullying and non-cyberbullying entries), are in fact useful in practice.

Comparison with Previous Methods

In the comparison with previous methods we used those of [Matsuba et al. 2011] and [Nitta et al. 2013]. Moreover, since the latter extracts cyberbullying relevance values from the Web, we also repeated their experiment to find out how the performance of the Web-based method had changed during the two years since it was proposed. Also, to make the comparison fair, we used both our best and our worst settings. As the evaluation metric we used the area under the curve (AUC) of Precision and Recall (Fig. 1).

[Figure 1: Comparison between the proposed method (best and worst performance) and previous methods (Matsuba et al. 2011; Nitta et al. 2013; Nitta et al. repeated in 2015), plotted as Precision (%) against Recall (%).]

The highest overall results were obtained by the best settings of the proposed method (TOK+POS/PAT). Although the highest score was still achieved by [Nitta et al. 2013], their performance decreases quickly due to a rapid drop in Precision at higher thresholds. Moreover, when we repeated their experiment in January 2015, the results dropped greatly. This could have happened due to: (1) fluctuations in page rankings which pushed the relevant information lower, making it no longer extractable; (2) frequent deletion requests of harmful contents by Internet Patrol members; (3) tightening of usage and privacy policies by most Web service providers. This advocates more focus on corpus-based methods such as the one proposed in this paper.

5 Conclusions

In this paper we proposed a method for the automatic detection of cyberbullying – a recently noticed social problem influencing the mental health of Internet users. We applied a combinatorial algorithm to the automatic extraction of sentence patterns, and used those patterns in text classification of CB entries. The evaluation experiment performed on actual CB data showed that our method outperformed previous methods.

References

[Matsuba et al. 2011] T. Matsuba, F. Masui, A. Kawai, N. Isu. 2011. A study on the polarity classification model for the purpose of detecting harmful information on informal school sites (in Japanese). In Proceedings of NLP2011, pp. 388-391.

[MEXT 2008] Ministry of Education, Culture, Sports, Science and Technology (MEXT). 2008. "Bullying on the Net" Manual for handling and collection of cases (for schools and teachers) (in Japanese). Published by MEXT.

[Nitta et al. 2013] T. Nitta, F. Masui, M. Ptaszynski, Y. Kimura, R. Rzepka, K. Araki. 2013. Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization. In Proceedings of IJCNLP 2013, pp. 579-586.

[Ptaszynski et al. 2010] M. Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka, K. Araki, Y. Momouchi. 2010. In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis. IJCLR, Vol. 1, Issue 3, pp. 135-154.