Brute Force Works Best Against Bullying

Michal Ptaszynski, Fumito Masui
Department of Computer Science, Kitami Institute of Technology
{ptaszynski,f-masui}@cs.kitami-it.ac.jp

Yasutomo Kimura
Department of Information and Management Science, Otaru University of Commerce
kimura@res.otaru-uc.ac.jp

Rafal Rzepka, Kenji Araki
Graduate School of Information Science and Technology, Hokkaido University
{kabura,araki}@media.eng.hokudai.ac.jp

1 Introduction

The problem of harmful and offensive messages on the Internet has existed for many years. One of the reasons such activities evolved is the anonymity of communication on the Internet, which gives users the feeling that anything can go unpunished. Recently the problem has been officially defined and labeled as cyberbullying (CB)¹.

In Japan the problem has become serious enough to be noticed by the Ministry of Education [MEXT 2008]. In 2007 Japanese school personnel and members of Parent-Teacher Associations (PTA) started monitoring activities under the general name Internet Patrol (later: net-patrol) to spot Web sites containing such inappropriate content. However, the net-patrol is performed manually as volunteer work, and the countless amount of data on the Internet makes this an uphill task. This situation motivated us to ease the burden on the net-patrol members and create a net-patrol crawler that automatically spots cyberbullying entries on the Web and reports them to the appropriate organizations.

In the following sections we first present some of the previous research related to ours. Next, we describe our method and the dataset used in this research. In the method we apply a combinatorial approach to language modeling, resembling brute-force search algorithms, to extract sophisticated patterns from sentences. Next, we use these patterns in a text classification task. Finally, we explain the evaluation settings and analyze and discuss the results. Evaluation on actual cyberbullying data showed that our method outperformed previous ones while minimizing human effort.

¹ http://www.ncpc.org/cyberbullying

2 Previous Research

[Ptaszynski et al. 2010] performed affect analysis of a small dataset of cyberbullying entries and found that their distinctive features were vulgar words. They applied a lexicon of such words to train an SVM classifier. With a number of optimizations the system was able to detect cyberbullying with an 88.2% F-score. However, increasing the amount of data caused a decrease in results, which led them to conclude that SVMs are not ideal for dealing with the frequent language ambiguities typical of cyberbullying. Next, [Matsuba et al. 2011] proposed a method to automatically detect harmful entries by extending the SO-PMI-IR score to calculate the relevance of a document to harmful contents. With the use of a small number of seed words they were able to detect large numbers of candidates for harmful documents with an accuracy of 83%. Finally, [Nitta et al. 2013] proposed an improvement to Matsuba et al.'s method. They calculated the SO-PMI-IR score for three categories of seed words (abusive, violent, obscene) and selected the one with the highest relevance. Their method achieved 90% Precision at 10% Recall.

Most of the previous research assumed that using vulgar words as seeds will help detect cyberbullying. However, all of these studies note that vulgar words are only one kind of distinctive vocabulary and do not cover all cases. We assumed that such vocabulary can be extracted automatically. Moreover, we did not restrict the scope to single words, but extended the search to sophisticated patterns with disjoint elements. To achieve this we applied a pattern extraction method based on the idea of a brute-force search algorithm.

3 Method Description

We assumed that applying sophisticated patterns with disjoint elements should provide deeper insight than the usual bag-of-words or n-gram approach. Such patterns, if defined as ordered combinations of sentence elements, can be extracted automatically. Algorithms using a combinatorial approach usually generate a massive number of combinations, i.e., potential answers to a given problem, and are therefore often called brute-force search algorithms. We assumed that optimizing the combinatorial algorithm to the requirements of the problem should make it advantageous in language processing tasks.

In the proposed method, firstly, ordered non-repeated combinations are generated from all elements of a sentence. For every n-element sentence there are combination clusters of size k, such that 1 ≤ k ≤ n, where each cluster contains all k-element combinations of the n elements. In this procedure all combinations for all values of k are generated, and the number of all combinations is equal to the sum of all k-element combination clusters (eq. 1).

    \sum_{k=1}^{n} \binom{n}{k} = \frac{n!}{1!(n-1)!} + \frac{n!}{2!(n-2)!} + \dots + \frac{n!}{n!(n-n)!} = 2^{n} - 1    (1)

Next, all non-subsequent elements within a combination are separated with an asterisk ("*"). The occurrences O_{pos} and O_{neg} of each pattern on the harmful and non-harmful sides of the dataset are used to calculate its normalized weight w_j (eq. 2). The score of a sentence is calculated as the sum of the weights of all patterns found in the sentence (eq. 3).

    w_j = \left( \frac{O_{pos}}{O_{pos} + O_{neg}} - 0.5 \right) \cdot 2    (2)

    score = \sum_{j} w_j, \quad (1 \geq w_j \geq -1)    (3)
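To make the procedure concrete, the following is a minimal Python sketch of the pattern generation (eq. 1), weighting (eq. 2) and scoring (eq. 3) steps described above. The function names (extract_patterns, pattern_weights, score_sentence) and the representation of sentences as lists of elements are our own illustrative assumptions, not the original implementation.

```python
from itertools import combinations
from collections import Counter

def extract_patterns(elements):
    """Generate all ordered, non-repeated combinations of sentence elements
    (2^n - 1 patterns for an n-element sentence, cf. eq. 1), inserting an
    asterisk ("*") between elements that are not adjacent in the sentence."""
    patterns = []
    n = len(elements)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            parts = [elements[idx[0]]]
            for prev, cur in zip(idx, idx[1:]):
                if cur - prev > 1:  # non-subsequent elements are separated by "*"
                    parts.append("*")
                parts.append(elements[cur])
            patterns.append(" ".join(parts))
    return patterns

def pattern_weights(harmful_sentences, non_harmful_sentences):
    """Normalized weight w_j of each pattern (eq. 2), computed from its
    occurrences on the harmful (pos) and non-harmful (neg) sides of the data."""
    o_pos = Counter(p for s in harmful_sentences for p in extract_patterns(s))
    o_neg = Counter(p for s in non_harmful_sentences for p in extract_patterns(s))
    return {p: (o_pos[p] / (o_pos[p] + o_neg[p]) - 0.5) * 2
            for p in set(o_pos) | set(o_neg)}

def score_sentence(elements, weights):
    """Score of a sentence as the sum of weights of patterns found in it (eq. 3)."""
    return sum(weights[p] for p in set(extract_patterns(elements)) if p in weights)
```

Since the number of generated patterns grows as 2^n − 1, such a sketch would in practice be limited to reasonably short sentences or to the most frequent patterns; a sentence can then be labeled as harmful when its score exceeds a chosen cut-off, which corresponds to the threshold span examined in the evaluation below.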
The weight can be further modified by:
• awarding pattern length k (LA), or
• awarding both pattern length and occurrence O (LO).

The list of frequent patterns can also be further modified by:
• discarding ambiguous patterns which appear the same number of times on both sides (harmful and non-harmful), later called "zero patterns" (0P), since their weight is equal to 0;
• discarding all ambiguous patterns, i.e., patterns appearing on both sides in any ratio.

We also compared the performance of sophisticated patterns (PAT) to more common n-grams (NGR).

4 Evaluation Experiment

Experiment Setup

In the evaluation we used the dataset created by [Matsuba et al. 2011], which was also used by [Ptaszynski et al. 2010] and recently by [Nitta et al. 2013]. It contains 1,490 harmful and 1,508 non-harmful entries collected from unofficial school Web sites and manually labeled by Internet Patrol members according to instructions included in the manual for dealing with cyberbullying [MEXT 2008].

The dataset was further preprocessed in three ways:
• Tokenization: all words, punctuation marks, etc. are separated by spaces (TOK).
• Parts of speech: words are replaced with their representative parts of speech (POS).
• Tokens with POS: both words and POS information are included in one element (TOK+POS).

We compared the performance for each kind of dataset preprocessing using 10-fold cross-validation and calculated the results using standard Precision, Recall and balanced F-score. There were several evaluation criteria. We checked which version of the algorithm achieves top scores within the threshold span. We also looked at the break-even point (BEP) of Precision and Recall, and checked the statistical significance of the results. Finally, we compared the performance to the baselines [Matsuba et al. 2011; Nitta et al. 2013].
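As a rough illustration of how the threshold span and the break-even point can be examined, the sketch below sweeps a cut-off over the sentence scores of one cross-validation fold; the helper names (precision_recall, break_even_point) are hypothetical and do not come from the original experimental code.

```python
def precision_recall(scores, labels, threshold):
    """Precision and Recall when entries with score >= threshold are
    classified as harmful; labels: 1 = harmful, 0 = non-harmful."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def break_even_point(scores, labels):
    """Sweep all observed scores as thresholds and return the
    (Precision, Recall, threshold) triple where the two are closest (BEP)."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if best is None or abs(p - r) < abs(best[0] - best[1]):
            best = (p, r, t)
    return best
```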
Results and Discussion

Although the highest occasional Precision (P=.93) was achieved by the POS feature set based on n-grams (NGR), its Recall and F-score were the lowest (R=.02, F=.78). High Precision with much higher Recall (P=.89, R=.34) was achieved by tokens with parts of speech based on either patterns or n-grams (TOK+POS/PAT|NGR). This feature set also achieved the highest overall F-score (F=.8). Tokenization with POS tagging also achieved the highest break-even point (BEP) (P=.79, R=.79). In most cases deleting ambiguous patterns yielded worse results, which suggests that such patterns, despite being ambiguous (appearing in both cyberbullying and non-cyberbullying entries), are in fact useful in practice.

Comparison with Previous Methods

In the comparison with previous methods we used those of [Matsuba et al. 2011] and [Nitta et al. 2013]. Moreover, since the latter extracts cyberbullying relevance values from the Web, we also repeated their experiment to find out how the performance of the Web-based method had changed during the two years since it was proposed. Also, to make the comparison fair, we used both our best and our worst settings. As the evaluation metric we used the area under the curve (AUC) of Precision and Recall (Fig. 1).

[Figure 1: Comparison between the proposed method (best and worst performance) and previous methods (Matsuba et al. 2011; Nitta et al. 2013; Nitta et al. repeated in 2015), plotted as Precision (%) against Recall (%).]

The highest overall results were obtained by the best settings of the proposed method (TOK+POS/PAT). Although the highest score was still achieved by [Nitta et al. 2013], their performance decreases quickly due to a rapid drop in Precision at higher thresholds. Moreover, when we repeated their experiment in January 2015, the results dropped greatly. This could have happened due to: (1) fluctuations in page rankings which pushed the relevant information lower, making it no longer extractable; (2) frequent deletion requests of harmful contents by Internet Patrol members; (3) tightening of usage and privacy policies by most Web service providers. This advocates more focus on corpus-based methods such as the one proposed in this paper.

5 Conclusions

In this paper we proposed a method for the automatic detection of cyberbullying – a recently noticed social problem influencing the mental health of Internet users. We applied a combinatorial algorithm to the automatic extraction of sentence patterns, and used those patterns in text classification of CB entries. The evaluation experiment performed on actual CB data showed that our method outperformed previous methods.

References

[Matsuba et al. 2011] T. Matsuba, F. Masui, A. Kawai, N. Isu. 2011. A study on the polarity classification model for the purpose of detecting harmful information on informal school sites (in Japanese). In Proceedings of NLP2011, pp. 388-391.

[MEXT 2008] Ministry of Education, Culture, Sports, Science and Technology (MEXT). 2008. "Bullying on the Net" Manual for handling and collection of cases (for schools and teachers) (in Japanese). Published by MEXT.

[Nitta et al. 2013] T. Nitta, F. Masui, M. Ptaszynski, Y. Kimura, R. Rzepka, K. Araki. 2013. Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization. In Proceedings of IJCNLP 2013, pp. 579-586.

[Ptaszynski et al. 2010] M. Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka, K. Araki, Y. Momouchi. 2010. In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis. IJCLR, Vol. 1, Issue 3, pp. 135-154.