
Michal Ptaszynski (0), Fumito Masui (0)

{ptaszynski,f-masui}@cs.kitami-it.ac.jp

Yasutomo Kimura (1)

kimura@res.otaru-uc.ac.jp

Rafal Rzepka (2), Kenji Araki (2)

{kabura,araki}@media.eng.hokudai.ac.jp

0 Department of Computer Science, Kitami Institute of Technology

1 Department of Information and Management Science, Otaru University of Commerce

2 Graduate School of Information Science and Technology, Hokkaido University

2011

[Ptaszynski et al. 2010] performed affect analysis of a small dataset of cyberbullying entries and found that their distinctive features were vulgar words. They applied a lexicon of such words to train an SVM classifier. With a number of optimizations the system was able to detect cyberbullying with an F-score of 88.2%. However, increasing the amount of data caused a decrease in results, which led them to conclude that SVMs are not ideal for dealing with the frequent language ambiguities typical of cyberbullying. Next, [Matsuba et al. 2011] proposed a method to automatically detect harmful entries by extending the SO-PMI-IR score to calculate the relevance of a document to harmful contents. Using a small number of seed words they were able to detect large numbers of candidates for harmful documents with an accuracy of 83%. Finally, [Nitta et al. 2013] proposed an improvement to Matsuba et al.'s method. They calculated the SO-PMI-IR score for three categories of seed words (abusive, violent, obscene) and selected the category with the highest relevance. Their method achieved 90% Precision at 10% Recall.

Most of the previous research assumed that using vulgar words as seeds would help detect cyberbullying. However, all of these studies note that vulgar words are only one kind of distinctive vocabulary and do not cover all cases. We assumed that such vocabulary can be extracted automatically. Moreover, we did not restrict the scope to words, but extended the search to sophisticated patterns with disjoint elements. To achieve this we applied a pattern extraction method based on the idea of a brute-force search algorithm.

Method Description

We assumed that applying sophisticated patterns with disjoint elements should provide deeper insight than the usual bag-of-words or n-gram approach. Such patterns, if defined as ordered combinations of sentence elements, could be extracted automatically. Algorithms using a combinatorial approach usually generate a massive number of combinations, each a potential answer to a given problem. Thus they are often called brute-force search algorithms. We assumed that optimizing the combinatorial algorithm to the problem requirements should make it advantageous in language processing tasks.

In the proposed method, firstly, ordered non-repeated combinations are generated from all elements of a sentence. In every n-element sentence there is one combination cluster for each k such that $1 \le k \le n$, where the k-th cluster contains all k-element combinations drawn from the n elements. In this procedure all combinations for all values of k are generated. The number of all combinations is equal to the sum of the sizes of all k-element combination clusters (see eq. 1).
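The enumeration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ordered_combinations` and the example sentence are hypothetical, and Python's standard `itertools.combinations` is used because it preserves the original order of the elements.

```python
from itertools import combinations


def ordered_combinations(elements):
    """Yield every ordered, non-repeated k-element combination for 1 <= k <= n."""
    n = len(elements)
    for k in range(1, n + 1):
        # combinations() keeps the elements in their original sentence order
        yield from combinations(elements, k)


tokens = ["you", "are", "so", "stupid"]  # hypothetical 4-element sentence
combos = list(ordered_combinations(tokens))

# The sum of binomial coefficients C(n, k) for k = 1..n equals 2^n - 1
assert len(combos) == 2 ** len(tokens) - 1
```

Because the count grows exponentially with sentence length, a sketch like this is only practical for short sentences unless the search is pruned.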

$$\sum_{k=1}^{n} \binom{n}{k} = \frac{n!}{1!(n-1)!} + \frac{n!}{2!(n-2)!} + \cdots + \frac{n!}{n!(n-n)!} = 2^n - 1 \qquad (1)$$

Next, all non-subsequent elements are separated with an asterisk (“*”). The pattern occurrences O on each side of the dataset are used to calculate the normalized weight $w_j$ of each pattern (eq. 2). The score of a sentence is calculated as the sum of the weights of the patterns found in the sentence (eq. 3). The weight can be further modified by: awarding pattern length k (LA), or awarding both length and occurrence O (LO).
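As a rough illustration of these steps, the sketch below extracts patterns with asterisks marking gaps, then weights and scores them. The weight formula is an assumption, since eq. 2 is not reproduced in this text: the share of harmful occurrences is rescaled to the range [-1, 1]. Counting each pattern once per sentence is also an assumption, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_patterns(tokens):
    """Return all patterns of a tokenized sentence; gaps are marked with '*'."""
    patterns = set()
    n = len(tokens)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            parts = [tokens[idx[0]]]
            for prev, cur in zip(idx, idx[1:]):
                if cur - prev > 1:  # the two elements are not adjacent
                    parts.append("*")
                parts.append(tokens[cur])
            patterns.add(" ".join(parts))
    return patterns


def pattern_weights(harmful_sents, nonharmful_sents):
    """ASSUMED weighting: (O_harm / (O_harm + O_non) - 0.5) * 2, in [-1, 1]."""
    occ_h = Counter(p for s in harmful_sents for p in extract_patterns(s))
    occ_n = Counter(p for s in nonharmful_sents for p in extract_patterns(s))
    return {
        p: (occ_h[p] / (occ_h[p] + occ_n[p]) - 0.5) * 2
        for p in occ_h.keys() | occ_n.keys()
    }


def score(tokens, weights):
    """Sum the weights of all known patterns found in the sentence."""
    return sum(weights.get(p, 0.0) for p in extract_patterns(tokens))


# Hypothetical toy data, one sentence per side
weights = pattern_weights([["you", "are", "stupid"]], [["you", "are", "kind"]])
result = score(["you", "are", "so", "stupid"], weights)
```

Under this assumed weighting, a pattern seen only in harmful entries receives 1, a pattern seen only in non-harmful entries receives -1, and a pattern seen equally often on both sides receives 0.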

The list of frequent patterns can also be further modified by:

discarding ambiguous patterns which appear the same number of times on both sides (harmful and non-harmful), hereafter “zero patterns” (0P), as their weight is equal to 0;

discarding ambiguous patterns of any ratio on both sides.

We also compared the performance of sophisticated patterns (PAT) to more common n-grams (NGR).
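The two filtering options and the n-gram alternative can be sketched as follows. This is only an illustration under stated assumptions: the occurrence tables are plain dictionaries mapping a pattern to its count on one side of the dataset, and the function names are hypothetical.

```python
def drop_zero_patterns(occ_harm, occ_non):
    """0P: discard patterns occurring equally often on both sides (weight 0)."""
    keys = set(occ_harm) | set(occ_non)
    return {p for p in keys if occ_harm.get(p, 0) != occ_non.get(p, 0)}


def drop_ambiguous_patterns(occ_harm, occ_non):
    """Discard every pattern that appears on both sides, whatever the ratio."""
    keys = set(occ_harm) | set(occ_non)
    return {
        p for p in keys
        if not (occ_harm.get(p, 0) > 0 and occ_non.get(p, 0) > 0)
    }


def ngrams(tokens):
    """NGR: contiguous sub-sequences only, i.e. patterns without '*'."""
    n = len(tokens)
    return {" ".join(tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)}


# Hypothetical occurrence counts
occ_harm = {"you": 3, "stupid": 2, "are": 1}
occ_non = {"you": 3, "are": 2, "kind": 1}
kept_0p = drop_zero_patterns(occ_harm, occ_non)
kept_unambiguous = drop_ambiguous_patterns(occ_harm, occ_non)
```

In this toy example the first filter keeps "are" because its counts differ between the two sides, while the second filter removes it because it occurs on both.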

Evaluation Experiment

Experiment Setup

In the evaluation we used a dataset created by [Matsuba et al. 2011]. The dataset was also used by [Ptaszynski et al. 2010] and recently by [Nitta et al. 2013]. It contains 1,490 harmful and 1,508 non-harmful entries collected from unofficial school Web sites and manually labeled by Internet Patrol members according to the instructions included in the manual for dealing with cyberbullying [MEXT 2008].

The dataset was further preprocessed in three ways:

Tokenization: All words, punctuation marks, etc. are separated by spaces (TOK).

Parts of speech: Words are replaced with their representative parts of speech (POS).

Tokens with POS: Both words and POS information are included in one element (TOK+POS).
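The three preprocessing variants can be sketched as below. The input is assumed to be a list of (word, part-of-speech) pairs produced by any morphological analyzer. The `word/POS` element format, the tag names, and the function name `preprocess` are illustrative assumptions, not the format used in the experiment.

```python
def preprocess(tagged, mode):
    """Turn (word, pos) pairs into sentence elements for one preprocessing mode."""
    if mode == "TOK":
        return [word for word, _ in tagged]
    if mode == "POS":
        return [pos for _, pos in tagged]
    if mode == "TOK+POS":
        # ASSUMED element format: word and POS joined into a single element
        return [f"{word}/{pos}" for word, pos in tagged]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical tagged sentence
tagged = [("you", "PRON"), ("are", "VERB"), ("stupid", "ADJ")]
tok = preprocess(tagged, "TOK")
pos = preprocess(tagged, "POS")
tok_pos = preprocess(tagged, "TOK+POS")
```

Each variant yields a sequence of the same length, so the same pattern extraction procedure can be applied to any of them.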

We compared the performance for each kind of dataset preprocessing using 10-fold cross-validation and calculated the results using standard Precision, Recall and balanced F-score. There were several evaluation criteria. We checked which version of the algorithm achieved top scores within the threshold span. We also looked at break-even points (BEP) of Precision and Recall and checked the statistical significance of the results. Finally, we compared the performance to the baselines [Matsuba et al. 2011; Nitta et al. 2013].
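The evaluation measures can be illustrated with a minimal sketch. It assumes that an entry is classified as harmful when its score exceeds a threshold, and that the break-even point is taken at the threshold where Precision and Recall are closest. Both are assumptions about details not spelled out here, and the function names and toy data are hypothetical.

```python
def precision_recall_f(scores, labels, threshold):
    """labels: True for harmful. An entry is predicted harmful if score > threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y)
    fp = sum(1 for s, y in pairs if s > threshold and not y)
    fn = sum(1 for s, y in pairs if s <= threshold and y)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # balanced F-score
    return p, r, f


def break_even_point(scores, labels, thresholds):
    """Return (threshold, P, R, F) where |P - R| is smallest, or None."""
    rows = [(t, *precision_recall_f(scores, labels, t)) for t in thresholds]
    # ignore thresholds at which nothing harmful is retrieved (P = R = 0)
    rows = [row for row in rows if row[1] > 0 and row[2] > 0]
    return min(rows, key=lambda row: abs(row[1] - row[2]), default=None)


# Hypothetical sentence scores and gold labels
scores = [0.9, 0.6, 0.2, -0.1, -0.5, -0.8]
labels = [True, True, False, True, False, False]
p, r, f = precision_recall_f(scores, labels, 0.5)
bep = break_even_point(scores, labels, [-0.3, 0.0, 0.5])
```

Sweeping the threshold across the whole score range in this way gives one (Precision, Recall) pair per threshold, which is the basis for comparing settings within the threshold span.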

Results and Discussion

Although the highest occasional Precision (P=.93) was achieved by the POS feature set based on n-grams (NGR), its Recall and F-score were the lowest (R=.02, F=.78). High Precision with much higher Recall (P=.89, R=.34) was achieved by tokens with parts of speech based on either patterns or n-grams (TOK+POS/PAT|NGR). This feature set also achieved the highest overall F-score (F=.8). Tokenization with POS tagging also achieved the highest break-even point (BEP) (P=.79, R=.79). In most cases deleting ambiguous patterns yielded worse results, which suggests that such patterns, despite appearing in both cyberbullying and non-cyberbullying entries, are in fact useful in practice.

Comparison with Previous Methods

For the comparison with previous methods we used those of [Matsuba et al. 2011] and [Nitta et al. 2013]. Moreover, since the latter extracts cyberbullying relevance values from the Web, we also repeated their experiment to find out how the performance of the Web-based method had changed in the two years since it was proposed. Also, to make the comparison fair, we used both our best and worst settings. As the evaluation metric we used the area under the curve (AUC) of Precision and Recall (Fig. 1). The highest overall results were obtained by the best settings of the proposed method (TOK+POS/PAT). Although the highest single score was still achieved by [Nitta et al. 2013], their performance decreases quickly due to a rapid drop in Precision for higher thresholds. Moreover, when we repeated their experiment in January 2015, the results dropped greatly. This could have happened due to: (1) fluctuation in page rankings, which pushed the information lower and made it no longer extractable; (2) frequent deletion requests for harmful contents by Internet Patrol members; (3) tightening of usage and privacy policies by most Web service providers. This argues for more focus on corpus-based methods such as the one proposed in this paper.

[Fig. 1: Precision and Recall curves; axis label: RECALL (%), 0 to 100.]
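One common way to compute the area under a Precision-Recall curve is the trapezoidal rule over the measured points. The sketch below shows this under the assumption that one (Recall, Precision) pair is available per threshold. Nothing here states how the AUC was actually computed, so this is an illustration only, with a hypothetical function name and made-up points.

```python
def pr_auc(points):
    """Trapezoidal area under a curve given as (recall, precision) pairs."""
    pts = sorted(points)  # order the points by increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        # width of the recall interval times the mean precision over it
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Hypothetical curve: precision falls as recall grows
curve = [(0.0, 1.0), (0.5, 0.8), (1.0, 0.5)]
auc = pr_auc(curve)
```

A method whose Precision collapses quickly as Recall grows covers little area even if its single best point is high, which is why an area-based measure suits this comparison.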

Conclusions

In this paper we proposed a method for automatic detection of cyberbullying, a recently noticed social problem affecting the mental health of Internet users.

We applied a combinatorial algorithm to the automatic extraction of sentence patterns, and used those patterns in text classification of cyberbullying (CB) entries. The evaluation experiment performed on actual CB data showed that our method outperformed previous methods.

[Matsuba et al. 2011] Matsuba, Masui, Kawai, Isu. 2011. A study on the polarity classification model for the purpose of detecting harmful information on informal school sites (in Japanese). In Proceedings of NLP2011, pp. 388-391.

[MEXT 2008] Ministry of Education, Culture, Sports, Science and Technology (MEXT). 2008. “Bullying on the Net” Manual for handling and collection of cases (for schools and teachers) (in Japanese). Published by MEXT.

[Nitta et al. 2013] Nitta, Masui, Ptaszynski, Kimura, Rzepka, Araki. 2013. Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization. In Proceedings of IJCNLP 2013, pp. 579-586.

[Ptaszynski et al. 2010] M. Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka, K. Araki, Y. Momouchi. 2010. In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis. IJCLR, Vol. 1, Issue 3, pp. 135-154.