Learning Deep on Cyberbullying is Always Better Than Brute Force

Michal Ptaszynski†, Juuso Kalevi Kristian Eronen‡† and Fumito Masui†
†Kitami Institute of Technology, Kitami, Japan
‡Tampere University of Technology, Tampere, Finland
{ptaszynski,f-masui}@cs.kitami-it.ac.jp, juuso.eronen@student.tut.fi

Abstract

In this paper we present our research on the detection of cyberbullying (CB), which stands for humiliating other people through the Internet. CB has become recognized as a social problem, and its mostly juvenile victims usually fall into depression, self-mutilate, or even commit suicide. To deal with the problem, school personnel perform Internet Patrol (IP) by reading through the available Web contents to spot harmful entries. It is crucial to help IP members detect malicious contents more efficiently. A number of studies have tackled the problem in recent years. However, due to the complexity of the language used in cyberbullying, the results have remained only mildly satisfying. We propose a novel method for automatic cyberbullying detection based on Convolutional Neural Networks and increased Feature Density. The experiments performed on actual cyberbullying data showed a major advantage of our approach over all previous methods, including the best performing method so far, based on a Brute-Force Search algorithm.

1 Introduction

Recent years brought to light the problem of cyberbullying (CB), defined as the exploitation of open online means of communication, such as Internet forum boards or social network services (SNS), to convey harmful and disturbing information about private individuals, often children and students [Patchin and Hinduja, 2006]. The problem was further exacerbated by the popularization of smartphones and tablet computers, which allow nearly constant use of SNS at home, at work/school or in motion [Bull, 2010].

Cyberbullying messages commonly ridicule someone's personality or appearance, or spread rumors. They can lead victims to self-mutilation, even suicide, or, on the opposite, to attacking their offenders in revenge [Hinduja and Patchin, 2010]. The global increase of cyberbullying cases opened a public debate on whether such messages could be spotted earlier to prevent the tragedies, and on the freedom of speech on the Internet in general.

In some countries, such as Japan, the problem has become serious enough to be noticed on a ministerial level [MEXT, 2008]. As one of the ways to deal with the problem, school personnel have started Internet Patrol (IP) to detect Web forum sites and SNS containing cyberbullying contents. Unfortunately, as IP is performed manually, reading through countless amounts of Websites makes it an uphill struggle.

Some research has started developing methods for the automatic detection of CB to help in this struggle [Ptaszynski et al., 2010; Dinakar et al., 2012; Ptaszynski et al., 2016]. Unfortunately, even with multiple improvements, the results have remained only partially satisfying. This is caused by a multitude of language ambiguities and styles used in CB.

In this paper we propose a novel Convolutional Neural Network (CNN) approach to automatic cyberbullying detection. Moreover, based on the analysis of the characteristics of CNN and the initial results, we propose an optimization of CNN by increasing the Feature Density of the training data.

The rest of the paper is organized in the following way. Firstly, we describe the problem of cyberbullying and present some of the previous research related to ours. Next, we describe the proposed method and other methods used for comparison. Further, we present the dataset used in this research and explain the evaluation settings, followed by an analysis of the experiment results and a discussion.

2 Research Background

2.1 Cyberbullying: Description of a Problem

The choice of media used in communication can cause increased psychological distance between interlocutors [Rutter, 1987], which can lead to an empathy deficit, especially in Internet behavior [Zheng, 2012]. This is one of the reasons offensive messages have existed for many years on the Internet. With the increase of our dependence on technology in everyday lives, the problem gained in seriousness and conceptualized itself in the form of online harassment, or cyberbullying (CB) [Patchin and Hinduja, 2006; Hinduja and Patchin, 2010; Pyżalski, 2012; Lazuras et al., 2012].

Some of the first research on CB, based on numerous surveys [Patchin and Hinduja, 2006], revealed that such harmful information may include threats, sexual remarks, pejorative labels, or false statements aimed at humiliating others. When posted on a social network, such as Facebook or Twitter, it could disclose humiliating private information about the victim, defaming and ridiculing them publicly. Some re-

Table 1: Summary of previous research in CB detection.
Authors, Year | Language of processing | Feature Extraction Method | Classification Method
[Ptaszynski et al., 2010] | Japanese | unigrams (BoW), harmful word lexicon | SVM
[Sood et al., 2012] | English | unigrams, bigrams, stems (presence, freq, tf-idf) | SVM, various weighting
[Dinakar et al., 2012] | English | unigrams, hand-crafted word- | SVM, JRip, Naı̈ve Bayes,

applied a lexicon of such words to train an SVM classifier. With a number of optimizations they were able to detect cyberbullying with 88.2% of F-score. However, increasing the data caused a decrease in results, which made them abandon SVM as not ideal for the language ambiguities typical for CB.

[Sood et al., 2012] focused on detection of personal insults, the negative influence of which could at most cause the Internet community to fall into recession.
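The lexicon-based feature extraction used in that line of work can be illustrated with a minimal sketch (the word list below is hypothetical, not the actual lexicon from any of the cited studies):

```python
# Minimal sketch of lexicon-based features for harmfulness scoring.
# HARMFUL_LEXICON is a hypothetical toy list, not the real lexicon.
HARMFUL_LEXICON = {"ugly", "die", "stupid"}

def lexicon_features(tokens):
    """Return (harmful-word count, binary presence flag) for one entry."""
    hits = sum(1 for t in tokens if t.lower() in HARMFUL_LEXICON)
    return hits, int(hits > 0)

print(lexicon_features("you are so ugly please die".split()))  # -> (2, 1)
```

In the studies above, such counts were combined with unigram (bag-of-words) features and fed to a classifier such as an SVM.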
In their research they used as features single words and bigrams, weighted them using either presence (1/0), term frequency or tf-idf, and used them to train an SVM classifier. As a dataset they used a corpus of six thousand entries they collected from various online fora. To prepare a gold standard for their experiments they used a crowd-sourcing approach, with untrained layperson annotators hired for a classification task through Mechanical Turk.

lists (Ortony lexicon of negative words, profane words, etc.), POS | J48
[Nitta et al., 2013] | Japanese | seed words in 3 categories | SO-PMI-IR maximized for category
[Cano Basave et al., 2013] | English | violence-related words (derived from violence-related topics from Twitter and DBPedia) | VDM (weakly supervised Bayesian model)
[Sarna and Bhatia, 2015] | English (?) | "bad words", positive and negative sentiment words, pronouns, proper nouns, links | Naı̈ve Bayes, kNN, Decision Trees, SVM
[Ptaszynski et al., 2015a] | Japanese | Brute-Force search algorithm | custom pattern matching classifier
[Ptaszynski et al., 2015b] | Japanese | Brute-Force search algorithm with various features (tokens, POS, lemmas, etc.) | custom pattern matching classifier
[Ptaszynski et al., 2016] | Japanese | seed words grouped in 3 categories | SO-PMI-IR maximized for category with seed word optimization

Later, [Dinakar et al., 2012] proposed their approach to detection and mitigation of cyberbullying. An improvement of this paper in comparison to previous research was its wider perspective: they did not only focus on detection, but also proposed some ways of mitigation. The classifiers they used scored up to 58-77% of F-score, depending on the kind of detected harassment. Their best proposed classifier was SVM, which confirmed the considerably high effectiveness of SVM for cyberbullying in English, similarly to the research done by Ptaszynski et al. for Japanese in 2010.

An interesting work was done by [Kontostathis et al., 2013], who performed a thorough analysis of cyberbullying entries on Formspring.me. They were able to identify common cyberbullying terms, and applied them in classification with the use of a machine learning method based on Essential Dimensions of Latent Semantic Indexing (EDLSI).

[Cano Basave et al., 2013] proposed the Violence Detection Model (VDM), a weakly supervised Bayesian model. They did not, however, focus strictly on cyberbullying, but widened their scope to the more generally understood "violence," which made the problem more understandable, and thus feasible, for untrained annotators. The datasets were extracted from violence-related topics on Twitter and DBPedia.

[Nitta et al., 2013] proposed a method to automatically detect harmful entries with an extended SO-PMI-IR score [Turney, 2002] to calculate the relevance of a document with harmful contents. They also grouped the seed words into three categories (abusive, violent, obscene) and maximized the relevance of the categories. Their method was evaluated comparatively high, with the best achieved Precision around 91% (although with Recall less than 10%).

Unfortunately, a re-evaluation of their method done by [Ptaszynski et al., 2016] two years later indicated that the method lost most of its Precision (over a 30-percentage-point drop) in that time. They hypothesized that this was caused by external factors such as Web page re-ranking, changes in SNS user policies, etc. They improved the method by automatically acquiring and filtering new harmful seed words, with some success (P=76%). Unfortunately, they were unable to revive the method to its original performance.

[Sarna and Bhatia, 2015] based their method on a set of features like "bad words", positive/negative sentiment words, and other common features like pronouns, etc., to estimate user credibility. They applied those features to four standard classifiers (Naı̈ve Bayes, kNN, Decision Trees, SVM). The results of the classification were further used in a User Behavior Analysis (BAU) model and a User Credibility Analysis (CAU) model. Unfortunately, although their approach suggested the inclusion of phenomena such as irony or rumors, in practice they only focused on messages containing "bad words." Moreover, neither these words, the dataset, nor its annotation schema were sufficiently described in the paper.

ported that CB happens for up to eight percent of children in schools in: Australia [Cross et al., 2009], the United States [Kowalski and Limber, 2007], or Finland [Sourander et al., 2010]. Studies on CB across Europe indicate that even one in five young people (not limited to the school environment) could be exposed to cyberbullying [Hasebrink et al., 2008; Pyżalski, 2012]. As of 2015, the urgent need to deal with CB has even made insurance companies offer policies covering costs that could occur as a result of cyberbullying1.

In Japan, after several suicide cases of CB victims, the Ministry of Education, Culture, Sports, Science and Technology (MEXT) increased the priority of the problem, provided a yearly updated manual for handling CB cases and incorporated it in the school staff education program [MEXT, 2008].

To actively deal with the problem, school staff are engaged in Internet Patrol (IP). Based on the MEXT definition of CB, they read through all Internet contents, and when they find a harmful entry they send a deletion request to the Web page administrator and report the event to the police. Unfortunately, since IP has been performed manually as voluntary work, and the amount of Internet fora and SNS to read through grows exponentially, manual Web surveillance has been an uphill task and a psychological burden for the IP members.

1 http://news.na.chubb.com/2016-03-30-Cyber-Bullying-Insurance-Now-Available-to-Chubbs-U-S-Homeowners-Customers-2 [accessed on 2017/02/19]

2.2 Previous Research on Cyberbullying Detection

Although the problem of CB has been studied in social sciences and child psychology for over ten years [Patchin and Hinduja, 2006; Pyżalski, 2012], only a few attempts were made so far to detect and study the problem with the help of information technology. Below we present the research most relevant to this day (also summarized in Table 1). As the first recorded study, [Ptaszynski et al., 2010] performed affect analysis on a small dataset of CB entries to find out that distinctive features for CB were vulgar words. They

alyzer and CaboCha3, a Japanese dependency structure analyzer, to preprocess the dataset in the following ways4:

• Tokenization: All words, punctuation marks, etc. are separated by spaces (later: TOK).
• Lemmatization: Like the above, but the words are represented in their generic (dictionary) forms, or "lemmas" (later: LEM).
• Parts of speech: Words are replaced with their representative parts of speech (later: POS).
• Tokens with POS: Both words and POS information are included in one element (later: TOK+POS).
• Lemmas with POS: Like the above, but with lemmas instead of words (later: LEM+POS).
• Tokens with Named Entity Recognition: Words encoded together with information on what named entities (private name of a person, organization, numericals, etc.) appear in the sentence. The NER information is annotated by CaboCha (later: TOK+NER).

Finally, [Ptaszynski et al., 2015a] proposed a method of pattern-based language modeling. The patterns, defined as ordered combinations of sentence elements, were extracted with a Brute-Force search algorithm and used in classification. They reported encouraging initial results, and further improved the method by applying multiple data preprocessing techniques [Ptaszynski et al., 2015b].
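The combinatorial pattern extraction underlying this approach can be sketched as follows (a toy illustration of ordered combinations with possible gaps between elements; this is an illustration of the general idea, not the actual implementation):

```python
from itertools import combinations

def extract_patterns(elements, max_len=3):
    """All ordered combinations (subsequences) of sentence elements up to
    max_len; gaps between chosen elements are allowed (disjoint patterns)."""
    patterns = set()
    for n in range(1, max_len + 1):
        for combo in combinations(range(len(elements)), n):
            patterns.add(tuple(elements[i] for i in combo))
    return patterns

pats = extract_patterns(["kyō", "wa", "nante", "kimochiii"], max_len=2)
print(len(pats))  # 4 unigrams + C(4,2) = 6 ordered pairs -> 10
```

For a sentence of n elements this generates on the order of 2^n candidate patterns, which illustrates why such brute-force extraction is costly in the training phase.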
At present their method is considered the best performing one; thus we will use it in comparison with the method proposed herein.

2.3 Research Gaps

Dataset preparation: Some of the above-mentioned methods suffer from subjective data preparation. In [Cano Basave et al., 2013] or [Dinakar et al., 2012], the problem was not defined strictly enough and was annotated by laypeople, while CB is a complex social phenomenon and needs to be handled by experts. [Sood et al., 2012; Cano Basave et al., 2013] reformulated the problem to be feasible for laypeople. [Dinakar et al., 2012] focused on overlapping concepts like sexual or racial harassment. Finally, [Sarna and Bhatia, 2015] collected the datasets with no specific standard.

Feature selection: Previous research included as features mostly words, or simple n-grams (bigrams). Some [Nitta et al., 2013] applied only a small number of features, while others [Dinakar et al., 2012] built up more complex models, however still based mostly on words. Moreover, using only top-down selected features [Nitta et al., 2013; Sarna and Bhatia, 2015], while somewhat reasonable (e.g., violent or obscene words), requires human workload and background knowledge of the dataset, thus being inefficient.

Classification methods: Although various classifiers have been tested (SVM, Naive Bayes, or Decision Trees), usually SVM reached the highest, though only mildly satisfying, scores. To overcome the performance of previous methods, we apply Convolutional Neural Networks in classification and optimize them by studying the correlation of the results with Feature Density.

• Lemmas with NER: Like the above, but with lemmas (later: LEM+NER).
• Chunking: Larger sub-parts of sentences separated syntactically, such as noun phrases, verb phrases, predicates, etc., but without dependency relations (later: CHNK).
• Dependency structure: Same as above, but with information regarding syntactic relations between chunks (later: DEP).
• Chunking with NER: Information on named entities is encoded in chunks (later: CHNK+NER).
• Dependency structure with Named Entities: Both dependency relations and named entities are included in each element (later: DEP+NER).

Five examples of preprocessing are represented in Table 2. Theoretically, the more generalized a sentence is, the fewer unique patterns it will contain, but the produced patterns will be more frequent (e.g., there are more ADJ N patterns than "pleasant day"). We compared the results for different preprocessing methods to find out whether it is better to represent sentences as more generalized or more specific.

3.2 Feature Extraction

From each of the eleven dataset versions a Bag-of-Words language model is generated, producing eleven different models (Bag-of-Words/Tokens, Bag-of-Lemmas, Bag-of-POS, Bag-of-Chunks, etc.). Sentences from the dataset processed with those models are used later in the input layer of classification. We also applied a traditional weight calculation scheme, namely term frequency with inverse document frequency (tf*idf). Term frequency tf(t, d) refers here to the traditional

3 https://taku910.github.io/cabocha/
4 Performance of MeCab is reported around 95-97% [Mori and Neubig, 2014], and CaboCha around 90% [Taku Kudo, 2002] for normal language. Although we acknowledge that in some cases the language of the Web could cause errors in POS tagging and word segmentation, we did not want to retrain the basic tools to fit our data, because we wanted the method to work using widely available

3 Proposed Methods

3.1 Data Preprocessing

The dataset used in this research (see sect. 4.1) was in Japanese. In the transcription of the Japanese language, spaces (" ") are not used. Therefore we needed to preprocess the dataset and make the sentences separable into elements for feature extraction.
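The tf*idf weighting mentioned above can be sketched directly from its definition, idf(t, D) = log(|D| / n_t) (a toy corpus; the logarithm is assumed natural, as the paper does not specify the base):

```python
import math

def tf(term, doc):
    return doc.count(term)          # raw frequency of term in document

def idf(term, docs):
    n_t = sum(1 for d in docs if term in d)   # documents containing term
    return math.log(len(docs) / n_t)

docs = [["what", "a", "pleasant", "day"], ["pleasant", "words"], ["a", "day"]]
print(tf("pleasant", docs[0]) * idf("pleasant", docs))  # tf=1, idf=log(3/2)
```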
We used MeCab2, a Japanese morphological analyzer.

2 http://taku910.github.io/mecab/

resources, so it was easily reproducible. Also, we assumed that even if such errors occur, as long as they are systematic, they will not cause trouble.

Table 2: Three examples of preprocessing of a sentence in Japanese; N = noun, PP = postpositional particle, ADV = adverb, ADJ = adjective, AUX = auxiliary verb, SYM = symbol, 1D, 2D, ... = depth of dependency relation, *0, *1, *2, ... = phrase number.

Sentence (transcription in alphabet): Kyōwanantekimochiiihinanda!
Glosses: Today TOP what pleasant day COP EXCL
Translation: What a pleasant day it is today!

Preprocessing examples:
–TOK: Kyō | wa | nante | kimochiii | hi | nanda | !
–POS: N | PP | ADV | ADJ | N | AUX | SYM
–TOK+POS: Kyō N | wa PP | nante ADV | kimochi ii ADJ | hi N | nanda AUX | ! SYM
–CHNK: Kyō wa | nante | kimochi ii | hi nanda!
–DEP: *0 3D Kyō wa | *1 2D nante | *2 3D kimochi ii | *3 -1D hi nanda!

raw frequency, meaning the number of times a term t (word, token) occurs in a document d. Inverse document frequency idf(t, D) is the logarithm of the total number of documents |D| in the corpus divided by the number of documents containing the term, nt. Finally, tf*idf refers to term frequency multiplied by inverse document frequency, as in equation 1.

idf(t, D) = log(|D| / nt)    (1)

3.3 Classification methods

SVM, or Support Vector Machines [Cortes and Vapnik, 1995], are a set of classifiers well established in AI and NLP. SVM represent data belonging to specified categories as points in space, and find an optimal hyperplane to separate the examples of each category. SVM were often used in cyberbullying detection (see Table 1). We used four types of SVM functions, namely: linear, the original function, which finds the maximum-margin hyperplane dividing the samples; the polynomial kernel, in which training samples are represented in a feature space over polynomials of the original variables (also used in [Dinakar et al., 2012]); the radial basis function (RBF) kernel, which approximates multivariate functions with a single univariate function, further radialised to be used in higher dimensions; and sigmoid, i.e., the hyperbolic tangent function [Lin and Lin, 2003].

The Naı̈ve Bayes classifier is a supervised learning algorithm applying Bayes' theorem with the assumption of strong (naive) independence between pairs of features, traditionally used as a baseline in text classification tasks.

kNN, or the k-Nearest Neighbors classifier, takes as input the k closest training samples with assigned classes and classifies an input sample into a class by a majority vote. It is often applied as a baseline, next to Naı̈ve Bayes. Here, we used the k=1 setting, in which the input sample is assigned to the class of the first nearest neighbor.

JRip, also known as Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [Cohen, 1995], learns rules incrementally and further optimizes them. It is efficient in the classification of noisy text [Sasaki and Kita, 1998] and for this purpose was used in cyberbullying detection previously [Dinakar et al., 2012].

J48 is an implementation of the C4.5 decision tree algorithm [Quinlan, 1993], which first builds decision trees from a labeled dataset; each tree node selects the optimal splitting criterion, further chosen to make the decision.

Random Forest in the training phase creates multiple decision trees to output the optimal class (the mode of the classes) in the classification phase [Breiman, 2001]. An improvement of RF over standard decision trees is its ability to correct the overfitting to the training set common in decision trees [Hastie et al., 2013].

SPEC, or Sentence Pattern Extraction arChitecture [Ptaszynski et al., 2015a], is a custom feature extraction and classification system. The features are defined as ordered combinations of sentence elements and contain patterns of tokens and n-grams with disjoint elements. The way the features are extracted (a combinatorial approach) resembles brute-force search algorithms. Pattern occurrences for each side of a binary-class dataset are used to calculate normalized weights of patterns. Next, the score of a sentence is calculated as a sum of the weights of patterns found in the input sentence. With multiple modifications, such as deletion of ambiguous patterns or various kinds of dataset preprocessing, [Ptaszynski et al., 2015b] were able to optimize the method to achieve somewhat high results, and it has been considered the best performing method so far for cyberbullying detection. In comparison we used their results optimized either for F1 or for BEP (the break-even point of Precision and Recall).

CNN, or Convolutional Neural Networks, are a type of feed-forward artificial neural network, an improved neural network model compared to the multilayer perceptron. Although CNN were originally designed for image recognition, their performance has been confirmed in many tasks, including NLP [Collobert and Weston, 2008] and sentence classification [Kim, 2014]. We applied a Convolutional Neural Network implementation with Rectified Linear Units (ReLU) as the neuron activation function [Nair and Hinton, 2010], and max pooling [Scherer et al., 2010], which applies a max filter to non-overlapping sub-parts of the input to reduce dimensionality and, in effect, correct over-fitting. We also applied dropout regularization on the penultimate layer, which prevents co-adaptation of hidden units by randomly omitting (dropping out) some of the hidden units during training [Hinton et al., 2012]. We applied two versions of CNN. The first, with one hidden convolutional layer containing 100 units, was applied as a proposed baseline. The second, the final proposed method, consisted of two hidden convolutional layers, containing 20 and 100 feature maps, respectively, both layers with a 5x5 patch size and 2x2 max-pooling, and Stochastic Gradient Descent [LeCun et al., 2012].

by the number of all words in the corpus. Since in our research we use a variety of different features, not only words, we will further call this measure Feature Density (FD). After calculating FD for all used datasets, we calculated Pearson's correlation coefficient (ρ-value) to see if there is any correlation between dataset generalization (FD) and the results (F-scores).

4.3 Results and Discussion

All results are summarized in Table 4. The results of the baselines (kNN, Naı̈ve Bayes) were low, as assumed. Although these classifiers can be tuned to high scores in typical sentiment analysis, they were not able to grasp the noisy language used in cyberbullying. However, it must be noticed that, especially with the help of named entities (NER), NB performed rather well, comparably to J48 or JRip. When it comes to decision-tree-based classifiers, J48 scored low, similarly as in [Dinakar et al., 2012].

4 Evaluation Experiments

4.1 Dataset

As the dataset for the experiments we used the one created originally by [Ptaszynski et al., 2010], and also widely used by [Nitta et al., 2013; Ptaszynski et al., 2015a; 2015b; 2016]. It contains 1,490 harmful and 1,508 non-harmful entries in Japanese, collected from unofficial school Web sites and fora. The original data was provided by the Human Rights Research Institute Against All Forms for Discrimination and Racism in Mie Prefecture, Japan5. The harmful and non-harmful sentences were collected and manually labeled by Internet Patrol members (expert annotators) according to instructions included in the governmental manual for dealing with cyberbullying [MEXT, 2008]. Some of those instruc-
However, tions are explained shortly below. Random Forest usually scored better even than SPEC. Unfor- The MEXT definition assumes that cyberbullying happens tunately, RF is highly time-inefficient, especially compared when a person is personally offended on the Web. This in- to SVM, and thus impractical. cludes disclosing the person’s name, personal information In many previous research on CB detection SVM were and other areas of privacy. Therefore, as the first feature dis- most commonly used with various success. As we can ob- tinguishable for cyberbullying MEXT defines private names serve, choice of appropriate function with good preprocess- (also initials and nicknames), names of institutions and affil- ing makes SVM comparable even to the proposed CNN. The iations, private information (address, phone numbres, entries best setting was linear-SVM trained on lemmatized dataset revealing personal information, etc.) (F1=.825). Moreover, although not scoring the highest, when Moreover, literature on cyberbullying indicates vulgari- the ratio of time-performance to the results is considered, ties as one of the most distinctive features of cyberbully- SVM can be considered as the most efficient classifier6 . ing [Patchin and Hinduja, 2006; Hinduja and Patchin, 2014; As for the method of preprocessing, most often TOK+NER Ptaszynski et al., 2010]. Also according to MEXT vulgar lan- and LEM+NER scored highest. This can be explained by the guage is distinguishable for cyberbullying, due to its ability fact that the data, which was annotated by expert annotators to convey offenses against particular persons. In the prepared following official governmental definition of cyberbullying, dataset all entries containing any of the above information often contained revealing of private information. As named was classified as harmful. Some examples from the dataset entity recognition covered most of these cases, it is reasonable are represented in Table 3. 
that it helped extracting meaningful features. Only for SPEC the results for the two above settings were not available since 4.2 Experiment Setup [Ptaszynski et al., 2015b] did not apply them in their research. The preprocessed original dataset provides eleven separate The best so far method, SPEC, was in fact scoring high, datasets for the experiment see sect. 3.1 for details). Thus the second best after the proposed here CNN. SPEC was also bet- experiment was performed eleven times, one time for each ter then SVM on every dataset, except one (LEM). Although kind of preprocessing. Each of the classifiers (sect. 3.3) was SPEC is highly time inefficient in the training phase (gener- tested on each version of the dataset in a 10-fold cross vali- ation of all combinatorial patterns), it is easy to implement dation procedure The results were calculated using standard and even in its fresh-trained form can be applied without any Precision (P), Recall (R), balanced F-score (F1) and Accu- additional packages to any external media. This could be an racy (A) As for the winning condition, we looked at which advantage when including it in CB detection software, such classifier achieved highest balanced F-score. as smartphone application, etc. When it comes to the proposed method, the initial baseline- Feature Density CNN with only one hidden layer did not perform well, al- To get a better grasp on the results we also analyzed the in- though still was better than the baselines and comparable to fluence of how a dataset was preprocessed on the results. most of the classifiers. A dataset is the more generalized, the fewer number of fre- However, the final proposed method, namely, the CNN quently appearing unique features it produces. Therefore to with two hidden layers, 5x5 patch size, max-pooling and estimate dataset generalization level we applied the notion of Stochastic Gradient Descent, outperformed all of the classi- Lexical Density (LD) [Ure, 1971]. 
It is a score representing an estimated measure of content per lexical units for a given 6 Training SVM, and simple classifiers (NB, kNN) was blazing corpus, calculated as the number of all unique words divided fast (several seconds). Simple CNN, Random Forest and JRip was slower (several minutes to 1-2 hours). SPEC and CNN-2L were the 5 http://www.pref.mie.lg.jp/jinkenc/hp/ longest (about one week). 7 Table 3: Three examples of cyberbullying entries gathered during Internet Patrol. The upper three represent strong sarcasm despite of the use of positive expressions in the sentence. English translation below Japanese content. >>104 Senzuri koi te shinu nante? sonna hageshii senzuri sugee naa. ”Senzuri masutaa” toshite isshou agamete yaru yo. >>104 Dying by ’flicking the bean’? Can’t imagine how one could do it so fiercely. I’m gonna worship her as a ’master-bator’, that’s for sure. 2-nen no tsutsuji no onna meccha busu suki na hito barashimashoka? 1-nen no anoko desuyo ne? kimogatterunde yamete agete kudasai Wanna know who likes that awfuly ugly 2nd-grade Azalea girl? Its that 1st-grader isn’t it? He’s disgusting, so let’s leave him mercifully in peace. Aitsu wa busakute sega takai dake no onna, busakute se takai dake ya noni yatara otoko-zuki meccha tarashide panko anna onna owatteru She’s just tall and apart of that she’s so freakin’ ugly, and despite of that she’s such a cock-loving slut, she’s finished already. Shinde kureeee, daibu kiraware-mono de yuumei, subete ga itaitashii... Please, dieeee, you’re so famous for being disliked by everyone, everything in you is so pathetic Table 4: Results of all applied classifiers (Scores averaged for Table 5: Left part: Information on features (unique features, positive and negative prediction calculated separately; best Feature Density) for each dataset. 
Right part: Pearson Corre- classifier for each dataset in bold type fond; best dataset gen- lation Coefficient (ρ-value) of Feature Density and all classi- eralization for each classifier – underlined). fier results, with statistical significance (2-sided p-value). LEM TOK LEMTOKCHNKPOS DEP DEP CHNK LEM TOK Dataset # unique # all Feature Classifier 2-sided +POS+POS +NER +NER +NER+NER Preprocessing 1grams 1grams Density ρ-value p-value Prec .639 .630 .644 .630 .578 .544 .593 .628 .576 .668 .663 DEP 12802 13957 0.917 CNN-2L 0.685 *p=0.035 high → kNN Rec .636 .627 .640 .628 .546 .543 .529 .528 .550 .663 .659 DEP+NER 12160 13956 0.871 SVM-pol -0.431 p=0.185 Fetaure sophistication (k=1) F1 .633 .625 .637 .626 .505 .542 .446 .427 .494 .658 .656 CHUNK 11389 13960 0.816 SVM-sig -0.534 p=0.091 Acc .636 .627 .640 .628 .546 .543 .529 .528 .550 .663 .659 CHUNK+NER 10657 13872 0.768 SPEC-BEP -0.550 p=0.133 Prec .678 .671 .686 .682 .666 .570 .652 .672 .685 .708 .703 TOK+POS 6565 34874 0.188 RandForest -0.560 p=0.073 Naı̈ve Rec .674 .669 .682 .678 .627 .569 .578 .555 .598 .705 .701 TOK 6464 36234 0.178 SVM-lin -0.564 p=0.076 Bayes F1 .673 .668 .681 .677 .599 .568 .511 .453 .539 .705 .701 LEM+POS 6227 36426 0.171 SPEC-F1 -0.636 p=0.066 Acc .674 .669 .682 .678 .627 .569 .578 .555 .598 .705 .701 LEM 6103 36412 0.168 SVM-rad -0.639 *p=0.034 TOK+NER 5967 38672 0.154 CNN-1L -0.709 *p=0.019 ← low Prec .606 .614 .604 .603 .628 .553 .643 .505 .685 .699 .700 LEM+NER 5581 38672 0.144 JRip -0.729 *p=0.011 Rec .606 .613 .603 .603 .555 .553 .533 .510 .598 .675 .672 JRip POS 13 26650 0.001 NB -0.736 *p=0.013 F1 .606 .613 .603 .603 .469 .553 .408 .345 .539 .663 .658 J48 -0.791 **p=0.006 Acc .606 .613 .603 .603 .555 .553 .533 .510 .598 .675 .712 ∗p≤ 0.05, ∗∗p≤ 0.01 → kNN -0.809 **p=0.004 Prec .672 .671 .683 .675 .615 .566 .652 .260 .645 .711 .707 Rec .671 .666 .681 .672 .548 .566 .533 .510 .517 .708 .704 J48 F1 .670 .663 .680 .669 .458 .566 .408 .344 .365 .706 .707 fiers in almost all settings. 
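A minimal sketch of how Feature Density and the Pearson correlation reported in Table 5 can be computed (toy inputs, not the paper's datasets):

```python
def feature_density(features):
    """Unique features divided by all features in the corpus
    (Ure's Lexical Density, generalized to arbitrary feature types)."""
    return len(set(features)) / len(features)

def pearson(xs, ys):
    """Pearson's correlation coefficient (rho) between two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(feature_density(["N", "PP", "ADV", "N"]))  # 3 unique / 4 -> 0.75
print(pearson([1, 2, 3], [2, 4, 6]))             # perfect correlation -> 1.0
```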
The only situation where SPEC Acc .671 .666 .681 .672 .548 .566 .533 .510 .517 .708 .663 scored higher (only POS features) reveals well known char- Prec .816 .803 .818 .809 .662 .547 .623 .619 .639 .818 .809 Random Rec .809 .795 .809 .801 .632 .546 .607 .582 .580 .809 .802 acteristics of Neural Nets, which perform poorly on a small Forest F1 .808 .794 .808 .799 .610 .544 .590 .540 .522 .807 .800 number of unique features. As for the second situation, using Acc .809 .795 .809 .801 .632 .546 .607 .582 .580 .809 .802 Prec .777 .768 .827 .777 .679 .563 .651 .639 .606 .820 .781 only dependency features (DEP), which on the other hand contained the largest number of features, generation of the linear Rec .777 .766 .825 .776 .645 .563 .615 .577 .603 .818 .781 F1 .776 .766 .825 .775 .623 .563 .586 .531 .597 .818 .780 model was not feasible on the applied computer, and thus the Acc .777 .766 .825 .776 .645 .563 .615 .577 .603 .818 .781 results were not calculated. In the near future we plan to re- plynomial Prec .262 .499 .262 .263 .260 .553 .260 .260 .260 .260 .260 Rec .512 .499 .512 .513 .510 .545 .510 .510 .510 .510 .510 peat the experiment in a more efficient environment, such as a F1 .346 .450 .347 .348 .344 .528 .344 .344 .344 .344 .344 cloud computing service (Google Cloud Platform7 , Microsoft SVM Acc .512 .499 .512 .513 .510 .545 .510 .510 .510 .510 .510 Prec .797 .753 .262 .793 .260 .565 .260 .260 .260 .752 .779 Azure8 , or Amazon EC29 ). radial Rec .771 .747 .512 .756 .510 .565 .510 .510 .510 .516 .778 As for the top three best performing settings, 2-layer CNN F1 .765 .746 .347 .746 .344 .565 .344 .344 .344 .358 .778 Acc .771 .747 .512 .756 .510 .565 .510 .510 .510 .516 .778 trained on chunks alone scored high, close to 90% of F- Prec .757 .746 .262 .752 .260 .562 .260 .260 .260 .260 .771 score. Dependency features with NER was second best with sigmoid Rec .549 .736 .512 .538 .510 .562 .510 .510 .510 .510 .606 F1=92.7%. 
However, the most optimal setting was the 2-layer F1 .425 .733 .347 .403 .344 .561 .344 .344 .344 .344 .530 Acc .549 .736 .512 .538 .510 .562 .510 .510 .510 .510 .606 CNN trained on chunks with named entities and reached F- Prec .802 .786 .784 .770 .655 .614 .548 .591 .633 N/A N/A score equal 93.5%, which is a result far more satisfying then SPEC Rec .802 .786 .784 .770 .655 .614 .548 .591 .633 N/A N/A expected, and exceeds second non-NN classifier (SVM on (highest F1 .802 .786 .784 .770 .655 .614 .548 .591 .633 N/A N/A BEP) Acc .775 .780 .600 .765 .790 .635 .525 .640 .548 N/A N/A lemmas) over 10-percentage points. Prec .807 .756 .713 .724 .563 .528 .500 .491 .490 N/A N/A Next, we analyzed the correlation of data preprocessing SPEC Rec .798 .839 .885 .842 .879 .946 .982 1.000 1.000 N/A N/A (highest F1 .803 .796 .790 .778 .686 .677 .663 .658 .658 N/A N/A with Feature Density (FD). The results were represented in F1) Acc .808 .784 .770 .766 .603 .550 .510 .491 .490 N/A N/A table 5. CNN Prec .700 .678 .724 .704 .506 .544 .616 .499 .496 .720 .739 The results clearly divided the classifiers into three groups. (1 hid- Rec .700 .678 .724 .704 .510 .544 .525 .505 .509 .716 .738 den F1 .700 .677 .724 .704 .433 .544 .393 .461 .374 .714 .738 First group of the lowest performing classifiers (kNN, NB, layer) Acc .700 .678 .724 .704 .510 .544 .525 .505 .509 .716 .738 CNN Prec .867 .824 .820 .833 .936 .500 .929 N/A .899 .847 .849 (2 hid- Rec .866 .821 .819 .831 .935 .500 .927 N/A .893 .847 .849 7 https://cloud.google.com/ den F1 .866 .821 .819 .830 .935 .495 .927 N/A .892 .847 .849 8 https://azure.microsoft.com layers) Acc .866 .821 .819 .831 .935 .500 .927 N/A .893 .847 .849 9 https://aws.amazon.com/ec2/ 8 J48, JRip, CNN-1L) was strongly negatively correlated with vised bayesian model for violence detection in social me- FD, which means these classifiers lose their general perfor- dia. 2013. mance the more feature-dense is the model. This suggests [Cohen, 1995] William W Cohen. 
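The ρ-values in Table 5 are Pearson correlation coefficients between Feature Density and each classifier's results across the preprocessing settings. The coefficient itself can be sketched in a few lines of pure Python (the numbers below are illustrative toy values, not the paper's data):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy values: F-scores that tend to rise together with Feature Density
# yield a positive rho, as observed for CNN-2L.
fd = [0.144, 0.168, 0.178, 0.768, 0.816, 0.917]
f1 = [0.847, 0.866, 0.821, 0.892, 0.935, 0.927]
print(round(pearson_r(fd, f1), 3))
```

The two-sided p-values reported in Table 5 additionally require the t-distribution; in practice scipy.stats.pearsonr returns both the coefficient and the p-value.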
The second group contained the classifiers (all SVMs, Random Forest and both SPEC) that performed somewhat high. Their correlation with FD was negative, from weak (-0.431) to somewhat high (-0.639). This, supported by the lack of statistical significance, means that FD does not correlate well with such classifiers, and some other characteristic should be used for optimizing the dataset preprocessing applied with those classifiers.

Finally, we made an interesting discovery about the correlation between FD and our proposed method (2-layer CNN). The classifier correlated positively, nearly strongly, with FD (ρ = 0.685). This suggests that its performance could be improved by increasing the Feature Density of the applied dataset. We plan to follow this path in the near future.

5 Conclusions and Future Work

In this paper we presented our research on cyberbullying (CB) detection. Cyberbullying has become a serious problem in modern society, which is constantly connected to the Internet. Manual countermeasures, such as Internet Patrol, have been undertaken to deal with CB. Unfortunately, reading through the whole Internet to find CB entries is like looking for a needle in a haystack, while keeping CB victims exposed to harmful messages leads to serious consequences.

To help respond quickly to the ever-growing CB problem, research on automatic cyberbullying detection has started to sprout; unfortunately, the results have been only partially satisfying. We proposed a Deep Learning approach to the problem, based on Convolutional Neural Networks (CNN).

The proposed optimized CNN model not only outperformed all other classifiers by over 11 percentage points, scoring a close-to-ideal F-score (93.5%), but also revealed an unusual characteristic by correlating positively and nearly strongly with Feature Density. This provides an informative hint on how to further improve not only the proposed method (by increasing the FD of the dataset), but also the other classifiers (by decreasing FD, etc.).
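For illustration, the core operation of such a text CNN, namely a convolution filter slid over a sentence's token embeddings followed by ReLU and max-over-time pooling (as in [Kim, 2014]), can be sketched in pure Python; the dimensions and values below are made-up toys, not the paper's actual architecture:

```python
def conv_feature(embeddings, filt, bias=0.0):
    """Apply one convolution filter (a window of len(filt) token vectors)
    along the sentence, then ReLU and max-over-time pooling."""
    w = len(filt)  # filter window size, in tokens
    feats = []
    for i in range(len(embeddings) - w + 1):
        window = embeddings[i:i + w]
        act = bias + sum(f * e
                         for fv, ev in zip(filt, window)
                         for f, e in zip(fv, ev))
        feats.append(max(0.0, act))  # ReLU
    return max(feats)                # max-over-time pooling: one scalar per filter

# Toy sentence of 4 tokens with 2-dimensional embeddings; one filter of width 2.
sent = [[0.1, 0.3], [0.5, -0.2], [0.9, 0.4], [-0.1, 0.2]]
filt = [[1.0, 0.0], [0.0, 1.0]]
print(round(conv_feature(sent, filt), 3))  # -> 1.1
```

A full classifier would apply many such filters, concatenate the pooled features, and feed them through one or two dense hidden layers into a softmax output.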
In the near future we plan to test the limits of potential optimization, also by applying different dataset preprocessing methods (sentiment, etc.) and different language models (n-gram, skip-gram, language combinatorics, etc.). We also plan to implement the developed model in a smartphone application for "in-the-field" testing and further practical research on cyberbullying and ways of its mitigation.

References

[Breiman, 2001] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[Bull, 2010] Glen Bull. The always-connected generation. Learning & Leading with Technology, 38(3):28–29, 2010.
[Cano Basave et al., 2013] Amparo Elizabeth Cano Basave, Yulan He, Kang Liu, and Jun Zhao. A weakly supervised bayesian model for violence detection in social media. 2013.
[Cohen, 1995] William W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.
[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[Cross et al., 2009] Donna Cross, Therese Shaw, Lydia Hearn, Melanie Epstein, Helen Monks, Leanne Lester, and Laura Thomas. Australian covert bullying prevalence study. 2009.
[Dinakar et al., 2012] Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):18, 2012.
[Hasebrink et al., 2008] Uwe Hasebrink, Sonia Livingstone, and Leslie Haddon. Comparing children's online opportunities and risks across Europe: Cross-national comparisons for EU Kids Online. 2008.
[Hastie et al., 2013] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. 2013.
[Hinduja and Patchin, 2010] Sameer Hinduja and Justin W. Patchin. Bullying, cyberbullying, and suicide. Archives of Suicide Research, 14(3):206–221, 2010.
[Hinduja and Patchin, 2014] Sameer Hinduja and Justin W. Patchin. Bullying Beyond the Schoolyard: Preventing and Responding to Cyberbullying. Corwin Press, 2014.
[Hinton et al., 2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
[Kontostathis et al., 2013] April Kontostathis, Kelly Reynolds, Andy Garron, and Lynne Edwards. Detecting cyber bullying: query terms and techniques. In Proceedings of the 5th Annual ACM Web Science Conference, pages 195–204. ACM, 2013.
[Kowalski and Limber, 2007] Robin M. Kowalski and Susan P. Limber. Electronic bullying among middle school students. Journal of Adolescent Health, 41(6):S22–S30, 2007.
[Kudo and Matsumoto, 2002] Taku Kudo and Yuji Matsumoto. Japanese dependency analysis using cascaded chunking. In CoNLL 2002: Proceedings of the 6th Conference on Natural Language Learning 2002 (COLING 2002 Post-Conference Workshops), pages 63–69, 2002.
[Lazuras et al., 2012] Lambros Lazuras, Jacek Pyżalski, Vassilis Barkoukis, and Haralambos Tsorbatzoudis. Empathy and moral disengagement in adolescent cyberbullying: Implications for educational intervention and pedagogical practice. 2012.
[LeCun et al., 2012] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
[Lin and Lin, 2003] Hsuan-Tien Lin and Chih-Jen Lin. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Submitted to Neural Computation, pages 1–32, 2003.
[MEXT, 2008] MEXT. 'Netto-jō no ijime' ni kansuru taiō manyuaru jirei shū (gakkō, kyōin muke) ["Bullying on the net" manual for handling and collection of cases (for schools and teachers)] (in Japanese). 2008.
[Mori and Neubig, 2014] Shinsuke Mori and Graham Neubig. Language resource addition: Dictionary or corpus? In LREC, pages 1631–1636, 2014.
[Nair and Hinton, 2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[Nitta et al., 2013] Taisei Nitta, Fumito Masui, Michal Ptaszynski, Yasutomo Kimura, Rafal Rzepka, and Kenji Araki. Detecting cyberbullying entries on informal school websites based on category relevance maximization. In IJCNLP, pages 579–586, 2013.
[Patchin and Hinduja, 2006] J. W. Patchin and S. Hinduja. Bullies move beyond the schoolyard: A preliminary look at cyberbullying. Youth Violence and Juvenile Justice, 4(2):148–169, 2006.
[Ptaszynski et al., 2010] M. Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka, K. Araki, and Y. Momouchi. In the service of online order: Tackling cyber-bullying with machine learning and affect analysis. International Journal of Computational Linguistics Research, 1(3):135–154, 2010.
[Ptaszynski et al., 2015a] Michal Ptaszynski, Fumito Masui, Yasutomo Kimura, Rafal Rzepka, and Kenji Araki. Brute force works best against bullying. In IJCAI 2015 Workshop on Intelligent Personalization (IP 2015), pages 28–29, 2015.
[Ptaszynski et al., 2015b] Michal Ptaszynski, Fumito Masui, Yasutomo Kimura, Rafal Rzepka, and Kenji Araki. Extracting patterns of harmful expressions for cyberbullying detection. In Proceedings of the 7th Language and Technology Conference (LTC'15), pages 370–375, 2015.
[Ptaszynski et al., 2016] M. Ptaszynski, F. Masui, T. Nitta, S. Hatakeyama, Y. Kimura, R. Rzepka, and K. Araki. Sustainable cyberbullying detection with category-maximized relevance of harmful phrases and double-filtered automatic optimization. International Journal of Child-Computer Interaction, 8:15–30, 2016.
[Pyżalski, 2012] Jacek Pyżalski. From cyberbullying to electronic aggression: Typology of the phenomenon. Emotional and Behavioural Difficulties, 17(3-4):305–317, 2012.
[Quinlan, 1993] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann Publishers, 1993.
[Rutter, 1987] D. R. Rutter. Communicating by Telephone. International Series in Experimental Social Psychology, Vol. 15. Elsevier Science Limited, 1987.
[Sarna and Bhatia, 2015] Geetika Sarna and M. P. S. Bhatia. Content based approach to find the credibility of user in social networks: an application of cyberbullying. International Journal of Machine Learning and Cybernetics, pages 1–13, 2015.
[Sasaki and Kita, 1998] Minoru Sasaki and Kenji Kita. Rule-based text categorization using hierarchical categories. In Systems, Man, and Cybernetics, 1998 IEEE International Conference on, volume 3, pages 2827–2830. IEEE, 1998.
[Scherer et al., 2010] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In International Conference on Artificial Neural Networks, pages 92–101. Springer, 2010.
[Sood et al., 2012] Sara Owsley Sood, Elizabeth F. Churchill, and Judd Antin. Automatic identification of personal insults on social news sites. Journal of the American Society for Information Science and Technology, 63(2):270–285, 2012.
[Sourander et al., 2010] A. Sourander, A. B. Klomek, M. Ikonen, J. Lindroos, T. Luntamo, M. Koskelainen, T. Ristkari, and H. Helenius. Psychosocial risk factors associated with cyberbullying among adolescents: A population-based study. Archives of General Psychiatry, 67(7):720–728, 2010.
[Turney, 2002] Peter D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL 2002, pages 417–424. Association for Computational Linguistics, 2002.
[Ure, 1971] J. Ure. Lexical density and register differentiation. Applications of Linguistics, pages 443–452, 1971.
[Zheng, 2012] R. Zheng. Evolving Psychological and Educational Perspectives on Cyber Behavior. Premier Reference Source. Information Science Reference, 2012.