Learning Deep on Cyberbullying is Always Better Than Brute Force

Michal Ptaszynski†, Juuso Kalevi Kristian Eronen‡† and Fumito Masui†
†Kitami Institute of Technology, Kitami, Japan
‡Tampere University of Technology, Tampere, Finland
{ptaszynski,f-masui}@cs.kitami-it.ac.jp, juuso.eronen@student.tut.fi

Abstract

In this paper we present our research on the detection of cyberbullying (CB), which stands for humiliating other people through the Internet. CB has become recognized as a social problem, and its mostly juvenile victims usually fall into depression, self-mutilate, or even commit suicide. To deal with the problem, school personnel perform Internet Patrol (IP) by reading through the available Web contents to spot harmful entries. It is crucial to help IP members detect malicious contents more efficiently. A number of studies have tackled the problem in recent years. However, due to the complexity of the language used in cyberbullying, the results have remained only mildly satisfying. We propose a novel method for automatic cyberbullying detection based on Convolutional Neural Networks and increased Feature Density. The experiments performed on actual cyberbullying data showed a major advantage of our approach over all previous methods, including the best performing method so far, based on a Brute-Force Search algorithm.

1 Introduction

Recent years brought to light the problem of cyberbullying (CB), defined as the exploitation of open online means of communication, such as Internet forum boards or social network services (SNS), to convey harmful and disturbing information about private individuals, often children and students [Patchin and Hinduja, 2006]. The problem was further exacerbated by the popularization of smartphones and tablet computers, which allow nearly constant use of SNS at home, at work/school or in motion [Bull, 2010].

Cyberbullying messages commonly ridicule someone's personality or appearance, or spread rumors. They can lead victims to self-mutilation, even suicide, or, on the opposite, to attacking their offenders in revenge [Hinduja and Patchin, 2010]. The global increase of cyberbullying cases opened a public debate on whether such messages could be spotted earlier to prevent the tragedies, and on the freedom of speech on the Internet in general.

In some countries, such as Japan, the problem has become serious enough to be noticed on a ministerial level [MEXT, 2008]. As one of the ways to deal with the problem, school personnel have started Internet Patrol (IP) to detect Web forum sites and SNS containing cyberbullying contents. Unfortunately, as IP is performed manually, reading through countless amounts of Websites makes it an uphill struggle.

Some research has started developing methods for the automatic detection of CB to help in this struggle [Ptaszynski et al., 2010; Dinakar et al., 2012; Ptaszynski et al., 2016]. Unfortunately, even with multiple improvements, the results have remained only partially satisfying. This is caused by a multitude of language ambiguities and styles used in CB.

In this paper we propose a novel Convolutional Neural Network (CNN) approach to automatic cyberbullying detection. Moreover, based on the analysis of the characteristics of CNN and the initial results, we propose an optimization of CNN by increasing the Feature Density of the training data.

The rest of the paper is organized in the following way. Firstly, we describe the problem of cyberbullying and present some of the previous research related to ours. Next, we describe the proposed method and other methods used for comparison. Further, we present the dataset used in this research and explain the evaluation settings, followed by an analysis of the experiment results and a discussion.

2 Research Background

2.1 Cyberbullying: Description of a Problem

The choice of media used in communication can cause increased psychological distance between interlocutors [Rutter, 1987], which can lead to an empathy deficit, especially in Internet behavior [Zheng, 2012]. This is one of the reasons offensive messages have existed for many years on the Internet. With the increase of our dependence on technology in everyday lives, the problem gained in seriousness and conceptualized itself in the form of online harassment, or cyberbullying (CB) [Patchin and Hinduja, 2006; Hinduja and Patchin, 2010; Pyżalski, 2012; Lazuras et al., 2012].

Some of the first research on CB, based on numerous surveys [Patchin and Hinduja, 2006], revealed that such harmful information may include threats, sexual remarks, pejorative labels, or false statements aimed at humiliating others. When posted on a social network, such as Facebook or Twitter, it could disclose humiliating private information about the victim, defaming and ridiculing them publicly. Some re-

Table 1: Summary of previous research in CB detection.
Authors, Year | Language of processing | Feature Extraction Method | Classification Method
[Ptaszynski et al., 2010] | Japanese | unigrams (BoW), harmful word lexicon | SVM
[Sood et al., 2012] | English | unigrams, bigrams, stems (presence, freq, tf-idf) | SVM, various weighting
[Dinakar et al., 2012] | English | unigrams, hand-crafted word- | SVM, JRip, Naı̈ve Bayes,

applied a lexicon of such words to train an SVM classifier. With a number of optimizations they were able to detect cyberbullying with 88.2% of F-score. However, increasing the data caused a decrease in results, which made them abandon SVM as not ideal for the language ambiguities typical for CB.

[Sood et al., 2012] focused on detection of personal insults, the negative influence of which could at most cause the Internet community to fall into recession.
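The lexicon-based feature extraction used in that line of work can be illustrated with a minimal sketch (the word list below is hypothetical, not the actual lexicon from any of the cited studies):

```python
# Minimal sketch of lexicon-based features for harmfulness scoring.
# HARMFUL_LEXICON is a hypothetical toy list, not the real lexicon.
HARMFUL_LEXICON = {"ugly", "die", "stupid"}

def lexicon_features(tokens):
    """Return (harmful-word count, binary presence flag) for one entry."""
    hits = sum(1 for t in tokens if t.lower() in HARMFUL_LEXICON)
    return hits, int(hits > 0)

print(lexicon_features("you are so ugly please die".split()))  # -> (2, 1)
```

In the studies above, such counts were combined with unigram (bag-of-words) features and fed to a classifier such as an SVM.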
In their research they used as features single words and bigrams, weighted them using either presence (1/0), term frequency or tf-idf, and used them to train an SVM classifier. As a dataset they used a corpus of six thousand entries they collected from various online fora. To prepare a gold standard for their experiments they used a crowd-sourcing approach, with untrained layperson annotators hired for a classification task through Mechanical Turk.

lists (Ortony lexicon of negative words, profane words, etc.), POS | J48
[Nitta et al., 2013] | Japanese | seed words in 3 categories | SO-PMI-IR maximized for category
[Cano Basave et al., 2013] | English | violence-related words (derived from violence-related topics from Twitter and DBPedia) | VDM (weakly supervised Bayesian model)
[Sarna and Bhatia, 2015] | English (?) | "bad words", positive and negative sentiment words, pronouns, proper nouns, links | Naı̈ve Bayes, kNN, Decision Trees, SVM
[Ptaszynski et al., 2015a] | Japanese | Brute-Force search algorithm | custom pattern matching classifier
[Ptaszynski et al., 2015b] | Japanese | Brute-Force search algorithm with various features (tokens, POS, lemmas, etc.) | custom pattern matching classifier
[Ptaszynski et al., 2016] | Japanese | seed words grouped in 3 categories | SO-PMI-IR maximized for category with seed word optimization

Later, [Dinakar et al., 2012] proposed their approach to detection and mitigation of cyberbullying. An improvement of this paper in comparison to previous research was its wider perspective: they did not only focus on detection, but also proposed some ways of mitigation. The classifiers they used scored up to 58-77% of F-score, depending on the kind of detected harassment. Their best proposed classifier was SVM, which confirmed the considerably high effectiveness of SVM for cyberbullying in English, similarly to the research done by Ptaszynski et al. for Japanese in 2010.

An interesting work was done by [Kontostathis et al., 2013], who performed a thorough analysis of cyberbullying entries on Formspring.me. They were able to identify common cyberbullying terms, and applied them in classification with the use of a machine learning method based on Essential Dimensions of Latent Semantic Indexing (EDLSI).

[Cano Basave et al., 2013] proposed the Violence Detection Model (VDM), a weakly supervised Bayesian model. They did not, however, focus strictly on cyberbullying, but widened their scope to the more generally understood "violence," which made the problem more understandable, and thus feasible, for untrained annotators. The datasets were extracted from violence-related topics on Twitter and DBPedia.

[Nitta et al., 2013] proposed a method to automatically detect harmful entries with an extended SO-PMI-IR score [Turney, 2002] to calculate the relevance of a document with harmful contents. They also grouped the seed words into three categories (abusive, violent, obscene) and maximized the relevance of the categories. Their method was evaluated comparatively high, with the best achieved Precision around 91% (although with Recall less than 10%).

Unfortunately, a re-evaluation of their method done by [Ptaszynski et al., 2016] two years later indicated that the method lost most of its Precision (over a 30-percentage-point drop) in that time. They hypothesized that this was caused by external factors such as Web page re-ranking, changes in SNS user policies, etc. They improved the method by automatically acquiring and filtering new harmful seed words, with some success (P=76%). Unfortunately, they were unable to revive the method to its original performance.

[Sarna and Bhatia, 2015] based their method on a set of features like "bad words", positive/negative sentiment words, and other common features like pronouns, etc., to estimate user credibility. They applied those features to four standard classifiers (Naı̈ve Bayes, kNN, Decision Trees, SVM). The results of the classification were further used in a User Behavior Analysis (BAU) model and a User Credibility Analysis (CAU) model. Unfortunately, although their approach suggested the inclusion of phenomena such as irony or rumors, in practice they only focused on messages containing "bad words." Moreover, neither these words, the dataset, nor its annotation schema were sufficiently described in the paper.

ported that CB happens for up to eight percent of children in schools in: Australia [Cross et al., 2009], the United States [Kowalski and Limber, 2007], or Finland [Sourander et al., 2010]. Studies on CB across Europe indicate that even one in five young people (not limited to the school environment) could be exposed to cyberbullying [Hasebrink et al., 2008; Pyżalski, 2012]. As of 2015, the urgent need to deal with CB has even made insurance companies offer policies covering costs that could occur as a result of cyberbullying1.

In Japan, after several suicide cases of CB victims, the Ministry of Education, Culture, Sports, Science and Technology (MEXT) increased the priority of the problem, provided a yearly updated manual for handling CB cases and incorporated it in the school staff education program [MEXT, 2008].

To actively deal with the problem, school staff are engaged in Internet Patrol (IP). Based on the MEXT definition of CB, they read through all Internet contents, and when they find a harmful entry they send a deletion request to the Web page administrator and report the event to the police. Unfortunately, since IP has been performed manually as voluntary work, and the amount of Internet fora and SNS to read through grows exponentially, manual Web surveillance has been an uphill task and a psychological burden for the IP members.

1 http://news.na.chubb.com/2016-03-30-Cyber-Bullying-Insurance-Now-Available-to-Chubbs-U-S-Homeowners-Customers-2 [accessed on 2017/02/19]

2.2 Previous Research on Cyberbullying Detection

Although the problem of CB has been studied in social sciences and child psychology for over ten years [Patchin and Hinduja, 2006; Pyżalski, 2012], only a few attempts were made so far to detect and study the problem with the help of information technology. Below we present the research most relevant to this day (also summarized in Table 1). As the first recorded study, [Ptaszynski et al., 2010] performed affect analysis on a small dataset of CB entries to find out that distinctive features for CB were vulgar words. They

alyzer and CaboCha3, a Japanese dependency structure analyzer, to preprocess the dataset in the following ways4:

• Tokenization: All words, punctuation marks, etc. are separated by spaces (later: TOK).
• Lemmatization: Like the above, but the words are represented in their generic (dictionary) forms, or "lemmas" (later: LEM).
• Parts of speech: Words are replaced with their representative parts of speech (later: POS).
• Tokens with POS: Both words and POS information are included in one element (later: TOK+POS).
• Lemmas with POS: Like the above, but with lemmas instead of words (later: LEM+POS).
• Tokens with Named Entity Recognition: Words encoded together with information on what named entities (private name of a person, organization, numericals, etc.) appear in the sentence. The NER information is annotated by CaboCha (later: TOK+NER).

Finally, [Ptaszynski et al., 2015a] proposed a method of pattern-based language modeling. The patterns, defined as ordered combinations of sentence elements, were extracted with a Brute-Force search algorithm and used in classification. They reported encouraging initial results, and further improved the method by applying multiple data preprocessing techniques [Ptaszynski et al., 2015b].
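The combinatorial pattern extraction underlying this approach can be sketched as follows (a toy illustration of ordered combinations with possible gaps between elements; this is an illustration of the general idea, not the actual implementation):

```python
from itertools import combinations

def extract_patterns(elements, max_len=3):
    """All ordered combinations (subsequences) of sentence elements up to
    max_len; gaps between chosen elements are allowed (disjoint patterns)."""
    patterns = set()
    for n in range(1, max_len + 1):
        for combo in combinations(range(len(elements)), n):
            patterns.add(tuple(elements[i] for i in combo))
    return patterns

pats = extract_patterns(["kyō", "wa", "nante", "kimochiii"], max_len=2)
print(len(pats))  # 4 unigrams + C(4,2) = 6 ordered pairs -> 10
```

For a sentence of n elements this generates on the order of 2^n candidate patterns, which illustrates why such brute-force extraction is costly in the training phase.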
At present their method is considered the best performing one; thus we will use it in comparison with the method proposed herein.

2.3 Research Gaps

Dataset preparation: Some of the above-mentioned methods suffer from subjective data preparation. In [Cano Basave et al., 2013] or [Dinakar et al., 2012], the problem was not defined strictly enough and was annotated by laypeople, while CB is a complex social phenomenon and needs to be handled by experts. [Sood et al., 2012; Cano Basave et al., 2013] reformulated the problem to be feasible for laypeople. [Dinakar et al., 2012] focused on overlapping concepts like sexual or racial harassment. Finally, [Sarna and Bhatia, 2015] collected the datasets with no specific standard.

Feature selection: Previous research included as features mostly words, or simple n-grams (bigrams). Some [Nitta et al., 2013] applied only a small number of features, while others [Dinakar et al., 2012] built up more complex models, however still based mostly on words. Moreover, using only top-down selected features [Nitta et al., 2013; Sarna and Bhatia, 2015], while somewhat reasonable (e.g., violent or obscene words), requires human workload and background knowledge of the dataset, thus being inefficient.

Classification methods: Although various classifiers have been tested (SVM, Naive Bayes, or Decision Trees), usually SVM reached the highest, though only mildly satisfying, scores. To overcome the performance of previous methods, we apply Convolutional Neural Networks in classification and optimize them by studying the correlation of the results with Feature Density.

• Lemmas with NER: Like the above, but with lemmas (later: LEM+NER).
• Chunking: Larger sub-parts of sentences separated syntactically, such as noun phrases, verb phrases, predicates, etc., but without dependency relations (later: CHNK).
• Dependency structure: Same as above, but with information regarding syntactic relations between chunks (later: DEP).
• Chunking with NER: Information on named entities is encoded in chunks (later: CHNK+NER).
• Dependency structure with Named Entities: Both dependency relations and named entities are included in each element (later: DEP+NER).

Five examples of preprocessing are represented in Table 2. Theoretically, the more generalized a sentence is, the fewer unique patterns it will contain, but the produced patterns will be more frequent (e.g., there are more ADJ N patterns than "pleasant day"). We compared the results for different preprocessing methods to find out whether it is better to represent sentences as more generalized or more specific.

3.2 Feature Extraction

From each of the eleven dataset versions a Bag-of-Words language model is generated, producing eleven different models (Bag-of-Words/Tokens, Bag-of-Lemmas, Bag-of-POS, Bag-of-Chunks, etc.). Sentences from the dataset processed with those models are used later in the input layer of classification. We also applied a traditional weight calculation scheme, namely term frequency with inverse document frequency (tf*idf). Term frequency tf(t, d) refers here to the traditional

3 https://taku910.github.io/cabocha/
4 Performance of MeCab is reported around 95-97% [Mori and Neubig, 2014], and CaboCha around 90% [Taku Kudo, 2002] for normal language. Although we acknowledge that in some cases the language of the Web could cause errors in POS tagging and word segmentation, we did not want to retrain the basic tools to fit our data, because we wanted the method to work using widely available

3 Proposed Methods

3.1 Data Preprocessing

The dataset used in this research (see sect. 4.1) was in Japanese. In the transcription of the Japanese language, spaces (" ") are not used. Therefore we needed to preprocess the dataset and make the sentences separable into elements for feature extraction.
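The tf*idf weighting mentioned above can be sketched directly from its definition, idf(t, D) = log(|D| / n_t) (a toy corpus; the logarithm is assumed natural, as the paper does not specify the base):

```python
import math

def tf(term, doc):
    return doc.count(term)          # raw frequency of term in document

def idf(term, docs):
    n_t = sum(1 for d in docs if term in d)   # documents containing term
    return math.log(len(docs) / n_t)

docs = [["what", "a", "pleasant", "day"], ["pleasant", "words"], ["a", "day"]]
print(tf("pleasant", docs[0]) * idf("pleasant", docs))  # tf=1, idf=log(3/2)
```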
We used MeCab2, a Japanese morphological analyzer.

2 http://taku910.github.io/mecab/

resources, so it was easily reproducible. Also, we assumed that even if such errors occur, as long as they are systematic, they will not cause trouble.

Table 2: Three examples of preprocessing of a sentence in Japanese; N = noun, PP = postpositional particle, ADV = adverb, ADJ = adjective, AUX = auxiliary verb, SYM = symbol, 1D, 2D, ... = depth of dependency relation, *0, *1, *2, ... = phrase number.

Sentence (transcription in alphabet): Kyōwanantekimochiiihinanda!
Glosses: Today TOP what pleasant day COP EXCL
Translation: What a pleasant day it is today!

Preprocessing examples:
–TOK: Kyō | wa | nante | kimochiii | hi | nanda | !
–POS: N | PP | ADV | ADJ | N | AUX | SYM
–TOK+POS: Kyō N | wa PP | nante ADV | kimochi ii ADJ | hi N | nanda AUX | ! SYM
–CHNK: Kyō wa | nante | kimochi ii | hi nanda!
–DEP: *0 3D Kyō wa | *1 2D nante | *2 3D kimochi ii | *3 -1D hi nanda!

raw frequency, meaning the number of times a term t (word, token) occurs in a document d. Inverse document frequency idf(t, D) is the logarithm of the total number of documents |D| in the corpus divided by the number of documents containing the term, nt. Finally, tf*idf refers to term frequency multiplied by inverse document frequency, as in equation 1.

idf(t, D) = log(|D| / nt)    (1)

3.3 Classification methods

SVM, or Support Vector Machines [Cortes and Vapnik, 1995], are a set of classifiers well established in AI and NLP. SVM represent data belonging to specified categories as points in space, and find an optimal hyperplane to separate the examples of each category. SVM were often used in cyberbullying detection (see Table 1). We used four types of SVM functions, namely: linear, the original function, which finds the maximum-margin hyperplane dividing the samples; the polynomial kernel, in which training samples are represented in a feature space over polynomials of the original variables (also used in [Dinakar et al., 2012]); the radial basis function (RBF) kernel, which approximates multivariate functions with a single univariate function, further radialised to be used in higher dimensions; and sigmoid, i.e., the hyperbolic tangent function [Lin and Lin, 2003].

The Naı̈ve Bayes classifier is a supervised learning algorithm applying Bayes' theorem with the assumption of strong (naive) independence between pairs of features, traditionally used as a baseline in text classification tasks.

kNN, or the k-Nearest Neighbors classifier, takes as input the k closest training samples with assigned classes and classifies an input sample into a class by a majority vote. It is often applied as a baseline, next to Naı̈ve Bayes. Here, we used the k=1 setting, in which the input sample is assigned to the class of the first nearest neighbor.

JRip, also known as Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [Cohen, 1995], learns rules incrementally and further optimizes them. It is efficient in the classification of noisy text [Sasaki and Kita, 1998] and for this purpose was used in cyberbullying detection previously [Dinakar et al., 2012].

J48 is an implementation of the C4.5 decision tree algorithm [Quinlan, 1993], which first builds decision trees from a labeled dataset; each tree node selects the optimal splitting criterion, further chosen to make the decision.

Random Forest in the training phase creates multiple decision trees to output the optimal class (the mode of the classes) in the classification phase [Breiman, 2001]. An improvement of RF over standard decision trees is its ability to correct the overfitting to the training set common in decision trees [Hastie et al., 2013].

SPEC, or Sentence Pattern Extraction arChitecture [Ptaszynski et al., 2015a], is a custom feature extraction and classification system. The features are defined as ordered combinations of sentence elements and contain patterns of tokens and n-grams with disjoint elements. The way the features are extracted (a combinatorial approach) resembles brute-force search algorithms. Pattern occurrences for each side of a binary-class dataset are used to calculate normalized weights of patterns. Next, the score of a sentence is calculated as a sum of the weights of patterns found in the input sentence. With multiple modifications, such as deletion of ambiguous patterns or various kinds of dataset preprocessing, [Ptaszynski et al., 2015b] were able to optimize the method to achieve somewhat high results, and it has been considered the best performing method so far for cyberbullying detection. In comparison we used their results optimized either for F1 or for BEP (the break-even point of Precision and Recall).

CNN, or Convolutional Neural Networks, are a type of feed-forward artificial neural network, an improved neural network model compared to the multilayer perceptron. Although CNN were originally designed for image recognition, their performance has been confirmed in many tasks, including NLP [Collobert and Weston, 2008] and sentence classification [Kim, 2014]. We applied a Convolutional Neural Network implementation with Rectified Linear Units (ReLU) as the neuron activation function [Nair and Hinton, 2010], and max pooling [Scherer et al., 2010], which applies a max filter to non-overlapping sub-parts of the input to reduce dimensionality and, in effect, correct over-fitting. We also applied dropout regularization on the penultimate layer, which prevents co-adaptation of hidden units by randomly omitting (dropping out) some of the hidden units during training [Hinton et al., 2012]. We applied two versions of CNN. The first, with one hidden convolutional layer containing 100 units, was applied as a proposed baseline. The second, the final proposed method, consisted of two hidden convolutional layers, containing 20 and 100 feature maps, respectively, both layers with a 5x5 patch size and 2x2 max-pooling, and Stochastic Gradient Descent [LeCun et al., 2012].

by the number of all words in the corpus. Since in our research we use a variety of different features, not only words, we will further call this measure Feature Density (FD). After calculating FD for all used datasets, we calculated Pearson's correlation coefficient (ρ-value) to see if there is any correlation between dataset generalization (FD) and the results (F-scores).

4.3 Results and Discussion

All results are summarized in Table 4. The results of the baselines (kNN, Naı̈ve Bayes) were low, as assumed. Although these classifiers can be tuned to high scores in typical sentiment analysis, they were not able to grasp the noisy language used in cyberbullying. However, it must be noticed that, especially with the help of named entities (NER), NB performed rather well, comparably to J48 or JRip. When it comes to decision-tree-based classifiers, J48 scored low, similarly as in [Dinakar et al., 2012].

4 Evaluation Experiments

4.1 Dataset

As the dataset for the experiments we used the one created originally by [Ptaszynski et al., 2010], and also widely used by [Nitta et al., 2013; Ptaszynski et al., 2015a; 2015b; 2016]. It contains 1,490 harmful and 1,508 non-harmful entries in Japanese, collected from unofficial school Web sites and fora. The original data was provided by the Human Rights Research Institute Against All Forms for Discrimination and Racism in Mie Prefecture, Japan5. The harmful and non-harmful sentences were collected and manually labeled by Internet Patrol members (expert annotators) according to instructions included in the governmental manual for dealing with cyberbullying [MEXT, 2008]. Some of those instruc-
However, tions are explained shortly below. Random Forest usually scored better even than SPEC. Unfor- The MEXT definition assumes that cyberbullying happens tunately, RF is highly time-inefficient, especially compared when a person is personally offended on the Web. This in- to SVM, and thus impractical. cludes disclosing the person’s name, personal information In many previous research on CB detection SVM were and other areas of privacy. Therefore, as the first feature dis- most commonly used with various success. As we can ob- tinguishable for cyberbullying MEXT defines private names serve, choice of appropriate function with good preprocess- (also initials and nicknames), names of institutions and affil- ing makes SVM comparable even to the proposed CNN. The iations, private information (address, phone numbres, entries best setting was linear-SVM trained on lemmatized dataset revealing personal information, etc.) (F1=.825). Moreover, although not scoring the highest, when Moreover, literature on cyberbullying indicates vulgari- the ratio of time-performance to the results is considered, ties as one of the most distinctive features of cyberbully- SVM can be considered as the most efficient classifier6 . ing [Patchin and Hinduja, 2006; Hinduja and Patchin, 2014; As for the method of preprocessing, most often TOK+NER Ptaszynski et al., 2010]. Also according to MEXT vulgar lan- and LEM+NER scored highest. This can be explained by the guage is distinguishable for cyberbullying, due to its ability fact that the data, which was annotated by expert annotators to convey offenses against particular persons. In the prepared following official governmental definition of cyberbullying, dataset all entries containing any of the above information often contained revealing of private information. As named was classified as harmful. Some examples from the dataset entity recognition covered most of these cases, it is reasonable are represented in Table 3. 
that it helped extracting meaningful features. Only for SPEC the results for the two above settings were not available since 4.2 Experiment Setup [Ptaszynski et al., 2015b] did not apply them in their research. The preprocessed original dataset provides eleven separate The best so far method, SPEC, was in fact scoring high, datasets for the experiment see sect. 3.1 for details). Thus the second best after the proposed here CNN. SPEC was also bet- experiment was performed eleven times, one time for each ter then SVM on every dataset, except one (LEM). Although kind of preprocessing. Each of the classifiers (sect. 3.3) was SPEC is highly time inefficient in the training phase (gener- tested on each version of the dataset in a 10-fold cross vali- ation of all combinatorial patterns), it is easy to implement dation procedure The results were calculated using standard and even in its fresh-trained form can be applied without any Precision (P), Recall (R), balanced F-score (F1) and Accu- additional packages to any external media. This could be an racy (A) As for the winning condition, we looked at which advantage when including it in CB detection software, such classifier achieved highest balanced F-score. as smartphone application, etc. When it comes to the proposed method, the initial baseline- Feature Density CNN with only one hidden layer did not perform well, al- To get a better grasp on the results we also analyzed the in- though still was better than the baselines and comparable to fluence of how a dataset was preprocessed on the results. most of the classifiers. A dataset is the more generalized, the fewer number of fre- However, the final proposed method, namely, the CNN quently appearing unique features it produces. Therefore to with two hidden layers, 5x5 patch size, max-pooling and estimate dataset generalization level we applied the notion of Stochastic Gradient Descent, outperformed all of the classi- Lexical Density (LD) [Ure, 1971]. 
It is a score representing an estimated measure of content per lexical units for a given 6 Training SVM, and simple classifiers (NB, kNN) was blazing corpus, calculated as the number of all unique words divided fast (several seconds). Simple CNN, Random Forest and JRip was slower (several minutes to 1-2 hours). SPEC and CNN-2L were the 5 http://www.pref.mie.lg.jp/jinkenc/hp/ longest (about one week). 7 Table 3: Three examples of cyberbullying entries gathered during Internet Patrol. The upper three represent strong sarcasm despite of the use of positive expressions in the sentence. English translation below Japanese content. >>104 Senzuri koi te shinu nante? sonna hageshii senzuri sugee naa. ”Senzuri masutaa” toshite isshou agamete yaru yo. >>104 Dying by ’flicking the bean’? Can’t imagine how one could do it so fiercely. I’m gonna worship her as a ’master-bator’, that’s for sure. 2-nen no tsutsuji no onna meccha busu suki na hito barashimashoka? 1-nen no anoko desuyo ne? kimogatterunde yamete agete kudasai Wanna know who likes that awfuly ugly 2nd-grade Azalea girl? Its that 1st-grader isn’t it? He’s disgusting, so let’s leave him mercifully in peace. Aitsu wa busakute sega takai dake no onna, busakute se takai dake ya noni yatara otoko-zuki meccha tarashide panko anna onna owatteru She’s just tall and apart of that she’s so freakin’ ugly, and despite of that she’s such a cock-loving slut, she’s finished already. Shinde kureeee, daibu kiraware-mono de yuumei, subete ga itaitashii... Please, dieeee, you’re so famous for being disliked by everyone, everything in you is so pathetic Table 4: Results of all applied classifiers (Scores averaged for Table 5: Left part: Information on features (unique features, positive and negative prediction calculated separately; best Feature Density) for each dataset. 
Right part: Pearson Corre- classifier for each dataset in bold type fond; best dataset gen- lation Coefficient (ρ-value) of Feature Density and all classi- eralization for each classifier – underlined). fier results, with statistical significance (2-sided p-value). LEM TOK LEMTOKCHNKPOS DEP DEP CHNK LEM TOK Dataset # unique # all Feature Classifier 2-sided +POS+POS +NER +NER +NER+NER Preprocessing 1grams 1grams Density ρ-value p-value Prec .639 .630 .644 .630 .578 .544 .593 .628 .576 .668 .663 DEP 12802 13957 0.917 CNN-2L 0.685 *p=0.035 high → kNN Rec .636 .627 .640 .628 .546 .543 .529 .528 .550 .663 .659 DEP+NER 12160 13956 0.871 SVM-pol -0.431 p=0.185 Fetaure sophistication (k=1) F1 .633 .625 .637 .626 .505 .542 .446 .427 .494 .658 .656 CHUNK 11389 13960 0.816 SVM-sig -0.534 p=0.091 Acc .636 .627 .640 .628 .546 .543 .529 .528 .550 .663 .659 CHUNK+NER 10657 13872 0.768 SPEC-BEP -0.550 p=0.133 Prec .678 .671 .686 .682 .666 .570 .652 .672 .685 .708 .703 TOK+POS 6565 34874 0.188 RandForest -0.560 p=0.073 Naı̈ve Rec .674 .669 .682 .678 .627 .569 .578 .555 .598 .705 .701 TOK 6464 36234 0.178 SVM-lin -0.564 p=0.076 Bayes F1 .673 .668 .681 .677 .599 .568 .511 .453 .539 .705 .701 LEM+POS 6227 36426 0.171 SPEC-F1 -0.636 p=0.066 Acc .674 .669 .682 .678 .627 .569 .578 .555 .598 .705 .701 LEM 6103 36412 0.168 SVM-rad -0.639 *p=0.034 TOK+NER 5967 38672 0.154 CNN-1L -0.709 *p=0.019 ← low Prec .606 .614 .604 .603 .628 .553 .643 .505 .685 .699 .700 LEM+NER 5581 38672 0.144 JRip -0.729 *p=0.011 Rec .606 .613 .603 .603 .555 .553 .533 .510 .598 .675 .672 JRip POS 13 26650 0.001 NB -0.736 *p=0.013 F1 .606 .613 .603 .603 .469 .553 .408 .345 .539 .663 .658 J48 -0.791 **p=0.006 Acc .606 .613 .603 .603 .555 .553 .533 .510 .598 .675 .712 ∗p≤ 0.05, ∗∗p≤ 0.01 → kNN -0.809 **p=0.004 Prec .672 .671 .683 .675 .615 .566 .652 .260 .645 .711 .707 Rec .671 .666 .681 .672 .548 .566 .533 .510 .517 .708 .704 J48 F1 .670 .663 .680 .669 .458 .566 .408 .344 .365 .706 .707 fiers in almost all settings. 
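A minimal sketch of how Feature Density and the Pearson correlation reported in Table 5 can be computed (toy inputs, not the paper's datasets):

```python
def feature_density(features):
    """Unique features divided by all features in the corpus
    (Ure's Lexical Density, generalized to arbitrary feature types)."""
    return len(set(features)) / len(features)

def pearson(xs, ys):
    """Pearson's correlation coefficient (rho) between two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(feature_density(["N", "PP", "ADV", "N"]))  # 3 unique / 4 -> 0.75
print(pearson([1, 2, 3], [2, 4, 6]))             # perfect correlation -> 1.0
```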
The only situation where SPEC Acc .671 .666 .681 .672 .548 .566 .533 .510 .517 .708 .663 scored higher (only POS features) reveals well known char- Prec .816 .803 .818 .809 .662 .547 .623 .619 .639 .818 .809 Random Rec .809 .795 .809 .801 .632 .546 .607 .582 .580 .809 .802 acteristics of Neural Nets, which perform poorly on a small Forest F1 .808 .794 .808 .799 .610 .544 .590 .540 .522 .807 .800 number of unique features. As for the second situation, using Acc .809 .795 .809 .801 .632 .546 .607 .582 .580 .809 .802 Prec .777 .768 .827 .777 .679 .563 .651 .639 .606 .820 .781 only dependency features (DEP), which on the other hand contained the largest number of features, generation of the linear Rec .777 .766 .825 .776 .645 .563 .615 .577 .603 .818 .781 F1 .776 .766 .825 .775 .623 .563 .586 .531 .597 .818 .780 model was not feasible on the applied computer, and thus the Acc .777 .766 .825 .776 .645 .563 .615 .577 .603 .818 .781 results were not calculated. In the near future we plan to re- plynomial Prec .262 .499 .262 .263 .260 .553 .260 .260 .260 .260 .260 Rec .512 .499 .512 .513 .510 .545 .510 .510 .510 .510 .510 peat the experiment in a more efficient environment, such as a F1 .346 .450 .347 .348 .344 .528 .344 .344 .344 .344 .344 cloud computing service (Google Cloud Platform7 , Microsoft SVM Acc .512 .499 .512 .513 .510 .545 .510 .510 .510 .510 .510 Prec .797 .753 .262 .793 .260 .565 .260 .260 .260 .752 .779 Azure8 , or Amazon EC29 ). radial Rec .771 .747 .512 .756 .510 .565 .510 .510 .510 .516 .778 As for the top three best performing settings, 2-layer CNN F1 .765 .746 .347 .746 .344 .565 .344 .344 .344 .358 .778 Acc .771 .747 .512 .756 .510 .565 .510 .510 .510 .516 .778 trained on chunks alone scored high, close to 90% of F- Prec .757 .746 .262 .752 .260 .562 .260 .260 .260 .260 .771 score. Dependency features with NER was second best with sigmoid Rec .549 .736 .512 .538 .510 .562 .510 .510 .510 .510 .606 F1=92.7%. 
However, the most optimal setting was the 2-layer F1 .425 .733 .347 .403 .344 .561 .344 .344 .344 .344 .530 Acc .549 .736 .512 .538 .510 .562 .510 .510 .510 .510 .606 CNN trained on chunks with named entities and reached F- Prec .802 .786 .784 .770 .655 .614 .548 .591 .633 N/A N/A score equal 93.5%, which is a result far more satisfying then SPEC Rec .802 .786 .784 .770 .655 .614 .548 .591 .633 N/A N/A expected, and exceeds second non-NN classifier (SVM on (highest F1 .802 .786 .784 .770 .655 .614 .548 .591 .633 N/A N/A BEP) Acc .775 .780 .600 .765 .790 .635 .525 .640 .548 N/A N/A lemmas) over 10-percentage points. Prec .807 .756 .713 .724 .563 .528 .500 .491 .490 N/A N/A Next, we analyzed the correlation of data preprocessing SPEC Rec .798 .839 .885 .842 .879 .946 .982 1.000 1.000 N/A N/A (highest F1 .803 .796 .790 .778 .686 .677 .663 .658 .658 N/A N/A with Feature Density (FD). The results were represented in F1) Acc .808 .784 .770 .766 .603 .550 .510 .491 .490 N/A N/A table 5. CNN Prec .700 .678 .724 .704 .506 .544 .616 .499 .496 .720 .739 The results clearly divided the classifiers into three groups. (1 hid- Rec .700 .678 .724 .704 .510 .544 .525 .505 .509 .716 .738 den F1 .700 .677 .724 .704 .433 .544 .393 .461 .374 .714 .738 First group of the lowest performing classifiers (kNN, NB, layer) Acc .700 .678 .724 .704 .510 .544 .525 .505 .509 .716 .738 CNN Prec .867 .824 .820 .833 .936 .500 .929 N/A .899 .847 .849 (2 hid- Rec .866 .821 .819 .831 .935 .500 .927 N/A .893 .847 .849 7 https://cloud.google.com/ den F1 .866 .821 .819 .830 .935 .495 .927 N/A .892 .847 .849 8 https://azure.microsoft.com layers) Acc .866 .821 .819 .831 .935 .500 .927 N/A .893 .847 .849 9 https://aws.amazon.com/ec2/ 8 J48, JRip, CNN-1L) was strongly negatively correlated with vised bayesian model for violence detection in social me- FD, which means these classifiers lose their general perfor- dia. 2013. mance the more feature-dense is the model. This suggests [Cohen, 1995] William W Cohen. 
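The ρ-values in Table 5 are Pearson correlation coefficients between Feature Density and each classifier's results across the preprocessing settings. The coefficient itself can be sketched in a few lines of pure Python (the numbers below are illustrative toy values, not the paper's data):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy values: F-scores that tend to rise together with Feature Density
# yield a positive rho, as observed for CNN-2L.
fd = [0.144, 0.168, 0.178, 0.768, 0.816, 0.917]
f1 = [0.847, 0.866, 0.821, 0.892, 0.935, 0.927]
print(round(pearson_r(fd, f1), 3))
```

The two-sided p-values reported in Table 5 additionally require the t-distribution; in practice scipy.stats.pearsonr returns both the coefficient and the p-value.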
The second group contained the classifiers (all SVMs, Random Forest and both SPEC) that performed somewhat high. Their correlation with FD was negative, from weak (-0.431) to somewhat high (-0.639). This, supported by the lack of statistical significance, means that FD does not correlate well with such classifiers, and some other characteristic should be used for optimizing the dataset preprocessing applied with those classifiers.

Finally, we made an interesting discovery about the correlation between FD and our proposed method (2-layer CNN). The classifier correlated positively, nearly strongly, with FD (ρ = 0.685). This suggests that its performance could be improved by increasing the Feature Density of the applied dataset. We plan to follow this path in the near future.

5 Conclusions and Future Work

In this paper we presented our research on cyberbullying (CB) detection. Cyberbullying has become a serious problem in modern society, which is constantly connected to the Internet. Manual countermeasures, such as Internet Patrol, have been undertaken to deal with CB. Unfortunately, reading through the whole Internet to find CB entries is like looking for a needle in a haystack, while keeping CB victims exposed to harmful messages leads to serious consequences.

To help respond quickly to the ever-growing CB problem, research on automatic cyberbullying detection has started to sprout; unfortunately, the results have been only partially satisfying. We proposed a Deep Learning approach to the problem, based on Convolutional Neural Networks (CNN).

The proposed optimized CNN model not only outperformed all other classifiers by over 11 percentage points, scoring a close-to-ideal F-score (93.5%), but also revealed an unusual characteristic by correlating positively and nearly strongly with Feature Density. This provides an informative hint on how to further improve not only the proposed method (by increasing the FD of the dataset), but also the other classifiers (by decreasing FD, etc.).
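For illustration, the core operation of such a text CNN, namely a convolution filter slid over a sentence's token embeddings followed by ReLU and max-over-time pooling (as in [Kim, 2014]), can be sketched in pure Python; the dimensions and values below are made-up toys, not the paper's actual architecture:

```python
def conv_feature(embeddings, filt, bias=0.0):
    """Apply one convolution filter (a window of len(filt) token vectors)
    along the sentence, then ReLU and max-over-time pooling."""
    w = len(filt)  # filter window size, in tokens
    feats = []
    for i in range(len(embeddings) - w + 1):
        window = embeddings[i:i + w]
        act = bias + sum(f * e
                         for fv, ev in zip(filt, window)
                         for f, e in zip(fv, ev))
        feats.append(max(0.0, act))  # ReLU
    return max(feats)                # max-over-time pooling: one scalar per filter

# Toy sentence of 4 tokens with 2-dimensional embeddings; one filter of width 2.
sent = [[0.1, 0.3], [0.5, -0.2], [0.9, 0.4], [-0.1, 0.2]]
filt = [[1.0, 0.0], [0.0, 1.0]]
print(round(conv_feature(sent, filt), 3))  # -> 1.1
```

A full classifier would apply many such filters, concatenate the pooled features, and feed them through one or two dense hidden layers into a softmax output.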
In the near future we plan to test the limits of potential optimization, also by applying different dataset preprocessing methods (sentiment, etc.) and different language models (n-gram, skip-gram, language combinatorics, etc.). We also plan to implement the developed model in a smartphone application for "in-the-field" testing and further practical research on cyberbullying and ways of its mitigation.

References

[Breiman, 2001] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[Bull, 2010] Glen Bull. The always-connected generation. Learning & Leading with Technology, 38(3):28–29, 2010.
[Cano Basave et al., 2013] Amparo Elizabeth Cano Basave, Yulan He, Kang Liu, and Jun Zhao. A weakly supervised bayesian model for violence detection in social media. 2013.
[Cohen, 1995] William W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.
[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[Cross et al., 2009] Donna Cross, Therese Shaw, Lydia Hearn, Melanie Epstein, Helen Monks, Leanne Lester, and Laura Thomas. Australian covert bullying prevalence study. 2009.
[Dinakar et al., 2012] Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):18, 2012.
[Hasebrink et al., 2008] Uwe Hasebrink, Sonia Livingstone, and Leslie Haddon. Comparing children's online opportunities and risks across Europe: Cross-national comparisons for EU Kids Online. 2008.
[Hastie et al., 2013] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. 2013.
[Hinduja and Patchin, 2010] Sameer Hinduja and Justin W. Patchin. Bullying, cyberbullying, and suicide. Archives of Suicide Research, 14(3):206–221, 2010.
[Hinduja and Patchin, 2014] Sameer Hinduja and Justin W. Patchin. Bullying Beyond the Schoolyard: Preventing and Responding to Cyberbullying. Corwin Press, 2014.
[Hinton et al., 2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
[Kontostathis et al., 2013] April Kontostathis, Kelly Reynolds, Andy Garron, and Lynne Edwards. Detecting cyber bullying: query terms and techniques. In Proceedings of the 5th Annual ACM Web Science Conference, pages 195–204. ACM, 2013.
[Kowalski and Limber, 2007] Robin M. Kowalski and Susan P. Limber. Electronic bullying among middle school students. Journal of Adolescent Health, 41(6):S22–S30, 2007.
[Kudo and Matsumoto, 2002] Taku Kudo and Yuji Matsumoto. Japanese dependency analysis using cascaded chunking. In CoNLL 2002: Proceedings of the 6th Conference on Natural Language Learning 2002 (COLING 2002 Post-Conference Workshops), pages 63–69, 2002.
[Lazuras et al., 2012] Lambros Lazuras, Jacek Pyżalski, Vassilis Barkoukis, and Haralambos Tsorbatzoudis. Empathy and moral disengagement in adolescent cyberbullying: Implications for educational intervention and pedagogical practice. 2012.
[LeCun et al., 2012] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
[Lin and Lin, 2003] Hsuan-Tien Lin and Chih-Jen Lin. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Submitted to Neural Computation, pages 1–32, 2003.
[MEXT, 2008] MEXT. 'Netto-jō no ijime' ni kansuru taiō manyuaru jirei shū (gakkō, kyōin muke) ["Bullying on the net" manual for handling and collection of cases (for schools and teachers)] (in Japanese). 2008.
[Mori and Neubig, 2014] Shinsuke Mori and Graham Neubig. Language resource addition: Dictionary or corpus? In LREC, pages 1631–1636, 2014.
[Nair and Hinton, 2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[Nitta et al., 2013] Taisei Nitta, Fumito Masui, Michal Ptaszynski, Yasutomo Kimura, Rafal Rzepka, and Kenji Araki. Detecting cyberbullying entries on informal school websites based on category relevance maximization. In IJCNLP, pages 579–586, 2013.
[Patchin and Hinduja, 2006] J. W. Patchin and S. Hinduja. Bullies move beyond the schoolyard: A preliminary look at cyberbullying. Youth Violence and Juvenile Justice, 4(2):148–169, 2006.
[Ptaszynski et al., 2010] M. Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka, K. Araki, and Y. Momouchi. In the service of online order: Tackling cyber-bullying with machine learning and affect analysis. International Journal of Computational Linguistics Research, 1(3):135–154, 2010.
[Ptaszynski et al., 2015a] Michal Ptaszynski, Fumito Masui, Yasutomo Kimura, Rafal Rzepka, and Kenji Araki. Brute force works best against bullying. In IJCAI 2015 Workshop on Intelligent Personalization (IP 2015), pages 28–29, 2015.
[Ptaszynski et al., 2015b] Michal Ptaszynski, Fumito Masui, Yasutomo Kimura, Rafal Rzepka, and Kenji Araki. Extracting patterns of harmful expressions for cyberbullying detection. In Proceedings of the 7th Language and Technology Conference (LTC'15), pages 370–375, 2015.
[Ptaszynski et al., 2016] M. Ptaszynski, F. Masui, T. Nitta, S. Hatakeyama, Y. Kimura, R. Rzepka, and K. Araki. Sustainable cyberbullying detection with category-maximized relevance of harmful phrases and double-filtered automatic optimization. International Journal of Child-Computer Interaction, 8:15–30, 2016.
[Pyżalski, 2012] Jacek Pyżalski. From cyberbullying to electronic aggression: Typology of the phenomenon. Emotional and Behavioural Difficulties, 17(3-4):305–317, 2012.
[Quinlan, 1993] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann Publishers, 1993.
[Rutter, 1987] D. R. Rutter. Communicating by Telephone. International Series in Experimental Social Psychology, Vol. 15. Elsevier Science Limited, 1987.
[Sarna and Bhatia, 2015] Geetika Sarna and M. P. S. Bhatia. Content based approach to find the credibility of user in social networks: an application of cyberbullying. International Journal of Machine Learning and Cybernetics, pages 1–13, 2015.
[Sasaki and Kita, 1998] Minoru Sasaki and Kenji Kita. Rule-based text categorization using hierarchical categories. In Systems, Man, and Cybernetics, 1998 IEEE International Conference on, volume 3, pages 2827–2830. IEEE, 1998.
[Scherer et al., 2010] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In International Conference on Artificial Neural Networks, pages 92–101. Springer, 2010.
[Sood et al., 2012] Sara Owsley Sood, Elizabeth F. Churchill, and Judd Antin. Automatic identification of personal insults on social news sites. Journal of the American Society for Information Science and Technology, 63(2):270–285, 2012.
[Sourander et al., 2010] A. Sourander, A. B. Klomek, M. Ikonen, J. Lindroos, T. Luntamo, M. Koskelainen, T. Ristkari, and H. Helenius. Psychosocial risk factors associated with cyberbullying among adolescents: A population-based study. Archives of General Psychiatry, 67(7):720–728, 2010.
[Turney, 2002] Peter D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL 2002, pages 417–424. Association for Computational Linguistics, 2002.
[Ure, 1971] J. Ure. Lexical density and register differentiation. Applications of Linguistics, pages 443–452, 1971.
[Zheng, 2012] R. Zheng. Evolving Psychological and Educational Perspectives on Cyber Behavior. Premier Reference Source. Information Science Reference, 2012.