=Paper=
{{Paper
|id=Vol-2935/paper1
|storemode=property
|title=Exploring the Potential of Feature Density in Estimating Machine Learning Classifier Performance with Application to Cyberbullying Detection
|pdfUrl=https://ceur-ws.org/Vol-2935/paper1.pdf
|volume=Vol-2935
|authors=Juuso Eronen,Michal Ptaszynski,Fumito Masui,Gniewosz Leliwa,Michal Wroczynski
}}
==Exploring the Potential of Feature Density in Estimating Machine Learning Classifier Performance with Application to Cyberbullying Detection==
<pdf width="1500px">https://ceur-ws.org/Vol-2935/paper1.pdf</pdf>
<pre>
       Exploring the Potential of Feature Density in Estimating Machine Learning
          Classifier Performance with Application to Cyberbullying Detection
          Juuso Eronen1 , Michal Ptaszynski1 , Fumito Masui1 , Gniewosz Leliwa2 and Michal
                                              Wroczynski2
                                 1
                                   Kitami Institute of Technology, Japan
                                         2
                                           Samurai Labs, Poland
            eronen.juuso@gmail.com, {ptaszynski, f-masui}@cs.kitami-it.ac.jp, {gniewosz.leliwa,
                                  michal.wroczynski}@samurailabs.ai

                            Abstract                                      the classifier performance this way, it is multiple times more
                                                                          costly.
       In this research, we analyze the potential of Feature                 Previously, there have been some attempts to estimate the
       Density (FD) as a way to comparatively estimate                    performance of a ML model before any training. One pro-
       machine learning (ML) classifier performance prior                 posal to this problem is using meta-learning and training a
       to training. The goal of the study is to aid in solving            model using dataset characteristics to estimate classifier per-
       the problem of resource-intensive training of ML                   formance [Gama and Brazdil, 1995]. Another approach is
       models which is becoming a serious issue due to                    extrapolating results from small datasets to simulate the per-
       continuously increasing dataset sizes and the ever                 formance using larger datasets [Basavanhally et al., 2010].
       rising popularity of Deep Neural Networks (DNN).                      The importance of resolving this issue comes not only from
       The issue of constantly increasing demands for                     the increased computational requirements, but also from its
       more powerful computational resources is also af-                  environmental effect. This is directly caused by the increased
       fecting the environment, as training large-scale ML                popularity of the fields of Artificial Intelligence (AI) and ML.
       models are causing alarmingly-growing amounts of                   Training classifiers on large datasets is both time consuming
       CO2 emissions. Our approach is to optimize the                     and computationally intensive while leaving behind a notice-
       resource-intensive training of ML models for Nat-                  able carbon footprint [Strubell et al., 2019]. To move towards
       ural Language Processing to reduce the number of                   greener AI [Schwartz et al., 2019], it is necessary to inspect
       required experiments iterations. We expand on pre-                 the core of ML methods and find potential points of improve-
       vious attempts on improving classifier training ef-                ment. In order to save computational power and reduce emis-
       ficiency with FD while also providing an insight                   sions, it would be useful to roughly estimate classifier perfor-
       to the effectiveness of various linguistically-backed              mance prior to training.
       feature preprocessing methods for dialog classifica-                  The ability to estimate classifier performance before the
       tion, specifically cyberbullying detection.                        training would also have important practical implications. In
                                                                          dialog agent applications, one of the areas where the need
                                                                          for this is becoming more urgent is in forum moderation,
1     Introduction                                                        specifically the detection of harmful and abusive behaviour
One of the challenges in machine learning (ML) has always                 observed online, known as cyberbullying (CB). The number
been estimating how well different classification algorithms              of CB cases has been constantly growing since the increase
will perform with a given dataset. Although there are classi-             of the popularity of Social Networking Services (SNS) [Hin-
fiers that tend to be highly effective on a variety of different          duja and Patchin, 2010; Ptaszynski and Masui, 2018]. The
problems, they might be easily outperformed by others on a                consequences of unattended cases of online abuse are known
dataset specific scale. As it is difficult to identify a classifier       to be serious, leading the victims to self mutilation, or even
that would perform best with every kind of dataset [Michie et             suicides, or on the opposite, to attacking their offenders in
al., 1995], it comes down to the user (researcher, or ML prac-            revenge. Being able to roughly estimate which classifier set-
titioner) to determine experimentally, which classifier could             tings can be rejected, would make the process of implemen-
be appropriate based on their knowledge of the field and pre-             tation of automatic cyberbullying detection for various lan-
vious experiences.                                                        guages and social networking platforms more efficient.
    A common way when estimating the performance of dif-                     To contribute to that, we conduct an in-depth analysis of
ferent classifiers is to select a variety of possible classifiers         the effectiveness of FD proposed previously by [Ptaszynski et
to experiment on and train them using cross-validation to                 al., 2017] to comparatively estimate the performance of dif-
aid in getting the best possible average estimations of their             ferent classifiers before training. We also analyze the effec-
performances. With a sufficiently small dataset and using a               tiveness of various linguistically-backed feature preprocess-
computationally efficient algorithm, this approach works very             ing methods, including lemmas, Named Entity Recognition
well. Even though it is possible to get accurate estimations of           (NER) and dependency information-based features, with an


    Copyright © 2021 for this paper by its authors. Use
    permitted under Creative Commons License Attribution
    4.0 International (CC BY 4.0).                                    5
application to automatic cyberbullying detection.                     by the number of all words in the corpus. The score is called
                                                                      Feature Density as it also includes other features, like parts-
2     Previous Research                                               of-speech or dependency information, in addition to words.
                                                                         In this research, after calculating FD for all applied dataset
2.1    Classifier Performance Estimation                              preprocessing methods we calculated Pearson’s correlation
[Gama and Brazdil, 1995] proposed that classifier perfor-             coefficient (ρ-value) between dataset generalization (FD) and
mance could be estimated by training a regression model               classifier results (F-scores). If ideal ranges of FD can be
based on meta-level characteristics of a dataset. The charac-         identified, or FD has a positive or negative correlation with
teristics used included simple measures like number of exam-          classifier performance, it could be useful in comparatively
ples and number of attributes, statistical measures like stan-        estimating the performance of various classifiers. For ex-
dard deviation ratio and various information based measures           ample, [Ptaszynski et al., 2017] showed that CNNs benefit
like class entropy. These measures are defined in the STAT-           from higher FD while other classifiers’ score was usually
LOG project [King et al., 1995].                                      higher when using lower FD datasets. This suggests that it
   This meta-learning approach was taken further by [Ben-             could be possible to improve the performance of CNNs by
susan and Kalousis, 2001] who introduced the Landmark-                increasing the FD of the applied dataset, while other classi-
ing method, using learners themselves to characterize the             fiers could achieve higher scores by lowering FD [Ptaszynski
datasets. This means using computationally non-demanding              et al., 2017].
classifiers, like Naive Bayes (NB), to obtain important in-              In practice, we attempt to estimate what feature engineer-
sights about the datasets. The method outperformed the pre-           ing methods can achieve the highest performance for differ-
vious characterization method and had moderate success in             ent models in different languages. The method lets us ignore
ranking learners.                                                     redundant feature sets for a particular classifier or language
   Later, [Blachnik, 2017] improved on the Landmarking                and only keep the ones with the highest performance poten-
method by proposing the use of information from instance              tial without actually training any models.
selection methods as landmarks. These instance selection
methods are most commonly used for cleaning the dataset               2.3    Linguistically-backed Preprocessing
reducing it size by removing redundant information. They              Almost without exception, the word embeddings are learned
discovered that the relation between the original and reduced         from pure tokens (words) or lemmas (unconjugated forms of
datasets can be used as a landmark to lower the error rates           words). This is also the case with the recently popularized
when predicting classifier performance.                               pre-trained language models like BERT [Devlin et al., 2018].
   Another approach to predicting classifier performance is to        To the best of our knowledge, embeddings backed with lin-
extrapolate results from a smaller dataset to simulate the per-       guistic information have not yet been researched extensively,
formance of a larger dataset. [Basavanhally et al., 2010] at-         with only a handful of related work attempting to explore the
tempted to predict classifier performance in the field of com-        subject [Levy and Goldberg, 2014; Komninos and Manand-
puter aided diagnostics, where data is very often limited in          har, 2016; Cotterell and Schütze, 2019].
quantity. Their experiments showed that using a repeated ran-            To further investigate the potential of capturing deeper re-
dom sampling method on small datasets to make predictions             lations between lexical items and structures and to filter out
on a larger set tended to have high error rates and should not        redundant information, we propose to preserve the morpho-
be generalized as holding true when large amounts of data be-         logical, syntactic and other types of information by adding
come available. Later, [Basavanhally et al., 2015] improved           linguistic information to the pure tokens or lemmas. This
this method by utilizing it together with cross-validation sam-       means, for example, including parts-of-speech or dependency
pling strategy, which resulted in lower error rates.                  information within the used lexical features. These combina-
   In the field of NLP, [Johnson et al., 2018] applied the ex-        tions would then be used to train the word embeddings. The
trapolation method to document classification using the fast-         method could be later applied to the pre-training of huge lan-
Text classifier. They discovered that biased power law model          guage models to possibly improve their performance. The
with binomial weights works as a good baseline extrapolation          preprocessing methods are described in-depth in section 3.2.
model for NLP tasks.
   Instead of concentrating on meta information of the dataset        3     Dataset and Learners
or performance simulation, our research directly targets fea-
ture engineering and the relation between the available fea-          3.1    Dataset
ture space and classifier performance. This novel method that         We tested the concept of FD on the Kaggle Formspring
can be utilized together with the existing methods to better          Dataset for Cyberbullying Detection [Reynolds et al., 2011].
estimate the performance of different classifiers.                    However, the original dataset had a problem of being anno-
                                                                      tated by laypeople, whereas it has been pointed out before
2.2    Feature Density                                                that datasets for topics such as online harassment and cy-
The concept of Feature Density (FD) was introduced by                 berbullying should be annotated by experts [Ptaszynski and
[Ptaszynski et al., 2017] based on the notion of Lexical Den-         Masui, 2018]. Therefore in our research we applied the ver-
sity [Ure, 1971] from linguistics. It is a score representing         sion of the dataset after re-annotation with the help of highly
an estimated measure of content per lexical units for a given         trained data annotators with sufficient psychological back-
corpus, calculated as the number of all unique words divided          ground to assure high quality of annotations [Ptaszynski et


                                                                  6
      Table 1: Statistics of the dataset after improved annotation.          • Lemmatization: like the above but with generic (dictio-
                                                                               nary) forms of words (“lemmas”) (later: LEM).
 Element type                                       Value                    • Parts of speech (separate): parts of speech information
 Number of samples                                 12,772                      is added in the form of separate features (later: POSS).
 Number of CB samples                                 913                    • Parts of speech (combined): parts of speech informa-
 Number of non-CB samples                          11,859                      tion is merged with other applied features (later: POS).
 Number of all tokens                             301,198                    • Named Entity Recognition (without replacement):
 Number of unique tokens                           18,394                      information on what named entities (private name of a
 Avg. length (chars) of a post (Q+A)                 12.1                      person, organization, numericals, etc.) appear in the sen-
 Avg. length (words) of a post (Q+A)                 23.6                      tence are added to the applied word (later: NER).
 Avg. length (chars) of a question                   61.6                    • Named Entity Recognition (with replacement): same
 Avg. length (words) of a question                     12                      as above but information replaces the applied word
 Avg. length (chars) of a answer                     58.5                      (later: NERR).
 Avg. length (words) of a answer                     11.5                    • Dependency structure: noun- and verb-phrases with
 Avg. length (chars) of a CB post                    12.1                      syntactic relations between them (later: DEP).
 Avg. length (words) of a CB post                    22.9                    • Chunking: like above but without dependency relations
 Avg. length (chars) of a non-CB post                13.9                      (“chunks”, later: CHNK).
 Avg. length (words) of a non-CB post                24.7                    • Stopword filtering: redundant words are filtered out us-
                                                                               ing spaCy’s stopword list for English (later: STOP)
                                                                             • Filtering of non-alphabetics: non-alphabetic charac-
al., 2018]. Cyberbullying is a phenomenon observed in many                     ters are filtered out (later: ALPHA)
SNS. It is defined as using online means of communication to
harass and/or humiliate individuals. This can include slurry              3.3   Feature Extraction
comments about someone’s looks or personality or spreading                We generated a Bag-of-Words language model from each
sensitive or false information about individuals. This prob-              of the 68 processed dataset versions. This resulted in sepa-
lem has existed throughout the time of communication via                  rate models for each of the datasets (Bag-of-Words, Bag-of-
Internet between people but has grown extensively with the                Lemmas, Bag-of-POS, etc.). Next, we applied a weighting
advent of communication devices that can be used on-the-                  scheme, term frequency with inverse document frequency or
go like smartphones and tablets. Users’ realization of the                tf ∗ idf .
anonymity of online communications is one of the factors that                When training a Convolutional Neural Network model, the
make this activity attractive for bullies since they rarely face          embeddings were trained as a part of the network for all of the
consequences of their improper behavior [Bull, 2010]. The                 described datasets. Similarly to other classifiers, we trained a
problem has been growing with the popularity of SNS.                      separate model for each of the 68 datasets (Word/token Em-
   Table 1 reports some key statistics of the current annota-             beddings, Lemmas Embeddings, POS Embeddings, Chunks
tion of the dataset. The dataset contains approximately 300               Embeddings, etc.). The embeddings were trained as part of
thousand of tokens. There were no visible differences in                  the network using Keras’ embedding layer with random ini-
length between the posted questions and answers (approx. 12               tial weights, meaning no pretraining was used.
words). On the other hand, the harmful (CB) samples were
usually slightly shorter than the non-harmful (non-CB) sam-               3.4   Classification
ples (approx. 23 vs. 25 words). The number of harmful sam-
ples was small, amounting to 7%, which roughly reflects the               We used two variants of Support Vector Machine [Cortes and
amount of profanity on SNS [Ptaszynski and Masui, 2018].                  Vapnik, 1995], linear-SVM and linear-SVM with SGD op-
                                                                          timizer. We also used two different solvers for Logistic Re-
3.2     Preprocessing                                                     gression (LR), Newton and L-BFGS. We also used both Ad-
                                                                          aBoost [Freund and Schapire, 1997] and XGBoost [Chen and
In order to train the linguistically-backed embeddings, we                Guestrin, 2016]. Other classifiers applied include Random
first preprocessed the dataset in various ways, similarly to              Forest [Breiman, 2001], kNN, Naı̈veBayes, Multilayer Per-
[Ptaszynski et al., 2017]. This was done to verify the cor-               ceptron (MLP) And Convolutional Neural Network (CNN).
relation between the classification results and Feature Den-                 In this experiment MLP refers to a network using regu-
sity (FD) and to verify the performance of various versions of            lar dense layers. We applied an MLP implementation with
the proposed linguistically-backed embeddings. The prepro-                Rectified Linear Units (ReLU) as a neuron activation func-
cessing was done using spaCy NLP toolkit (https://spacy.io/).             tion [Hinton et al., 2012] and one hidden layer with dropout
After assembling combinations from the listed preprocessing               regularization which reduces overfitting and improves gener-
types, we ended up with a total of 68 possible preprocessing              alization by randomly dropping out some of the hidden units
methods for the experiments. The FDs for all separate pre-                during training [Hinton et al., 2012].
processing types used in this research were shown in Table                   We applied a CNN implementation with Rectified Linear
2.                                                                        Units (ReLU) as a neuron activation function, and max pool-
    • Tokenization: includes words, punctuation marks, etc.               ing [Scherer et al., 2010], which applies a max filter to non-
      separated by spaces (later: TOK).                                   overlying sub-parts of the input to reduce dimensionality and


                                                                      7
                                                               Table 3: Classifiers with best F1, preprocessing type and Pearson’s
                                                               correlation coefficient for FD and F1.

   Table 2: Feature Density of preprocessing types.                  Classifier     Best F1           Best PP type   ρ(F1, FD)
                                                                     SGD SVM           .798             TOKPOS          -.8239
Preprocessing type      Uniq.1grams   All1grams       FD             MLP             .7958                 TOK          -.8599
                                                                     Linear SVM      .7941          TOKPOSSTOP            -.834
POS                              18     357616    .0001              L-BFGS LR       .7932             TOKSTOP          -.8024
POSALPHA                         18     357616    .0001              Newton LR       .7915          TOKNERSTOP          -.8097
POSSTOP                          18     194606    .0001              RandomForest    .7582             TOKSTOP          -.7873
POSSTOPALPHA                     17     129076    .0001              XGBoost         .7523             LEMSTOP          -.8303
LEMPOSSALPHA                  17875     579664    .0308              CNN1            .7406             DEPSTOP           .1633
LEMPOSS                       21238     660653    .0321              CNN2            .7357             LEMPOSS           .0951
TOKPOSSALPHA                  21737     579624    .0375              AdaBoost        .7356             TOKSTOP          -.7362
TOKPOSS                       25122     660612      .038             NaiveBayes      .7165                 TOK          -.7531
LEMNERALPHA                   14815     289868    .0511              KNN             .6711    TOKPOSSSTOPALPHA          -.7116
LEMNERR                       17327     309124    .0561
CHNKNERRALPHA                 12293     215096    .0572
LEMNERRALPHA                  17877     305481    .0585
CHNKNERALPHA                  14007     228146    .0614        in effect correct overfitting. We also applied dropout regu-
LEMALPHA                      17860     289868    .0616        larization on penultimate layer. We applied two versions of
LEMPOSSSTOP                   20948     334870    .0626
TOKNERRALPHA                  18595     289828    .0642        CNN. First, with one hidden convolutional layer containing
CHNKALPHA                     13991     215096      .065       128 units. The second version consisted of two hidden con-
LEMNER                        21239     325173    .0653
LEMPOSSSTOPALPHA              17554     258103     .068
                                                               volutional layers, containing 128 feature maps each, with 4x4
TOKNERR                       21119     309084    .0683        size of patch and 2x2 max-pooling, and Adaptive Moment
LEM                           21222     308434    .0688        Estimation (Adam), a variant of Stochastic Gradient Descent
TOKNERALPHA                   21737     305441    .0712
TOKPOSSSTOP                   24472     334869    .0731
                                                               [LeCun et al., 2012].
LEMPOS                        26232     357657    .0733
TOKALPHA                      21722     289828    .0749
LEMPOSALPHA                   22206     289868    .0766        4     Experiments
TOKNER                        25121     325132    .0773
TOK                           25106     308393    .0814        4.1    Setup
TOKPOSSSTOPALPHA              21037     258103    .0815
TOKPOS                        31121     357616      .087       The preprocessed dataset provides 68 separate datasets and
TOKPOSALPHA                   27013     289828    .0932        the experiment was performed once for each preprocessing
LEMNERRSTOPALPHA              14509     129076    .1124
LEMNERRSTOP                   17047     146549    .1163
                                                               type. Each of the classifiers (sect. 3.4) were tested on each
LEMNERSTOPALPHA               17557     142289    .1234        version of the dataset in a 10-fold cross validation proce-
CHNKNERR                      33025     262529    .1258        dure. This gives us an opportunity to evaluate how effec-
LEMNERRSTOPALPHA              20950     160269    .1307
LEMPOSSTOP                    25669     194674    .1319        tive different preprocessing methods are for each classifier.
LEMSTOPALPHA                  17540     129076    .1359        As the dataset was not balanced, we oversampled the minor-
TOKNERRSTOPALPHA              17911     129076    .1387        ity class using Synthetic Minority Over-sampling Technique
CHNKNER                       38044     272581    .1396
TOKNERRSTOP                   20480     146549    .1397        (SMOTE) [Chawla et al., 2002]. The preprocessing methods
LEMSTOP                       20933     145866    .1435        represent a wide range of Feature Densities, which can be
CHNKNERSTOPALPHA              13356      92782     .144
CHNKNERRSTOPALPHA             11656      80896    .1441
                                                               used to evaluate the correlation with classifier performance.
CHNK                          38029     261990    .1452
TOKNERSTOPALPHA               21037     142289    .1478        4.2    Effect of Feature Density
TOKNERSTOP                    24471     160268    .1527
TOKPOSSTOP                    30040     194673    .1543        We analyzed the correlation of Feature Density with each
TOKSTOPALPHA                  21022     129076    .1629        of the classifiers using the proposed preprocessing methods.
CHNKSTOPALPHA                 13340      80896    .1649
LEMPOSSTOPALPHA               21626     129076    .1675        The results are represented in Table 5. As the results for us-
TOKSTOP                       24456     145865    .1677        ing only parts-of-speech tags, which had the lowest FD by
TOKPOSSTOPALPHA               25925     129076    .2009
CHNKNERRSTOP                  32452     126357    .2568
                                                               far, were extremely low (close to a coinflip). Thus, we can
CHNKNERSTOP                   37462     135357    .2768        say that POS tags alone do not contain enough information to
CHNKSTOP                      37447     125824    .2976        successfully classify the entries.
DEPNERALPHA                   95404     240302      .397
DEPNERRALPHA                  94928     215096    .4413           After excluding the preprocessing methods that only used
DEPALPHA                      95386     215096    .4435        POS tags, we can see that all classifiers, except CNNs have
DEPNER                       143197     321835    .4449
DEPNERSTOPALPHA               47159     104940    .4494
                                                               a strong negative correlation with Feature Density. So these
DEPNERR                      141479     308704    .4583        classifiers seem to have a weaker performance if a lot of lin-
DEP                          143179     308704    .4638        guistic information is added, and the best results being usu-
DEPNERSTOP                    94539     184130    .5134
DEPNERRSTOP                   92730     172086    .5389        ally within the range of .05 to .15 FD depending on the clas-
DEPSTOP                       94521     172086    .5493        sifier. This range includes 38 of the 68 preprocessing meth-
DEPNERRSTOPALPHA              46552      80896    .5755        ods (Table 2), meaning that the total training time could be
DEPSTOPALPHA                  47141      80896    .5827
                                                               reduced by around 40-50%. This can be seen from, for exam-
                                                               ple, the highest performing classifier, SVM with SGD opti-
                                                               mizer (Figure 1), where the maximum classifier performance
                                                               starts high at around .05 and slowly falls until .14 after which


                                                           8
there is a noticeable drop. The performance only falls further          and thus it should be only used as a supplement to other meth-
as the FD rises.                                                        ods.
   For CNNs however, there was a very weak positive or no                  Stopword filtering seemed to be the one of the most effec-
correlation between FD and the classifier performance, with             tive preprocessing techniques for traditional classifiers, which
the higher FD datasets performing equally or even slightly              can be seen from Table 3 as it was used in the majority of
better when comparing to the low FD datasets. Taking a look             the highest scores. The problem with stopwords was that
at one layer CNN’s performance, which was better than the               the scores fluctuated a lot, having both low and very high
CNN with two layers, we can see from Figure 1 that the max-             scores and scoring high mostly with Logistic Regression and
imum performance starts at a moderate level and stays more              all of the tree based classifiers. An important thing to note is
stable throughout the whole range of feature densities. The             that the preprocessing method had extremely polarized per-
most potential ranges of FD are between .05 to .1 and after             formance with CNNs, scoring either very high or low. Over-
.45. The potential training time reduction seems to be similar,         all, stopwords yielded the most top scores of any preprocess-
around 40-50% The reduction in training time could be es-               ing method considering all the classifiers.
pecially important when considering demanding models like                  Another very effective preprocessing method was Parts of
Neural Networks.                                                        Speech merging (POS), which achieved high performance
   The results suggest that for non-CNN classifiers there is            overall when added to TOK or LEM. The method also got
no need to consider preprocessings with a high FD, such as              the highest scores with multiple classifiers, especially SVMs.
chunking or dependencies, as they had a considerably lower              Adding parts-of-speech information to the respective words
performance. The performance seems to start falling rapidly             achieved a higher score than using them as a separate feature.
at around F D = .15 with most of the classifiers. For CNNs,             This keeps the information directly connected to the word it-
high performances were recorded on both low and high FDs.               self, which seems a better option when preserving informa-
This means that there is potential in the higher FD prepro-             tion.
cessing types, namely, dependencies for CNNs.                              Using Named Entity Recognition reduced the classifier
   The reason for CNNs relatively low performance could be              performance most of the time, only achieving a high score
explained by the relatively small size of the dataset, especially       with one classifier, Newton-LR. The performance of using
when considering the amount of actual cyberbullying entries,            NER seemed clearly inferior compared to stopwords or POS
as adding even a second layer to the network already caused             information. Replacing words with their NER information
a loss of the most valuable features and ended up degrad-               seems to cause too much information loss and reduces the
ing performance. With such small amount of data, it doesn’t             performance when comparing to plain tokens. Attaching
seem useful to train deep learning models to solve the clas-            NER information to the respective words did not improve the
sification problem. Still, the dependency based features are            performance in most cases but still performed better than re-
showing potential with CNNs. With a considerably larger                 placement. These results are different to [Ptaszynski et al.,
dataset and more computational power, it could be possible              2017], who noticed that NER helped most of the times for
to outperform other classifiers and the usage of tokens with            cyberbullying (CB) detection in Japanese. This could come
dependency based features when using deep learning.                     from the fact that CB is differently realized in those lan-
   The experiments show that changing Feature Density in                guages. In Japan, revealing victim’s personal information,
moderate amount can yield good results when using other                 or “doxxing” is known to be one of the most often used form
classifiers than CNNs. However, excessive changes to ei-                of bullying, thus NER, which can pin-point information such
ther too low or too high always showed diminishing results.             as address or phone number often help in classification, while
The treshold was in all cases approximately between 50%                 this is not the case in English.
and 200% of the original density (TOK), most optimal FDs                   Filtering out non-alphabetic characters also reduced the
only slightly varying with each classifier. The exception be-           classifier performance most of the time and also got a high
ing Random Forest [Breiman, 2001], which showed a clear                 score with only one classifier, kNN, which was the weakest
spike at around .12 FD. As the usage of high Feature Density            classifier overall. Non-alphabetic tokens seem to carry useful
datasets showed potential with CNNs, their usage needs to               information, at least in the context of cyberbullying detec-
be confirmed in future research. Also, more exact ideal fea-            tion, as removing them reduced the performance comparing
ture densities need to be confirmed for each classifier using           to plain tokens due to information loss.
datasets of different sizes and fields to make a more accurate             Trying to generalize the feature set ended up lowering the
ranking of classifiers by FD possible.                                  results in most cases with the exception of the very high
                                                                        scores of stopword filtering using traditional classifiers. This
4.3   Analysis of Linguistically-backed                                 would mean that the stopword filter sometimes succeeded
      Preprocessing                                                     in removing noise and outliers from the dataset while other
From the results it can be seen that most of the classifiers            generalization methods ended up cutting useful information.
scored highest on pure tokens. CNNs also performed quite                Adding information to tokens could be useful in some scenar-
well on the dependency-based preprocessings. Using lemmas               ios as was shown with parts-of-speech tags and using depen-
usually got slightly lower scores than tokens probably due to           dency information with CNNs, although using NER was not
information loss. Chunking got low performance overall and              so successful. Any kind of generalization attempt resulted in
was clearly outperformed by dependency-based features in                a lower performance with CNNs, which shows their ability
CNNs. Using only POS tags achieved very low performance                 to assemble more complex patterns from tokens and relations


                                                                    9
Table 4: Approximate power usage of the training processes. Non-           car in the European Union in 2019, according to EEA, emits
neural classifiers: i9 7920X, 163W. Neural classifiers: GTX 1080ti,        around 122 g CO2 e per kilometer driven. So when training a
250W. Expecting 100% power usage.
                                                                           simple CNN model, if we calculate the feature densities and
                                                                           leaving out the weaker feature sets before training, we could
         Classifier      Runtime (s)   Power usage (Wh)   Best F1
                                                                           save as much as driving a new car for almost 50 kilometers in
         SGD SVM             176.26              79.81       .798          emissions.
         MLP               53845.89           37392.98     .7958
         Linear SVM         1543.06             698.67     .7941              Instead of having to run all of the experiments, it could be
         L-BFGS LR            321.6             145.61     .7932           useful to first discard the FD ranges of the overall weakest
         Newton LR           249.74             113.08     .7915
         Random Forest      3982.49            1803.18     .7582
                                                                           feature sets. Then running a small subset of the experiments
         XGBoost           17917.74            8112.76     .7523           with a set interval between preprocessing type feature den-
         CNN1              62361.45           43306.56     .7406           sities, look for the FD range with a high performance and
         CNN2              62054.46           43093.37     .7357
         AdaBoost           10425.4            4720.39     .7356           iterate around it by running more experiments with similar
         Naive Bayes          97.54              44.16     .7165           feature densities in order to find the maximum performance.
         KNN                 556.44             251.94     .6711

                                                                           5   Conclusions
that are unusable by other classifiers.
                                                                           In this paper we presented our research on Feature Density
   An interesting discovery is that using raw tokens only
                                                                           and linguistically-backed preprocessing methods, applied in
rarely resulted in the best performance considering the pro-
                                                                           dialog classification and cyberbullying detection. Both con-
posed feature sets. This can be seen from Tables 3 and 5. This
                                                                           cepts are relatively novel to the field. We studied the effect of
proves the effectiveness of using linguistics-based feature en-
                                                                           FD in reducing the number of required experiments iterations
gineering instead of directly using words as features. Also,
                                                                           and analyzed the usage of different linguistically-backed pre-
the performance of one-layer CNN increased significantly
                                                                           processing methods in the context of CB detection.
when using linguistic embeddings, from 0.659 (TOK) F-score
to 0.741 (DEPSTOP). The high scores of dependency-based                       The results indicate that for non-CNN classifiers, there is
feature sets indicate that structural information could be im-             an ideal Feature Density that slightly differs between each
portant.                                                                   classifier. This can be taken into account in future experi-
   In order to compare the usage of linguistic preprocessing               ments in order to save time and computational power when
to modern text classifiers, we fine-tuned RoBERTa [Liu et al.,             running experiments. For CNNs however, there is almost no
2019] on the dataset. This showed an F-score of 0.797, which               correlation between FD and classifier and thus the higher FD
is similar to the highest scores by other models using our                 datasets should also be considered when trying to achieve the
method. Actually, the best score by SGD SVM is 0.798 which                 best performance.
is slightly higher. It is fascinating that a simple method like               Using plain tokens to keep the original words and their
SVM can outperform a complex modern text classifier when                   forms and reducing noise with stopword filtering yielded the
using the right feature set. This shows that traditional, more             best results in general. With some classifiers, adding extra
simple models should not be underestimated as with correct                 information in the form of POS tags also proved useful. For
preparations, they can achieve a similar performance as state-             convolutional neural networks, using dependency based in-
of-the art models and require much less computational power.               formation showed potential and their effect needs to be con-
Possibly, the performance of pretrained language models like               firmed in future research.
RoBERTa could also be increased by feature engineering and                    Although the environmental effect of the method does not
applying embeddings with linguistic information. This needs                seem very significant here, one has to keep in mind that the
to be explored further in future research.                                 tested models were quite simple. Assuming that the method
                                                                           would work with other datasets and more resource intensive
4.4     Environmental Effect                                               classifiers, the savings could be very significant. It could
If the weaker feature sets were to be left out, the power sav-             be useful to only run a subset of the experiments and iter-
ings are approximately 35Wh calculated from Table 4 for                    ate around the most probable performance peak in order to
training the SGD SVM classifier, which is not very much.                   find the maximum performance.
But the classifier was very power efficient to train to begin.                In the near future we will also confirm the potential of
A more impressive result can be seen with CNN, where the                   linguistically-backed preprocessing and Feature Density for
power savings are approximately 21kWh, which is consider-                  other applications and languages. The research further sug-
ably more compared to SVM.                                                 gests that adding linguistic preprocessing can improve the
   In order demonstrate the environmental effect of the                    performance of classifiers, which needs to be also confirmed
method, we will look at the CNN model and its power savings                on current state of the art language models.
(21kWh). According to European Environmental Agency
(EEA) 1 , the average CO2 emissions of electricity generation              References
was 275 g CO2 e/kWh in 2019. Thus the greenhouse gases
emitted during the training of CNN could be estimated to be                [Basavanhally et al., 2010] A. Basavanhally, S. Doyle, and
5.8 kg CO2 e. For comparison, the average new passenger                      A. Madabhushi. Predicting classifier performance with a
                                                                             small training set: Applications to computer-aided diagno-
   1
       https://www.eea.europa.eu/                                            sis and prognosis. In 2010 IEEE International Symposium


                                                                      10
   on Biomedical Imaging: From Nano to Macro, pages 229–              [Johnson et al., 2018] Mark Johnson, Peter Anderson, Mark
   232, 2010.                                                            Dras, and Mark Steedman. Predicting accuracy on large
[Basavanhally et al., 2015] Ajay Basavanhally,          Satish           datasets from smaller pilot data. In Proceedings of the 56th
   Viswanath, and Anant Madabhushi. Predicting classifier                Annual Meeting of the Association for Computational Lin-
   performance with limited training data: Applications to               guistics (Volume 2: Short Papers), pages 450–455, Mel-
   computer-aided diagnosis in breast and prostate cancer.               bourne, Australia, July 2018. Association for Computa-
   PLOS ONE, 10(5):1–18, 05 2015.                                        tional Linguistics.
[Bensusan and Kalousis, 2001] Hilan Bensusan and Alexan-              [King et al., 1995] R. D. King, C. Feng, and A. Suther-
   dros Kalousis. Estimating the predictive accuracy of a                land. Statlog: Comparison of classification algorithms on
   classifier. In Luc De Raedt and Peter Flach, editors, Ma-             large real-world problems. Applied Artificial Intelligence,
   chine Learning: ECML 2001, pages 25–36, Berlin, Hei-                  9(3):289–333, 1995.
   delberg, 2001. Springer Berlin Heidelberg.                         [Komninos and Manandhar, 2016] Alexandros Komninos
[Blachnik, 2017] Marcin Blachnik. Instance selection for                 and Suresh Manandhar. Dependency based embeddings
   classifier performance estimation in meta learning. En-               for sentence classification tasks. In Proceedings of the
   tropy, 19:583, 11 2017.                                               2016 Conference of the North American Chapter of
                                                                         the Association for Computational Linguistics: Human
[Breiman, 2001] Leo Breiman. Random forests. Machine
                                                                         Language Technologies, pages 1490–1500, San Diego,
   Learning, 45(1):5–32, Oct 2001.                                       California, June 2016. Association for Computational
[Bull, 2010] Glen Bull.       The always-connected genera-               Linguistics.
   tion. Learning and Leading with Technology, 38:28–29,              [LeCun et al., 2012] Yann LeCun, Leon Bottou, Genevieve
   November 2010.                                                        Orr., and Klaus-Robert Muller. Efficient BackProp, pages
[Chawla et al., 2002] Nitesh Chawla, Kevin Bowyer,                       9–48. Springer Berlin Heidelberg, Berlin, Heidelberg,
   Lawrence Hall, and Philip Kegelmeyer. Smote: Synthetic                2012.
   minority over-sampling technique. J. Artif. Int. Res.,             [Levy and Goldberg, 2014] Omer Levy and Yoav Goldberg.
   16(1):321–357, June 2002.                                             Dependency-based word embeddings. In ACL, 2014.
[Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin.            [Liu et al., 2019] Yinhan Liu, Myle Ott, Naman Goyal,
   Xgboost: A scalable tree boosting system.           CoRR,             Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
   abs/1603.02754, 2016.                                                 Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta:
[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir                    A robustly optimized bert pretraining approach. arXiv
   Vapnik. Support-vector networks. Machine Learning,                    preprint arXiv:1907.11692, 2019.
   20(3):273–297, Sep 1995.                                           [Michie et al., 1995] Donald Michie, D. J. Spiegelhalter,
[Cotterell and Schütze, 2019] Ryan Cotterell and Hinrich                C. C. Taylor, and John Campbell, editors. Machine Learn-
   Schütze. Morphological word embeddings. CoRR,                        ing, Neural and Statistical Classification. Ellis Horwood,
   abs/1907.02423, 2019.                                                 USA, 1995.
[Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Ken-              [Ptaszynski and Masui, 2018] Michal Ptaszynski and Fu-
   ton Lee, and Kristina Toutanova. Bert: Pre-training of                mito Masui. Automatic Cyberbullying Detection: Emerg-
   deep bidirectional transformers for language understand-              ing Research and Opportunities. IGI Global, 2018.
   ing, 2018.                                                         [Ptaszynski et al., 2017] Michal        Ptaszynski,      Juuso
[Freund and Schapire, 1997] Yoav Freund and Robert E                     Kalevi Kristian Eronen, and Fumito Masui. Learn-
   Schapire. A decision-theoretic generalization of on-line              ing deep on cyberbullying is always better than brute
   learning and an application to boosting. Journal of Com-              force. In LaCATODA 2017 CEUR Workshop Proceedings,
   puter and System Sciences, 55(1):119 – 139, 1997.                     page 3–10, 2017.
[Gama and Brazdil, 1995] J. Gama and P. Brazdil. Charac-              [Ptaszynski et al., 2018] Michał Ptaszynski,         Gniewosz
   terization of classification algorithms. In Carlos Pinto-             Leliwa, Mateusz Piech, and Aleksander Smywiński-Pohl.
   Ferreira and Nuno J. Mamede, editors, Progress in Ar-                 Cyberbullying detection – technical report 2/2018, depart-
   tificial Intelligence, pages 189–200, Berlin, Heidelberg,             ment of computer science agh, university of science and
   1995. Springer Berlin Heidelberg.                                     technology, 2018.
[Hinduja and Patchin, 2010] Sameer Hinduja and Justin                 [Reynolds et al., 2011] Kelly Reynolds, April Edwards, and
   Patchin. Bullying, cyberbullying, and suicide. Archives               Lynne Edwards. Using machine learning to detect cyber-
   of suicide research : official journal of the International           bullying. Proceedings - 10th International Conference on
   Academy for Suicide Research, 14:206–21, 07 2010.                     Machine Learning and Applications, ICMLA 2011, 2, 12
[Hinton et al., 2012] Geoffrey E. Hinton, Nitish Srivastava,             2011.
   Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-              [Scherer et al., 2010] Dominik Scherer, Andread Müller,
   dinov. Improving neural networks by preventing co-                    and Sven Behnke. Evaluation of pooling operations in con-
   adaptation of feature detectors. CoRR, abs/1207.0580,                 volutional architectures for object recognition. In ICANN
   2012.                                                                 2010 Proceedings, Part III, pages 92–101, 01 2010.


                                                                 11
[Schwartz et al., 2019] Roy Schwartz, Jesse Dodge, Noah A.
   Smith, and Oren Etzioni.            Green AI.      CoRR,
   abs/1907.10597, 2019.
[Strubell et al., 2019] Emma Strubell, Ananya Ganesh, and
   Andrew McCallum. Energy and policy considerations for
   deep learning in NLP. CoRR, abs/1906.02243, 2019.
[Ure, 1971] J Ure. Lexical density and register differentia-
   tion. Applications of Linguistics, page 443–452, 1971.


                                                               12
Table 5: F1 for all preprocessings & classifiers; best classifier for each dataset in bold; best preprocessing for each underlined


                                                                                                                       RandomForest
                                                               Linear SVM


                                                                                                       NaiveBayes
                                                Newton LR
                                  LBFGS LR


                                                                               SGD SVM


                                                                                                                                         AdaBoost


                                                                                                                                                        XGBoost


                                                                                                                                                                              CNN1


                                                                                                                                                                                         CNN2
                                                                                              KNN


                                                                                                                                                                     MLP
     CHNK                      0.727         0.726          0.718           0.736         0.57      0.674           0.613             0.649         0.667         0.724    0.657     0.666
     CHNKNERR                  0.688         0.695          0.702           0.699         0.58      0.653           0.603             0.608         0.642         0.704    0.645     0.662
     CHNKNERRALPHA              0.66         0.663          0.651           0.657        0.603      0.626           0.616             0.599         0.653         0.674    0.566        0.6
     CHNKNERRSTOP              0.686         0.684          0.684           0.694        0.577      0.629           0.635             0.621         0.652         0.693    0.402     0.344
     CHNKNERRSTOPALPHA         0.618         0.617          0.591           0.607        0.404      0.598            0.62             0.582         0.648         0.623    0.451      0.34
     CHNKNER                   0.718         0.723          0.721           0.737        0.582      0.669           0.603              0.63         0.673         0.722    0.654     0.642
     CHNKNERALPHA              0.675         0.676          0.663           0.663        0.599      0.641           0.618             0.609         0.649         0.684    0.557     0.614
     CHNKNERSTOP               0.724         0.724          0.715           0.724        0.582      0.663           0.635             0.652         0.679          0.72    0.501     0.298
     CHNKNERSTOPALPHA          0.666         0.661          0.644           0.668        0.386      0.615           0.659             0.625         0.656         0.647    0.431     0.406
     CHNKALPHA                 0.684         0.681          0.669           0.683        0.607      0.643           0.647             0.616         0.676         0.695    0.587     0.583
     CHNKSTOP                  0.722         0.721          0.711           0.723        0.577       0.67           0.667             0.648         0.679         0.715    0.386     0.342
     CHNKSTOPALPHA             0.629         0.637          0.606           0.619        0.395      0.608           0.649             0.654         0.664         0.628    0.455     0.374
     DEP                       0.617         0.619          0.568           0.587        0.243      0.617           0.536             0.566         0.598         0.594    0.682     0.694
     DEPNERR                    0.61         0.614          0.571           0.587        0.241      0.611           0.533             0.562         0.596         0.595     0.67     0.695
     DEPNERRALPHA              0.606         0.605          0.589           0.602        0.312      0.596           0.537             0.556         0.595         0.593    0.585     0.622
     DEPNERRSTOP               0.602         0.599          0.564           0.568        0.273      0.615           0.543             0.572            0.6        0.578    0.726     0.702
     DEPNERRSTOPALPHA          0.584         0.584           0.56           0.581        0.386      0.599           0.544             0.561         0.595         0.574    0.583     0.619
     DEPNER                    0.624         0.621          0.574           0.585        0.242      0.611           0.528             0.564         0.595         0.592    0.686     0.692
     DEPNERALPHA               0.585         0.589          0.561           0.579        0.213      0.607           0.578             0.497         0.593         0.603    0.606     0.623
     DEPNERSTOP                0.611         0.602          0.564           0.576        0.274      0.604           0.527             0.563         0.604         0.577    0.725     0.708
     DEPNERSTOPALPHA           0.535         0.531          0.523           0.523        0.297      0.543           0.563             0.422         0.576         0.564     0.63     0.632
     DEPALPHA                  0.609         0.612          0.588           0.601        0.314         0.6          0.545             0.552         0.604         0.598    0.606      0.62
     DEPSTOP                   0.606         0.595          0.562           0.571        0.276      0.616           0.544             0.576         0.603         0.584    0.741     0.648
     DEPSTOPALPHA              0.586         0.587          0.564           0.588        0.388      0.594           0.539             0.568         0.595         0.578    0.629     0.625
     LEM                       0.781         0.786          0.784            0.79        0.634      0.715           0.724              0.72         0.744         0.786     0.67     0.665
     LEMNERR                    0.74         0.737          0.742            0.74        0.601      0.692           0.697             0.683         0.724         0.749    0.658     0.663
     LEMNERRALPHA              0.729         0.728          0.725           0.725        0.614      0.685           0.699              0.68          0.71          0.74    0.645     0.652
     LEMNERRSTOP               0.737         0.734          0.726           0.732        0.609      0.682           0.727              0.69          0.72         0.741    0.371     0.364
     LEMNERRSTOPALPHA          0.732         0.732          0.714           0.727        0.624      0.674           0.723             0.682         0.704         0.737    0.372     0.348
     LEMPOSS                   0.764         0.765          0.769           0.767        0.564      0.713           0.658             0.679         0.717         0.773    0.662     0.736
     LEMPOSSALPHA               0.76         0.758          0.753           0.758        0.406      0.705           0.669             0.674         0.712         0.756    0.603     0.715
     LEMPOSSSTOP               0.763         0.766          0.767           0.774        0.566      0.709           0.706             0.691          0.72         0.773    0.683     0.725
     LEMPOSSSTOPALPHA          0.762         0.766          0.748           0.765         0.49      0.702           0.713             0.681         0.714         0.757    0.593     0.716
     LEMNER                    0.784         0.782          0.787           0.792        0.631       0.71           0.716              0.72         0.742          0.78     0.68     0.613
     LEMNERALPHA               0.763         0.764          0.765           0.767        0.637      0.699            0.71             0.707         0.742         0.768    0.662     0.671
     LEMNERSTOP                0.782         0.783          0.782           0.792        0.634      0.706           0.745             0.725         0.742          0.78    0.429     0.378
     LEMNERSTOPALPHA            0.77         0.767          0.752           0.767         0.64      0.693           0.739             0.716         0.738         0.768     0.46     0.414
     LEMPOS                    0.778         0.778          0.788            0.79        0.517      0.711           0.663             0.727         0.741         0.783    0.665      0.64
     LEMPOSALPHA               0.768         0.772          0.772           0.768        0.522         0.7          0.654             0.713         0.727         0.775    0.664     0.695
     LEMPOSSTOP                 0.78         0.781          0.788           0.788        0.642      0.708           0.708             0.721         0.735         0.783    0.715     0.707
     LEMPOSSTOPALPHA            0.77         0.769          0.766           0.768        0.669      0.696           0.718             0.722          0.73         0.778    0.669     0.698
     LEMALPHA                  0.755         0.764          0.745           0.765        0.294      0.703           0.718             0.705         0.748         0.754     0.61     0.651
     LEMSTOP                   0.787         0.786          0.784           0.791        0.641      0.713           0.754             0.732         0.752         0.789    0.403     0.327
     LEMSTOPALPHA              0.772         0.766          0.766           0.773        0.357      0.702           0.747             0.712         0.745         0.764    0.377     0.329
     POSS                      0.487         0.487          0.488           0.491        0.522      0.498           0.556             0.509         0.555         0.488     0.54     0.536
     POSSALPHA                 0.488         0.486          0.488           0.498        0.526      0.498           0.552             0.518         0.549         0.493    0.538     0.534
     POSSSTOP                  0.477         0.477          0.471           0.467        0.518      0.486            0.54             0.496         0.533         0.484    0.431     0.434
     POSSSTOPALPHA             0.469          0.47          0.471           0.465        0.517      0.478           0.525             0.484         0.511         0.491    0.428     0.484
     TOK                       0.793         0.788          0.793           0.796        0.632      0.716           0.711             0.728         0.748         0.796    0.659     0.661
     TOKNERR                   0.741         0.744          0.737           0.743           0.6     0.696           0.688             0.671         0.719         0.749    0.655     0.631
     TOKNERRALPHA              0.734         0.735          0.735            0.73        0.624      0.683           0.681             0.674         0.704         0.748    0.626     0.655
     TOKNERRSTOP               0.736         0.736          0.728           0.732        0.609       0.68            0.73             0.678          0.71         0.751    0.406     0.317
     TOKNERRSTOPALPHA          0.728         0.731          0.727           0.723        0.623      0.675           0.721              0.68         0.698         0.744    0.412     0.394
     TOKPOSS                   0.766         0.768          0.767           0.783        0.549      0.715           0.648             0.671         0.715         0.773    0.686     0.729
     TOKPOSSALPHA              0.765         0.761          0.763           0.767        0.378      0.709           0.662             0.656         0.709         0.769    0.643     0.658
     TOKPOSSSTOP               0.763         0.765          0.767           0.773        0.563      0.704           0.703             0.684         0.724         0.771    0.675     0.722
     TOKPOSSSTOPALPHA          0.774         0.773          0.774           0.771        0.671      0.694           0.722             0.713          0.73         0.779     0.68     0.698
     TOKNER                    0.789         0.785          0.788           0.789        0.609      0.708           0.703             0.722         0.745         0.784    0.684      0.68
     TOKNERALPHA               0.768         0.771          0.763           0.776        0.628      0.696           0.701             0.705         0.746         0.775    0.649     0.648
     TOKNERSTOP                0.785         0.791           0.79            0.79        0.635      0.703           0.732             0.721         0.743          0.79    0.444     0.367
     TOKNERSTOPALPHA           0.773         0.771          0.762           0.774        0.646      0.691           0.737             0.704          0.74         0.771    0.371     0.379
     TOKPOS                    0.781         0.783          0.791           0.798        0.565      0.713           0.656              0.72         0.739         0.787    0.626     0.705
     TOKPOSALPHA               0.775         0.775          0.778           0.784        0.576      0.699           0.653             0.705         0.731         0.783    0.633     0.698
     TOKPOSSTOP                0.786         0.783          0.794           0.792        0.645         0.7          0.711             0.733         0.739         0.789    0.706     0.691
     TOKPOSSTOPALPHA           0.759         0.766          0.756           0.762        0.458      0.696           0.706             0.679         0.674         0.601    0.734     0.718
     TOKALPHA                  0.768         0.768          0.757           0.773        0.271      0.705           0.721             0.705         0.742         0.756    0.643     0.652
     TOKSTOP                   0.793          0.79          0.784           0.794        0.644      0.708           0.758             0.736         0.749         0.787    0.355     0.321
     TOKSTOPALPHA              0.775         0.776          0.766           0.776        0.342         0.7          0.745             0.714         0.744         0.765    0.452     0.425


                                                                                         13
     (a) Alphabetic filtering (red) vs others (blue)              (b) Alphabetic filtering (red) vs others (blue)


            (c) NER (red) vs others (blue)                               (d) NER (red) vs others (blue)


             (e) POS (red) vs others (blue)                               (f) POS (red) vs others (blue)


      (g) Stopword filtering (red) vs others (blue)                (h) Stopword filtering (red) vs others (blue)


(i) TOK (red), LEM (green), CHNK (yellow), DEP (blue)        (j) TOK (red), LEM (green), CHNK (yellow), DEP (blue)

                           Figure 1: FD & F1 score for SGD SVM (left) and CNN1 (right)

                                                        14

</pre>