=Paper=
{{Paper
|id=Vol-2935/paper1
|storemode=property
|title=Exploring the Potential of Feature Density in Estimating Machine Learning Classifier Performance with Application to Cyberbullying Detection
|pdfUrl=https://ceur-ws.org/Vol-2935/paper1.pdf
|volume=Vol-2935
|authors=Juuso Eronen,Michal Ptaszynski,Fumito Masui,Gniewosz Leliwa,Michal Wroczynski
}}
==Exploring the Potential of Feature Density in Estimating Machine Learning Classifier Performance with Application to Cyberbullying Detection==
Exploring the Potential of Feature Density in Estimating Machine Learning Classifier Performance with Application to Cyberbullying Detection Juuso Eronen1 , Michal Ptaszynski1 , Fumito Masui1 , Gniewosz Leliwa2 and Michal Wroczynski2 1 Kitami Institute of Technology, Japan 2 Samurai Labs, Poland eronen.juuso@gmail.com, {ptaszynski, f-masui}@cs.kitami-it.ac.jp, {gniewosz.leliwa, michal.wroczynski}@samurailabs.ai Abstract the classifier performance this way, it is multiple times more costly. In this research, we analyze the potential of Feature Previously, there have been some attempts to estimate the Density (FD) as a way to comparatively estimate performance of a ML model before any training. One pro- machine learning (ML) classifier performance prior posal to this problem is using meta-learning and training a to training. The goal of the study is to aid in solving model using dataset characteristics to estimate classifier per- the problem of resource-intensive training of ML formance [Gama and Brazdil, 1995]. Another approach is models which is becoming a serious issue due to extrapolating results from small datasets to simulate the per- continuously increasing dataset sizes and the ever formance using larger datasets [Basavanhally et al., 2010]. rising popularity of Deep Neural Networks (DNN). The importance of resolving this issue comes not only from The issue of constantly increasing demands for the increased computational requirements, but also from its more powerful computational resources is also af- environmental effect. This is directly caused by the increased fecting the environment, as training large-scale ML popularity of the fields of Artificial Intelligence (AI) and ML. models are causing alarmingly-growing amounts of Training classifiers on large datasets is both time consuming CO2 emissions. Our approach is to optimize the and computationally intensive while leaving behind a notice- resource-intensive training of ML models for Nat- able carbon footprint [Strubell et al., 2019]. To move towards ural Language Processing to reduce the number of greener AI [Schwartz et al., 2019], it is necessary to inspect required experiments iterations. We expand on pre- the core of ML methods and find potential points of improve- vious attempts on improving classifier training ef- ment. In order to save computational power and reduce emis- ficiency with FD while also providing an insight sions, it would be useful to roughly estimate classifier perfor- to the effectiveness of various linguistically-backed mance prior to training. feature preprocessing methods for dialog classifica- The ability to estimate classifier performance before the tion, specifically cyberbullying detection. training would also have important practical implications. In dialog agent applications, one of the areas where the need for this is becoming more urgent is in forum moderation, 1 Introduction specifically the detection of harmful and abusive behaviour One of the challenges in machine learning (ML) has always observed online, known as cyberbullying (CB). The number been estimating how well different classification algorithms of CB cases has been constantly growing since the increase will perform with a given dataset. Although there are classi- of the popularity of Social Networking Services (SNS) [Hin- fiers that tend to be highly effective on a variety of different duja and Patchin, 2010; Ptaszynski and Masui, 2018]. The problems, they might be easily outperformed by others on a consequences of unattended cases of online abuse are known dataset specific scale. As it is difficult to identify a classifier to be serious, leading the victims to self mutilation, or even that would perform best with every kind of dataset [Michie et suicides, or on the opposite, to attacking their offenders in al., 1995], it comes down to the user (researcher, or ML prac- revenge. Being able to roughly estimate which classifier set- titioner) to determine experimentally, which classifier could tings can be rejected, would make the process of implemen- be appropriate based on their knowledge of the field and pre- tation of automatic cyberbullying detection for various lan- vious experiences. guages and social networking platforms more efficient. A common way when estimating the performance of dif- To contribute to that, we conduct an in-depth analysis of ferent classifiers is to select a variety of possible classifiers the effectiveness of FD proposed previously by [Ptaszynski et to experiment on and train them using cross-validation to al., 2017] to comparatively estimate the performance of dif- aid in getting the best possible average estimations of their ferent classifiers before training. We also analyze the effec- performances. With a sufficiently small dataset and using a tiveness of various linguistically-backed feature preprocess- computationally efficient algorithm, this approach works very ing methods, including lemmas, Named Entity Recognition well. Even though it is possible to get accurate estimations of (NER) and dependency information-based features, with an Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 5 application to automatic cyberbullying detection. by the number of all words in the corpus. The score is called Feature Density as it also includes other features, like parts- 2 Previous Research of-speech or dependency information, in addition to words. In this research, after calculating FD for all applied dataset 2.1 Classifier Performance Estimation preprocessing methods we calculated Pearson’s correlation [Gama and Brazdil, 1995] proposed that classifier perfor- coefficient (ρ-value) between dataset generalization (FD) and mance could be estimated by training a regression model classifier results (F-scores). If ideal ranges of FD can be based on meta-level characteristics of a dataset. The charac- identified, or FD has a positive or negative correlation with teristics used included simple measures like number of exam- classifier performance, it could be useful in comparatively ples and number of attributes, statistical measures like stan- estimating the performance of various classifiers. For ex- dard deviation ratio and various information based measures ample, [Ptaszynski et al., 2017] showed that CNNs benefit like class entropy. These measures are defined in the STAT- from higher FD while other classifiers’ score was usually LOG project [King et al., 1995]. higher when using lower FD datasets. This suggests that it This meta-learning approach was taken further by [Ben- could be possible to improve the performance of CNNs by susan and Kalousis, 2001] who introduced the Landmark- increasing the FD of the applied dataset, while other classi- ing method, using learners themselves to characterize the fiers could achieve higher scores by lowering FD [Ptaszynski datasets. This means using computationally non-demanding et al., 2017]. classifiers, like Naive Bayes (NB), to obtain important in- In practice, we attempt to estimate what feature engineer- sights about the datasets. The method outperformed the pre- ing methods can achieve the highest performance for differ- vious characterization method and had moderate success in ent models in different languages. The method lets us ignore ranking learners. redundant feature sets for a particular classifier or language Later, [Blachnik, 2017] improved on the Landmarking and only keep the ones with the highest performance poten- method by proposing the use of information from instance tial without actually training any models. selection methods as landmarks. These instance selection methods are most commonly used for cleaning the dataset 2.3 Linguistically-backed Preprocessing reducing it size by removing redundant information. They Almost without exception, the word embeddings are learned discovered that the relation between the original and reduced from pure tokens (words) or lemmas (unconjugated forms of datasets can be used as a landmark to lower the error rates words). This is also the case with the recently popularized when predicting classifier performance. pre-trained language models like BERT [Devlin et al., 2018]. Another approach to predicting classifier performance is to To the best of our knowledge, embeddings backed with lin- extrapolate results from a smaller dataset to simulate the per- guistic information have not yet been researched extensively, formance of a larger dataset. [Basavanhally et al., 2010] at- with only a handful of related work attempting to explore the tempted to predict classifier performance in the field of com- subject [Levy and Goldberg, 2014; Komninos and Manand- puter aided diagnostics, where data is very often limited in har, 2016; Cotterell and Schütze, 2019]. quantity. Their experiments showed that using a repeated ran- To further investigate the potential of capturing deeper re- dom sampling method on small datasets to make predictions lations between lexical items and structures and to filter out on a larger set tended to have high error rates and should not redundant information, we propose to preserve the morpho- be generalized as holding true when large amounts of data be- logical, syntactic and other types of information by adding come available. Later, [Basavanhally et al., 2015] improved linguistic information to the pure tokens or lemmas. This this method by utilizing it together with cross-validation sam- means, for example, including parts-of-speech or dependency pling strategy, which resulted in lower error rates. information within the used lexical features. These combina- In the field of NLP, [Johnson et al., 2018] applied the ex- tions would then be used to train the word embeddings. The trapolation method to document classification using the fast- method could be later applied to the pre-training of huge lan- Text classifier. They discovered that biased power law model guage models to possibly improve their performance. The with binomial weights works as a good baseline extrapolation preprocessing methods are described in-depth in section 3.2. model for NLP tasks. Instead of concentrating on meta information of the dataset 3 Dataset and Learners or performance simulation, our research directly targets fea- ture engineering and the relation between the available fea- 3.1 Dataset ture space and classifier performance. This novel method that We tested the concept of FD on the Kaggle Formspring can be utilized together with the existing methods to better Dataset for Cyberbullying Detection [Reynolds et al., 2011]. estimate the performance of different classifiers. However, the original dataset had a problem of being anno- tated by laypeople, whereas it has been pointed out before 2.2 Feature Density that datasets for topics such as online harassment and cy- The concept of Feature Density (FD) was introduced by berbullying should be annotated by experts [Ptaszynski and [Ptaszynski et al., 2017] based on the notion of Lexical Den- Masui, 2018]. Therefore in our research we applied the ver- sity [Ure, 1971] from linguistics. It is a score representing sion of the dataset after re-annotation with the help of highly an estimated measure of content per lexical units for a given trained data annotators with sufficient psychological back- corpus, calculated as the number of all unique words divided ground to assure high quality of annotations [Ptaszynski et 6 Table 1: Statistics of the dataset after improved annotation. • Lemmatization: like the above but with generic (dictio- nary) forms of words (“lemmas”) (later: LEM). Element type Value • Parts of speech (separate): parts of speech information Number of samples 12,772 is added in the form of separate features (later: POSS). Number of CB samples 913 • Parts of speech (combined): parts of speech informa- Number of non-CB samples 11,859 tion is merged with other applied features (later: POS). Number of all tokens 301,198 • Named Entity Recognition (without replacement): Number of unique tokens 18,394 information on what named entities (private name of a Avg. length (chars) of a post (Q+A) 12.1 person, organization, numericals, etc.) appear in the sen- Avg. length (words) of a post (Q+A) 23.6 tence are added to the applied word (later: NER). Avg. length (chars) of a question 61.6 • Named Entity Recognition (with replacement): same Avg. length (words) of a question 12 as above but information replaces the applied word Avg. length (chars) of a answer 58.5 (later: NERR). Avg. length (words) of a answer 11.5 • Dependency structure: noun- and verb-phrases with Avg. length (chars) of a CB post 12.1 syntactic relations between them (later: DEP). Avg. length (words) of a CB post 22.9 • Chunking: like above but without dependency relations Avg. length (chars) of a non-CB post 13.9 (“chunks”, later: CHNK). Avg. length (words) of a non-CB post 24.7 • Stopword filtering: redundant words are filtered out us- ing spaCy’s stopword list for English (later: STOP) • Filtering of non-alphabetics: non-alphabetic charac- al., 2018]. Cyberbullying is a phenomenon observed in many ters are filtered out (later: ALPHA) SNS. It is defined as using online means of communication to harass and/or humiliate individuals. This can include slurry 3.3 Feature Extraction comments about someone’s looks or personality or spreading We generated a Bag-of-Words language model from each sensitive or false information about individuals. This prob- of the 68 processed dataset versions. This resulted in sepa- lem has existed throughout the time of communication via rate models for each of the datasets (Bag-of-Words, Bag-of- Internet between people but has grown extensively with the Lemmas, Bag-of-POS, etc.). Next, we applied a weighting advent of communication devices that can be used on-the- scheme, term frequency with inverse document frequency or go like smartphones and tablets. Users’ realization of the tf ∗ idf . anonymity of online communications is one of the factors that When training a Convolutional Neural Network model, the make this activity attractive for bullies since they rarely face embeddings were trained as a part of the network for all of the consequences of their improper behavior [Bull, 2010]. The described datasets. Similarly to other classifiers, we trained a problem has been growing with the popularity of SNS. separate model for each of the 68 datasets (Word/token Em- Table 1 reports some key statistics of the current annota- beddings, Lemmas Embeddings, POS Embeddings, Chunks tion of the dataset. The dataset contains approximately 300 Embeddings, etc.). The embeddings were trained as part of thousand of tokens. There were no visible differences in the network using Keras’ embedding layer with random ini- length between the posted questions and answers (approx. 12 tial weights, meaning no pretraining was used. words). On the other hand, the harmful (CB) samples were usually slightly shorter than the non-harmful (non-CB) sam- 3.4 Classification ples (approx. 23 vs. 25 words). The number of harmful sam- ples was small, amounting to 7%, which roughly reflects the We used two variants of Support Vector Machine [Cortes and amount of profanity on SNS [Ptaszynski and Masui, 2018]. Vapnik, 1995], linear-SVM and linear-SVM with SGD op- timizer. We also used two different solvers for Logistic Re- 3.2 Preprocessing gression (LR), Newton and L-BFGS. We also used both Ad- aBoost [Freund and Schapire, 1997] and XGBoost [Chen and In order to train the linguistically-backed embeddings, we Guestrin, 2016]. Other classifiers applied include Random first preprocessed the dataset in various ways, similarly to Forest [Breiman, 2001], kNN, Naı̈veBayes, Multilayer Per- [Ptaszynski et al., 2017]. This was done to verify the cor- ceptron (MLP) And Convolutional Neural Network (CNN). relation between the classification results and Feature Den- In this experiment MLP refers to a network using regu- sity (FD) and to verify the performance of various versions of lar dense layers. We applied an MLP implementation with the proposed linguistically-backed embeddings. The prepro- Rectified Linear Units (ReLU) as a neuron activation func- cessing was done using spaCy NLP toolkit (https://spacy.io/). tion [Hinton et al., 2012] and one hidden layer with dropout After assembling combinations from the listed preprocessing regularization which reduces overfitting and improves gener- types, we ended up with a total of 68 possible preprocessing alization by randomly dropping out some of the hidden units methods for the experiments. The FDs for all separate pre- during training [Hinton et al., 2012]. processing types used in this research were shown in Table We applied a CNN implementation with Rectified Linear 2. Units (ReLU) as a neuron activation function, and max pool- • Tokenization: includes words, punctuation marks, etc. ing [Scherer et al., 2010], which applies a max filter to non- separated by spaces (later: TOK). overlying sub-parts of the input to reduce dimensionality and 7 Table 3: Classifiers with best F1, preprocessing type and Pearson’s correlation coefficient for FD and F1. Table 2: Feature Density of preprocessing types. Classifier Best F1 Best PP type ρ(F1, FD) SGD SVM .798 TOKPOS -.8239 Preprocessing type Uniq.1grams All1grams FD MLP .7958 TOK -.8599 Linear SVM .7941 TOKPOSSTOP -.834 POS 18 357616 .0001 L-BFGS LR .7932 TOKSTOP -.8024 POSALPHA 18 357616 .0001 Newton LR .7915 TOKNERSTOP -.8097 POSSTOP 18 194606 .0001 RandomForest .7582 TOKSTOP -.7873 POSSTOPALPHA 17 129076 .0001 XGBoost .7523 LEMSTOP -.8303 LEMPOSSALPHA 17875 579664 .0308 CNN1 .7406 DEPSTOP .1633 LEMPOSS 21238 660653 .0321 CNN2 .7357 LEMPOSS .0951 TOKPOSSALPHA 21737 579624 .0375 AdaBoost .7356 TOKSTOP -.7362 TOKPOSS 25122 660612 .038 NaiveBayes .7165 TOK -.7531 LEMNERALPHA 14815 289868 .0511 KNN .6711 TOKPOSSSTOPALPHA -.7116 LEMNERR 17327 309124 .0561 CHNKNERRALPHA 12293 215096 .0572 LEMNERRALPHA 17877 305481 .0585 CHNKNERALPHA 14007 228146 .0614 in effect correct overfitting. We also applied dropout regu- LEMALPHA 17860 289868 .0616 larization on penultimate layer. We applied two versions of LEMPOSSSTOP 20948 334870 .0626 TOKNERRALPHA 18595 289828 .0642 CNN. First, with one hidden convolutional layer containing CHNKALPHA 13991 215096 .065 128 units. The second version consisted of two hidden con- LEMNER 21239 325173 .0653 LEMPOSSSTOPALPHA 17554 258103 .068 volutional layers, containing 128 feature maps each, with 4x4 TOKNERR 21119 309084 .0683 size of patch and 2x2 max-pooling, and Adaptive Moment LEM 21222 308434 .0688 Estimation (Adam), a variant of Stochastic Gradient Descent TOKNERALPHA 21737 305441 .0712 TOKPOSSSTOP 24472 334869 .0731 [LeCun et al., 2012]. LEMPOS 26232 357657 .0733 TOKALPHA 21722 289828 .0749 LEMPOSALPHA 22206 289868 .0766 4 Experiments TOKNER 25121 325132 .0773 TOK 25106 308393 .0814 4.1 Setup TOKPOSSSTOPALPHA 21037 258103 .0815 TOKPOS 31121 357616 .087 The preprocessed dataset provides 68 separate datasets and TOKPOSALPHA 27013 289828 .0932 the experiment was performed once for each preprocessing LEMNERRSTOPALPHA 14509 129076 .1124 LEMNERRSTOP 17047 146549 .1163 type. Each of the classifiers (sect. 3.4) were tested on each LEMNERSTOPALPHA 17557 142289 .1234 version of the dataset in a 10-fold cross validation proce- CHNKNERR 33025 262529 .1258 dure. This gives us an opportunity to evaluate how effec- LEMNERRSTOPALPHA 20950 160269 .1307 LEMPOSSTOP 25669 194674 .1319 tive different preprocessing methods are for each classifier. LEMSTOPALPHA 17540 129076 .1359 As the dataset was not balanced, we oversampled the minor- TOKNERRSTOPALPHA 17911 129076 .1387 ity class using Synthetic Minority Over-sampling Technique CHNKNER 38044 272581 .1396 TOKNERRSTOP 20480 146549 .1397 (SMOTE) [Chawla et al., 2002]. The preprocessing methods LEMSTOP 20933 145866 .1435 represent a wide range of Feature Densities, which can be CHNKNERSTOPALPHA 13356 92782 .144 CHNKNERRSTOPALPHA 11656 80896 .1441 used to evaluate the correlation with classifier performance. CHNK 38029 261990 .1452 TOKNERSTOPALPHA 21037 142289 .1478 4.2 Effect of Feature Density TOKNERSTOP 24471 160268 .1527 TOKPOSSTOP 30040 194673 .1543 We analyzed the correlation of Feature Density with each TOKSTOPALPHA 21022 129076 .1629 of the classifiers using the proposed preprocessing methods. CHNKSTOPALPHA 13340 80896 .1649 LEMPOSSTOPALPHA 21626 129076 .1675 The results are represented in Table 5. As the results for us- TOKSTOP 24456 145865 .1677 ing only parts-of-speech tags, which had the lowest FD by TOKPOSSTOPALPHA 25925 129076 .2009 CHNKNERRSTOP 32452 126357 .2568 far, were extremely low (close to a coinflip). Thus, we can CHNKNERSTOP 37462 135357 .2768 say that POS tags alone do not contain enough information to CHNKSTOP 37447 125824 .2976 successfully classify the entries. DEPNERALPHA 95404 240302 .397 DEPNERRALPHA 94928 215096 .4413 After excluding the preprocessing methods that only used DEPALPHA 95386 215096 .4435 POS tags, we can see that all classifiers, except CNNs have DEPNER 143197 321835 .4449 DEPNERSTOPALPHA 47159 104940 .4494 a strong negative correlation with Feature Density. So these DEPNERR 141479 308704 .4583 classifiers seem to have a weaker performance if a lot of lin- DEP 143179 308704 .4638 guistic information is added, and the best results being usu- DEPNERSTOP 94539 184130 .5134 DEPNERRSTOP 92730 172086 .5389 ally within the range of .05 to .15 FD depending on the clas- DEPSTOP 94521 172086 .5493 sifier. This range includes 38 of the 68 preprocessing meth- DEPNERRSTOPALPHA 46552 80896 .5755 ods (Table 2), meaning that the total training time could be DEPSTOPALPHA 47141 80896 .5827 reduced by around 40-50%. This can be seen from, for exam- ple, the highest performing classifier, SVM with SGD opti- mizer (Figure 1), where the maximum classifier performance starts high at around .05 and slowly falls until .14 after which 8 there is a noticeable drop. The performance only falls further and thus it should be only used as a supplement to other meth- as the FD rises. ods. For CNNs however, there was a very weak positive or no Stopword filtering seemed to be the one of the most effec- correlation between FD and the classifier performance, with tive preprocessing techniques for traditional classifiers, which the higher FD datasets performing equally or even slightly can be seen from Table 3 as it was used in the majority of better when comparing to the low FD datasets. Taking a look the highest scores. The problem with stopwords was that at one layer CNN’s performance, which was better than the the scores fluctuated a lot, having both low and very high CNN with two layers, we can see from Figure 1 that the max- scores and scoring high mostly with Logistic Regression and imum performance starts at a moderate level and stays more all of the tree based classifiers. An important thing to note is stable throughout the whole range of feature densities. The that the preprocessing method had extremely polarized per- most potential ranges of FD are between .05 to .1 and after formance with CNNs, scoring either very high or low. Over- .45. The potential training time reduction seems to be similar, all, stopwords yielded the most top scores of any preprocess- around 40-50% The reduction in training time could be es- ing method considering all the classifiers. pecially important when considering demanding models like Another very effective preprocessing method was Parts of Neural Networks. Speech merging (POS), which achieved high performance The results suggest that for non-CNN classifiers there is overall when added to TOK or LEM. The method also got no need to consider preprocessings with a high FD, such as the highest scores with multiple classifiers, especially SVMs. chunking or dependencies, as they had a considerably lower Adding parts-of-speech information to the respective words performance. The performance seems to start falling rapidly achieved a higher score than using them as a separate feature. at around F D = .15 with most of the classifiers. For CNNs, This keeps the information directly connected to the word it- high performances were recorded on both low and high FDs. self, which seems a better option when preserving informa- This means that there is potential in the higher FD prepro- tion. cessing types, namely, dependencies for CNNs. Using Named Entity Recognition reduced the classifier The reason for CNNs relatively low performance could be performance most of the time, only achieving a high score explained by the relatively small size of the dataset, especially with one classifier, Newton-LR. The performance of using when considering the amount of actual cyberbullying entries, NER seemed clearly inferior compared to stopwords or POS as adding even a second layer to the network already caused information. Replacing words with their NER information a loss of the most valuable features and ended up degrad- seems to cause too much information loss and reduces the ing performance. With such small amount of data, it doesn’t performance when comparing to plain tokens. Attaching seem useful to train deep learning models to solve the clas- NER information to the respective words did not improve the sification problem. Still, the dependency based features are performance in most cases but still performed better than re- showing potential with CNNs. With a considerably larger placement. These results are different to [Ptaszynski et al., dataset and more computational power, it could be possible 2017], who noticed that NER helped most of the times for to outperform other classifiers and the usage of tokens with cyberbullying (CB) detection in Japanese. This could come dependency based features when using deep learning. from the fact that CB is differently realized in those lan- The experiments show that changing Feature Density in guages. In Japan, revealing victim’s personal information, moderate amount can yield good results when using other or “doxxing” is known to be one of the most often used form classifiers than CNNs. However, excessive changes to ei- of bullying, thus NER, which can pin-point information such ther too low or too high always showed diminishing results. as address or phone number often help in classification, while The treshold was in all cases approximately between 50% this is not the case in English. and 200% of the original density (TOK), most optimal FDs Filtering out non-alphabetic characters also reduced the only slightly varying with each classifier. The exception be- classifier performance most of the time and also got a high ing Random Forest [Breiman, 2001], which showed a clear score with only one classifier, kNN, which was the weakest spike at around .12 FD. As the usage of high Feature Density classifier overall. Non-alphabetic tokens seem to carry useful datasets showed potential with CNNs, their usage needs to information, at least in the context of cyberbullying detec- be confirmed in future research. Also, more exact ideal fea- tion, as removing them reduced the performance comparing ture densities need to be confirmed for each classifier using to plain tokens due to information loss. datasets of different sizes and fields to make a more accurate Trying to generalize the feature set ended up lowering the ranking of classifiers by FD possible. results in most cases with the exception of the very high scores of stopword filtering using traditional classifiers. This 4.3 Analysis of Linguistically-backed would mean that the stopword filter sometimes succeeded Preprocessing in removing noise and outliers from the dataset while other From the results it can be seen that most of the classifiers generalization methods ended up cutting useful information. scored highest on pure tokens. CNNs also performed quite Adding information to tokens could be useful in some scenar- well on the dependency-based preprocessings. Using lemmas ios as was shown with parts-of-speech tags and using depen- usually got slightly lower scores than tokens probably due to dency information with CNNs, although using NER was not information loss. Chunking got low performance overall and so successful. Any kind of generalization attempt resulted in was clearly outperformed by dependency-based features in a lower performance with CNNs, which shows their ability CNNs. Using only POS tags achieved very low performance to assemble more complex patterns from tokens and relations 9 Table 4: Approximate power usage of the training processes. Non- car in the European Union in 2019, according to EEA, emits neural classifiers: i9 7920X, 163W. Neural classifiers: GTX 1080ti, around 122 g CO2 e per kilometer driven. So when training a 250W. Expecting 100% power usage. simple CNN model, if we calculate the feature densities and leaving out the weaker feature sets before training, we could Classifier Runtime (s) Power usage (Wh) Best F1 save as much as driving a new car for almost 50 kilometers in SGD SVM 176.26 79.81 .798 emissions. MLP 53845.89 37392.98 .7958 Linear SVM 1543.06 698.67 .7941 Instead of having to run all of the experiments, it could be L-BFGS LR 321.6 145.61 .7932 useful to first discard the FD ranges of the overall weakest Newton LR 249.74 113.08 .7915 Random Forest 3982.49 1803.18 .7582 feature sets. Then running a small subset of the experiments XGBoost 17917.74 8112.76 .7523 with a set interval between preprocessing type feature den- CNN1 62361.45 43306.56 .7406 sities, look for the FD range with a high performance and CNN2 62054.46 43093.37 .7357 AdaBoost 10425.4 4720.39 .7356 iterate around it by running more experiments with similar Naive Bayes 97.54 44.16 .7165 feature densities in order to find the maximum performance. KNN 556.44 251.94 .6711 5 Conclusions that are unusable by other classifiers. In this paper we presented our research on Feature Density An interesting discovery is that using raw tokens only and linguistically-backed preprocessing methods, applied in rarely resulted in the best performance considering the pro- dialog classification and cyberbullying detection. Both con- posed feature sets. This can be seen from Tables 3 and 5. This cepts are relatively novel to the field. We studied the effect of proves the effectiveness of using linguistics-based feature en- FD in reducing the number of required experiments iterations gineering instead of directly using words as features. Also, and analyzed the usage of different linguistically-backed pre- the performance of one-layer CNN increased significantly processing methods in the context of CB detection. when using linguistic embeddings, from 0.659 (TOK) F-score to 0.741 (DEPSTOP). The high scores of dependency-based The results indicate that for non-CNN classifiers, there is feature sets indicate that structural information could be im- an ideal Feature Density that slightly differs between each portant. classifier. This can be taken into account in future experi- In order to compare the usage of linguistic preprocessing ments in order to save time and computational power when to modern text classifiers, we fine-tuned RoBERTa [Liu et al., running experiments. For CNNs however, there is almost no 2019] on the dataset. This showed an F-score of 0.797, which correlation between FD and classifier and thus the higher FD is similar to the highest scores by other models using our datasets should also be considered when trying to achieve the method. Actually, the best score by SGD SVM is 0.798 which best performance. is slightly higher. It is fascinating that a simple method like Using plain tokens to keep the original words and their SVM can outperform a complex modern text classifier when forms and reducing noise with stopword filtering yielded the using the right feature set. This shows that traditional, more best results in general. With some classifiers, adding extra simple models should not be underestimated as with correct information in the form of POS tags also proved useful. For preparations, they can achieve a similar performance as state- convolutional neural networks, using dependency based in- of-the art models and require much less computational power. formation showed potential and their effect needs to be con- Possibly, the performance of pretrained language models like firmed in future research. RoBERTa could also be increased by feature engineering and Although the environmental effect of the method does not applying embeddings with linguistic information. This needs seem very significant here, one has to keep in mind that the to be explored further in future research. tested models were quite simple. Assuming that the method would work with other datasets and more resource intensive 4.4 Environmental Effect classifiers, the savings could be very significant. It could If the weaker feature sets were to be left out, the power sav- be useful to only run a subset of the experiments and iter- ings are approximately 35Wh calculated from Table 4 for ate around the most probable performance peak in order to training the SGD SVM classifier, which is not very much. find the maximum performance. But the classifier was very power efficient to train to begin. In the near future we will also confirm the potential of A more impressive result can be seen with CNN, where the linguistically-backed preprocessing and Feature Density for power savings are approximately 21kWh, which is consider- other applications and languages. The research further sug- ably more compared to SVM. gests that adding linguistic preprocessing can improve the In order demonstrate the environmental effect of the performance of classifiers, which needs to be also confirmed method, we will look at the CNN model and its power savings on current state of the art language models. (21kWh). According to European Environmental Agency (EEA) 1 , the average CO2 emissions of electricity generation References was 275 g CO2 e/kWh in 2019. Thus the greenhouse gases emitted during the training of CNN could be estimated to be [Basavanhally et al., 2010] A. Basavanhally, S. Doyle, and 5.8 kg CO2 e. For comparison, the average new passenger A. Madabhushi. Predicting classifier performance with a small training set: Applications to computer-aided diagno- 1 https://www.eea.europa.eu/ sis and prognosis. In 2010 IEEE International Symposium 10 on Biomedical Imaging: From Nano to Macro, pages 229– [Johnson et al., 2018] Mark Johnson, Peter Anderson, Mark 232, 2010. Dras, and Mark Steedman. Predicting accuracy on large [Basavanhally et al., 2015] Ajay Basavanhally, Satish datasets from smaller pilot data. In Proceedings of the 56th Viswanath, and Anant Madabhushi. Predicting classifier Annual Meeting of the Association for Computational Lin- performance with limited training data: Applications to guistics (Volume 2: Short Papers), pages 450–455, Mel- computer-aided diagnosis in breast and prostate cancer. bourne, Australia, July 2018. Association for Computa- PLOS ONE, 10(5):1–18, 05 2015. tional Linguistics. [Bensusan and Kalousis, 2001] Hilan Bensusan and Alexan- [King et al., 1995] R. D. King, C. Feng, and A. Suther- dros Kalousis. Estimating the predictive accuracy of a land. Statlog: Comparison of classification algorithms on classifier. In Luc De Raedt and Peter Flach, editors, Ma- large real-world problems. Applied Artificial Intelligence, chine Learning: ECML 2001, pages 25–36, Berlin, Hei- 9(3):289–333, 1995. delberg, 2001. Springer Berlin Heidelberg. [Komninos and Manandhar, 2016] Alexandros Komninos [Blachnik, 2017] Marcin Blachnik. Instance selection for and Suresh Manandhar. Dependency based embeddings classifier performance estimation in meta learning. En- for sentence classification tasks. In Proceedings of the tropy, 19:583, 11 2017. 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human [Breiman, 2001] Leo Breiman. Random forests. Machine Language Technologies, pages 1490–1500, San Diego, Learning, 45(1):5–32, Oct 2001. California, June 2016. Association for Computational [Bull, 2010] Glen Bull. The always-connected genera- Linguistics. tion. Learning and Leading with Technology, 38:28–29, [LeCun et al., 2012] Yann LeCun, Leon Bottou, Genevieve November 2010. Orr., and Klaus-Robert Muller. Efficient BackProp, pages [Chawla et al., 2002] Nitesh Chawla, Kevin Bowyer, 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg, Lawrence Hall, and Philip Kegelmeyer. Smote: Synthetic 2012. minority over-sampling technique. J. Artif. Int. Res., [Levy and Goldberg, 2014] Omer Levy and Yoav Goldberg. 16(1):321–357, June 2002. Dependency-based word embeddings. In ACL, 2014. [Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. [Liu et al., 2019] Yinhan Liu, Myle Ott, Naman Goyal, Xgboost: A scalable tree boosting system. CoRR, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike abs/1603.02754, 2016. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: [Cortes and Vapnik, 1995] Corinna Cortes and Vladimir A robustly optimized bert pretraining approach. arXiv Vapnik. Support-vector networks. Machine Learning, preprint arXiv:1907.11692, 2019. 20(3):273–297, Sep 1995. [Michie et al., 1995] Donald Michie, D. J. Spiegelhalter, [Cotterell and Schütze, 2019] Ryan Cotterell and Hinrich C. C. Taylor, and John Campbell, editors. Machine Learn- Schütze. Morphological word embeddings. CoRR, ing, Neural and Statistical Classification. Ellis Horwood, abs/1907.02423, 2019. USA, 1995. [Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Ken- [Ptaszynski and Masui, 2018] Michal Ptaszynski and Fu- ton Lee, and Kristina Toutanova. Bert: Pre-training of mito Masui. Automatic Cyberbullying Detection: Emerg- deep bidirectional transformers for language understand- ing Research and Opportunities. IGI Global, 2018. ing, 2018. [Ptaszynski et al., 2017] Michal Ptaszynski, Juuso [Freund and Schapire, 1997] Yoav Freund and Robert E Kalevi Kristian Eronen, and Fumito Masui. Learn- Schapire. A decision-theoretic generalization of on-line ing deep on cyberbullying is always better than brute learning and an application to boosting. Journal of Com- force. In LaCATODA 2017 CEUR Workshop Proceedings, puter and System Sciences, 55(1):119 – 139, 1997. page 3–10, 2017. [Gama and Brazdil, 1995] J. Gama and P. Brazdil. Charac- [Ptaszynski et al., 2018] Michał Ptaszynski, Gniewosz terization of classification algorithms. In Carlos Pinto- Leliwa, Mateusz Piech, and Aleksander Smywiński-Pohl. Ferreira and Nuno J. Mamede, editors, Progress in Ar- Cyberbullying detection – technical report 2/2018, depart- tificial Intelligence, pages 189–200, Berlin, Heidelberg, ment of computer science agh, university of science and 1995. Springer Berlin Heidelberg. technology, 2018. [Hinduja and Patchin, 2010] Sameer Hinduja and Justin [Reynolds et al., 2011] Kelly Reynolds, April Edwards, and Patchin. Bullying, cyberbullying, and suicide. Archives Lynne Edwards. Using machine learning to detect cyber- of suicide research : official journal of the International bullying. Proceedings - 10th International Conference on Academy for Suicide Research, 14:206–21, 07 2010. Machine Learning and Applications, ICMLA 2011, 2, 12 [Hinton et al., 2012] Geoffrey E. Hinton, Nitish Srivastava, 2011. Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut- [Scherer et al., 2010] Dominik Scherer, Andread Müller, dinov. Improving neural networks by preventing co- and Sven Behnke. Evaluation of pooling operations in con- adaptation of feature detectors. CoRR, abs/1207.0580, volutional architectures for object recognition. In ICANN 2012. 2010 Proceedings, Part III, pages 92–101, 01 2010. 11 [Schwartz et al., 2019] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. CoRR, abs/1907.10597, 2019. [Strubell et al., 2019] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. CoRR, abs/1906.02243, 2019. [Ure, 1971] J Ure. Lexical density and register differentia- tion. Applications of Linguistics, page 443–452, 1971. 12 Table 5: F1 for all preprocessings & classifiers; best classifier for each dataset in bold; best preprocessing for each underlined RandomForest Linear SVM NaiveBayes Newton LR LBFGS LR SGD SVM AdaBoost XGBoost CNN1 CNN2 KNN MLP CHNK 0.727 0.726 0.718 0.736 0.57 0.674 0.613 0.649 0.667 0.724 0.657 0.666 CHNKNERR 0.688 0.695 0.702 0.699 0.58 0.653 0.603 0.608 0.642 0.704 0.645 0.662 CHNKNERRALPHA 0.66 0.663 0.651 0.657 0.603 0.626 0.616 0.599 0.653 0.674 0.566 0.6 CHNKNERRSTOP 0.686 0.684 0.684 0.694 0.577 0.629 0.635 0.621 0.652 0.693 0.402 0.344 CHNKNERRSTOPALPHA 0.618 0.617 0.591 0.607 0.404 0.598 0.62 0.582 0.648 0.623 0.451 0.34 CHNKNER 0.718 0.723 0.721 0.737 0.582 0.669 0.603 0.63 0.673 0.722 0.654 0.642 CHNKNERALPHA 0.675 0.676 0.663 0.663 0.599 0.641 0.618 0.609 0.649 0.684 0.557 0.614 CHNKNERSTOP 0.724 0.724 0.715 0.724 0.582 0.663 0.635 0.652 0.679 0.72 0.501 0.298 CHNKNERSTOPALPHA 0.666 0.661 0.644 0.668 0.386 0.615 0.659 0.625 0.656 0.647 0.431 0.406 CHNKALPHA 0.684 0.681 0.669 0.683 0.607 0.643 0.647 0.616 0.676 0.695 0.587 0.583 CHNKSTOP 0.722 0.721 0.711 0.723 0.577 0.67 0.667 0.648 0.679 0.715 0.386 0.342 CHNKSTOPALPHA 0.629 0.637 0.606 0.619 0.395 0.608 0.649 0.654 0.664 0.628 0.455 0.374 DEP 0.617 0.619 0.568 0.587 0.243 0.617 0.536 0.566 0.598 0.594 0.682 0.694 DEPNERR 0.61 0.614 0.571 0.587 0.241 0.611 0.533 0.562 0.596 0.595 0.67 0.695 DEPNERRALPHA 0.606 0.605 0.589 0.602 0.312 0.596 0.537 0.556 0.595 0.593 0.585 0.622 DEPNERRSTOP 0.602 0.599 0.564 0.568 0.273 0.615 0.543 0.572 0.6 0.578 0.726 0.702 DEPNERRSTOPALPHA 0.584 0.584 0.56 0.581 0.386 0.599 0.544 0.561 0.595 0.574 0.583 0.619 DEPNER 0.624 0.621 0.574 0.585 0.242 0.611 0.528 0.564 0.595 0.592 0.686 0.692 DEPNERALPHA 0.585 0.589 0.561 0.579 0.213 0.607 0.578 0.497 0.593 0.603 0.606 0.623 DEPNERSTOP 0.611 0.602 0.564 0.576 0.274 0.604 0.527 0.563 0.604 0.577 0.725 0.708 DEPNERSTOPALPHA 0.535 0.531 0.523 0.523 0.297 0.543 0.563 0.422 0.576 0.564 0.63 0.632 DEPALPHA 0.609 0.612 0.588 0.601 0.314 0.6 0.545 0.552 0.604 0.598 0.606 0.62 DEPSTOP 0.606 0.595 0.562 0.571 0.276 0.616 0.544 0.576 0.603 0.584 0.741 0.648 DEPSTOPALPHA 0.586 0.587 0.564 0.588 0.388 0.594 0.539 0.568 0.595 0.578 0.629 0.625 LEM 0.781 0.786 0.784 0.79 0.634 0.715 0.724 0.72 0.744 0.786 0.67 0.665 LEMNERR 0.74 0.737 0.742 0.74 0.601 0.692 0.697 0.683 0.724 0.749 0.658 0.663 LEMNERRALPHA 0.729 0.728 0.725 0.725 0.614 0.685 0.699 0.68 0.71 0.74 0.645 0.652 LEMNERRSTOP 0.737 0.734 0.726 0.732 0.609 0.682 0.727 0.69 0.72 0.741 0.371 0.364 LEMNERRSTOPALPHA 0.732 0.732 0.714 0.727 0.624 0.674 0.723 0.682 0.704 0.737 0.372 0.348 LEMPOSS 0.764 0.765 0.769 0.767 0.564 0.713 0.658 0.679 0.717 0.773 0.662 0.736 LEMPOSSALPHA 0.76 0.758 0.753 0.758 0.406 0.705 0.669 0.674 0.712 0.756 0.603 0.715 LEMPOSSSTOP 0.763 0.766 0.767 0.774 0.566 0.709 0.706 0.691 0.72 0.773 0.683 0.725 LEMPOSSSTOPALPHA 0.762 0.766 0.748 0.765 0.49 0.702 0.713 0.681 0.714 0.757 0.593 0.716 LEMNER 0.784 0.782 0.787 0.792 0.631 0.71 0.716 0.72 0.742 0.78 0.68 0.613 LEMNERALPHA 0.763 0.764 0.765 0.767 0.637 0.699 0.71 0.707 0.742 0.768 0.662 0.671 LEMNERSTOP 0.782 0.783 0.782 0.792 0.634 0.706 0.745 0.725 0.742 0.78 0.429 0.378 LEMNERSTOPALPHA 0.77 0.767 0.752 0.767 0.64 0.693 0.739 0.716 0.738 0.768 0.46 0.414 LEMPOS 0.778 0.778 0.788 0.79 0.517 0.711 0.663 0.727 0.741 0.783 0.665 0.64 LEMPOSALPHA 0.768 0.772 0.772 0.768 0.522 0.7 0.654 0.713 0.727 0.775 0.664 0.695 LEMPOSSTOP 0.78 0.781 0.788 0.788 0.642 0.708 0.708 0.721 0.735 0.783 0.715 0.707 LEMPOSSTOPALPHA 0.77 0.769 0.766 0.768 0.669 0.696 0.718 0.722 0.73 0.778 0.669 0.698 LEMALPHA 0.755 0.764 0.745 0.765 0.294 0.703 0.718 0.705 0.748 0.754 0.61 0.651 LEMSTOP 0.787 0.786 0.784 0.791 0.641 0.713 0.754 0.732 0.752 0.789 0.403 0.327 LEMSTOPALPHA 0.772 0.766 0.766 0.773 0.357 0.702 0.747 0.712 0.745 0.764 0.377 0.329 POSS 0.487 0.487 0.488 0.491 0.522 0.498 0.556 0.509 0.555 0.488 0.54 0.536 POSSALPHA 0.488 0.486 0.488 0.498 0.526 0.498 0.552 0.518 0.549 0.493 0.538 0.534 POSSSTOP 0.477 0.477 0.471 0.467 0.518 0.486 0.54 0.496 0.533 0.484 0.431 0.434 POSSSTOPALPHA 0.469 0.47 0.471 0.465 0.517 0.478 0.525 0.484 0.511 0.491 0.428 0.484 TOK 0.793 0.788 0.793 0.796 0.632 0.716 0.711 0.728 0.748 0.796 0.659 0.661 TOKNERR 0.741 0.744 0.737 0.743 0.6 0.696 0.688 0.671 0.719 0.749 0.655 0.631 TOKNERRALPHA 0.734 0.735 0.735 0.73 0.624 0.683 0.681 0.674 0.704 0.748 0.626 0.655 TOKNERRSTOP 0.736 0.736 0.728 0.732 0.609 0.68 0.73 0.678 0.71 0.751 0.406 0.317 TOKNERRSTOPALPHA 0.728 0.731 0.727 0.723 0.623 0.675 0.721 0.68 0.698 0.744 0.412 0.394 TOKPOSS 0.766 0.768 0.767 0.783 0.549 0.715 0.648 0.671 0.715 0.773 0.686 0.729 TOKPOSSALPHA 0.765 0.761 0.763 0.767 0.378 0.709 0.662 0.656 0.709 0.769 0.643 0.658 TOKPOSSSTOP 0.763 0.765 0.767 0.773 0.563 0.704 0.703 0.684 0.724 0.771 0.675 0.722 TOKPOSSSTOPALPHA 0.774 0.773 0.774 0.771 0.671 0.694 0.722 0.713 0.73 0.779 0.68 0.698 TOKNER 0.789 0.785 0.788 0.789 0.609 0.708 0.703 0.722 0.745 0.784 0.684 0.68 TOKNERALPHA 0.768 0.771 0.763 0.776 0.628 0.696 0.701 0.705 0.746 0.775 0.649 0.648 TOKNERSTOP 0.785 0.791 0.79 0.79 0.635 0.703 0.732 0.721 0.743 0.79 0.444 0.367 TOKNERSTOPALPHA 0.773 0.771 0.762 0.774 0.646 0.691 0.737 0.704 0.74 0.771 0.371 0.379 TOKPOS 0.781 0.783 0.791 0.798 0.565 0.713 0.656 0.72 0.739 0.787 0.626 0.705 TOKPOSALPHA 0.775 0.775 0.778 0.784 0.576 0.699 0.653 0.705 0.731 0.783 0.633 0.698 TOKPOSSTOP 0.786 0.783 0.794 0.792 0.645 0.7 0.711 0.733 0.739 0.789 0.706 0.691 TOKPOSSTOPALPHA 0.759 0.766 0.756 0.762 0.458 0.696 0.706 0.679 0.674 0.601 0.734 0.718 TOKALPHA 0.768 0.768 0.757 0.773 0.271 0.705 0.721 0.705 0.742 0.756 0.643 0.652 TOKSTOP 0.793 0.79 0.784 0.794 0.644 0.708 0.758 0.736 0.749 0.787 0.355 0.321 TOKSTOPALPHA 0.775 0.776 0.766 0.776 0.342 0.7 0.745 0.714 0.744 0.765 0.452 0.425 13 (a) Alphabetic filtering (red) vs others (blue) (b) Alphabetic filtering (red) vs others (blue) (c) NER (red) vs others (blue) (d) NER (red) vs others (blue) (e) POS (red) vs others (blue) (f) POS (red) vs others (blue) (g) Stopword filtering (red) vs others (blue) (h) Stopword filtering (red) vs others (blue) (i) TOK (red), LEM (green), CHNK (yellow), DEP (blue) (j) TOK (red), LEM (green), CHNK (yellow), DEP (blue) Figure 1: FD & F1 score for SGD SVM (left) and CNN1 (right) 14