Hybrid AdaBoost and Naïve Bayes Classifier for Supervised Learning

Ahiya Ahammed, Balazs Harangi, Andras Hajdu
University of Debrecen, Faculty of Informatics
ahiya.evan@gmail.com, harangi.balazs@inf.unideb.hu, hajdu.andras@inf.unideb.hu

Proceedings of the 1st Conference on Information Technology and Data Science, Debrecen, Hungary, November 6–8, 2020, published at http://ceur-ws.org

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Supervised learning is a machine learning task that maps an input to an output based on available data. The data set contains the information from which we learn how to classify at least some of the data correctly; this classified portion is called the training set. In this learning method, the supervision comes from the labelled instances in the training set. Classification problems are usually regarded as a branch of supervised learning: classification is a machine learning task that predicts the correct class labels for all unlabelled instances. In this work, we focus on two of the most commonly used data classification methods, the Naïve Bayes (NB) classifier and Boosting, specifically Adaptive Boosting (AdaBoost). To improve the accuracy of these classifiers, we introduce two new hybrid approaches for classification on data sets that are large, noisy and high dimensional. In real life, data sets usually contain noise or outliers, contradictory instances or missing values, mostly introduced during data collection or generation. To address this issue, we propose two new hybrid classifiers. Our first hybrid approach is the ADA+NB classifier, in which the Adaptive Boosting (AdaBoost) classifier is used to find the comparatively more important attribute subsets before the class conditional independence assumption of the Naïve Bayes (NB) classifier is applied [13]. Conversely, since the NB classifier assumes class conditional independence and can therefore multiply the probabilities of independent events, it can be very effective at removing examples from the training set before the decision trees (DT) are generated when the AdaBoost model is built; we call this process our second proposed hybrid classifier, NB+ADA [14]. This paper compares the two classical machine learning approaches with the two new hybrid classifiers in terms of accuracy rate, error rate, precision, F-score, sensitivity and specificity on four real, high-dimensional and noisy benchmark data sets chosen from the UCI (University of California, Irvine) machine learning repository. For instance, the Adult data set is one of the well-known noisy data sets available at UCI; on it, the AdaBoost classifier achieved an accuracy of 87.65% and the NB classifier 79.99%, whereas our first proposed classifier (ADA+NB) reached 86.39% and our second proposed classifier (NB+ADA) 94.14%. Similarly, we used the other data sets to derive performance comparisons between the classifiers and to show that our proposed classifiers outperform the textbook classifiers.
Keywords: Supervised learning, classification, hybrid classifier, decision tree, adaptive boosting, Naïve Bayes classifier, performance comparison

1. Introduction

This research introduces two new hybrid classifiers, named ADA+NB and NB+ADA, and compares them with two existing machine learning approaches, the AdaBoost and Naïve Bayes classifiers, with the goal of achieving the highest possible accuracy rate. An increasing number of researchers use classical machine learning algorithms to solve regression, classification and clustering problems. Classification problems are among the most widely studied and offer plenty of new research directions in data science. Classification is applied in a variety of problem domains such as multimedia, text, medical or biological data and social networking. These domains use various data classification methods such as neural networks, decision trees, probabilistic or rule-based classification, Bagging, Boosting, Support Vector Machines (SVM) and other instance-based techniques.

This paper consists of two major parts: a comparative analysis of the machine learning models, and the evidence for choosing our proposed classifiers over the two classical machine learning classifiers with respect to accuracy, especially when the training data is very noisy and high dimensional.

The first part, the comparative analysis of the machine learning models, compares the classical machine learning algorithms Adaptive Boosting (AdaBoost) and Naïve Bayes (NB) with our two proposed hybrid algorithms. The first hybrid classifier is denoted ADA+NB because it starts with AdaBoost induction and then classifies with the NB classifier; the second is denoted NB+ADA because it starts with NB induction and then classifies according to the AdaBoost model. The second part, the evidence for choosing our proposed algorithms, compares the proposed algorithms with the classical Naïve Bayes (NB) and Adaptive Boosting (AdaBoost) classifiers and shows that when the data sets are noisy, contain outliers, or are high dimensional, the proposed hybrid classifiers perform well in terms of accuracy rate, error rate, precision, F1-score, sensitivity/recall and specificity.

In general, a classifier's performance depends on the quality of the data. If low-quality data are used as the training set, the resulting model may suffer from over-fitting. For instance, although AdaBoost is a strong classifier, if it is built from training data containing noise or outliers, its decision trees (DTs) may over-fit during tree generation, and despite its heavy computation AdaBoost then fails to provide the expected accuracy rate [3]. It is therefore important to pre-process the data set before it is used for training. Data pre-processing can improve data quality and thus play a large part in enhancing the efficiency, and potentially the accuracy, of the classifier model. According to [14], several pre-processing techniques are available:

1. Data cleaning: removing noise, outliers or missing values.
   (a) Outliers or anomalies: handled by binning, regression or clustering methods.
   (b) Missing values: ignoring the instance or filling in the missing values (using the mean value or manually).
2. Integration: merging data from different sources.
3. Transformation: normalization of the data, proper selection of attributes, discretization of the data.
4. Reduction: data cube aggregation, attribute subset selection, dimensionality reduction.

2. Related Work

2.1. Adaptive Boosting (AdaBoost) Classifier

During the boosting process described in [1], a weight is assigned to each training example, and the weights are adjusted at the end of each cycle: incorrectly classified examples receive increased weights, whereas correctly classified ones receive decreased weights. In consecutive iterations, AdaBoost is thereby forced to concentrate on the examples that are hard to classify. The paper [13] discusses an AdaBoost classifier for multi-class biological data that takes noise reduction, over-fitting and class imbalance into account. To prevent over-fitting, the classifier uses a random subspace approach, and in the spirit of boosting it places more emphasis on the incorrectly classified instances in the next round or iteration.

Adaptive boosting was first introduced in 1997 by Freund and Schapire [16]. Over-fitting is the main disadvantage of AdaBoost: when the final classifier becomes too complicated, the test error increases [4, 10, 22]. The AdaBoost test error has also been observed to keep decreasing, without rising again, after the training error reaches 0 over many rounds; however, after an extremely large number of rounds the test error was found to increase marginally [17]. Boosting, along with more developed classifiers such as the Decision Tree (DT), Neural Network (NN), Support Vector Machine (SVM) and Naïve Bayes (NB) classifier, has become one of the alternative mechanisms for classification [6]. According to [23], AdaBoost's popularity can be attributed to its ability to stretch the margin, which may increase the classifier's generalization capability. Several experiments have been reported that use Decision Trees (DTs) [9], Neural Networks [24] or Support Vector Machines (SVM) [19] as component classifiers or learning schemes in the Adaptive Boosting (AdaBoost) classifier. These experiments indicate strong generalization of such AdaBoost classifiers, although some difficulties remain: what is the appropriate size of the tree when Decision Trees (DTs) are used as the learning scheme, and how should the computational difficulties be managed when Neural Networks (NNs) are used? Both of these questions need to be handled carefully before using the AdaBoost classifier in order to produce good results.

2.2. Decision Tree (DT) Classifier

Decision tree (DT) construction is a simple and useful approach for classifying instances in large data sets with a large number of variables. During DT generation, two widely known steps are followed (see the sketch after this list):

• The decision tree is grown so that the examples available in the training set are classified correctly.
• To obtain a higher classification rate (precision or accuracy), redundant nodes and branches are eliminated during pruning.
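To make these two steps concrete, the short sketch below grows a scikit-learn decision tree and then refits it with cost-complexity pruning. The toy data set (load_breast_cancer), the ccp_alpha value and the use of scikit-learn (version 0.22 or later for pruning) are illustrative assumptions of ours, not part of the cited works.

```python
# Minimal sketch of the two DT-construction steps (grow, then prune),
# assuming scikit-learn; the toy data and ccp_alpha value are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: grow a full tree that classifies the training examples correctly.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: prune redundant nodes/branches (cost-complexity pruning) to
# improve accuracy on unseen data.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("full tree accuracy:  ", full_tree.score(X_test, y_test))
print("pruned tree accuracy:", pruned_tree.score(X_test, y_test))
```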
In [15] a new algorithm named Decision Tree Fast Splitting (DTFS) was proposed for constructing decision trees (DTs) on large data sets. DTFS uses its attribute-selection technique to expand tree nodes while processing the training examples incrementally. To avoid storing all the examples in the DT, DTFS keeps at most N instances in a leaf node; the leaf node is updated until the number of instances stored in it reaches this maximum.

A new approach to constructing a Decision Tree (DT) for classification problems using cluster-based analysis, named Classification based Clustering (CbC), was presented in [3]. Similarities between instances were observed using clustering algorithms with respect to specified target attributes, and the distribution of each cluster's target attribute was then evaluated. When the number of instances contained in a cluster reached a specific threshold value, all instances in that cluster were labelled according to the acceptable value of the target attribute.

In [21] a new classification scheme based on a Decision Tree classifier (the C4.5 algorithm) was recommended to increase the accuracy rate for multi-class problems; the authors introduced an approach in which their system builds M classifiers, each of which distinguishes one class from the others. A procedure was suggested in [2] to overcome possible exceptions in simple decision tree (DT) induction algorithms: in that work, the influential element of the attributes was identified, which reflects the reliability of the attribute value with respect to the target value, and the DT was pruned based on this aspect. In [8] an algorithm was designed specifically for the fuzzification of decision boundaries without translating attributes into fuzzy linguistic terminology; the G-FDT tree uses the Gini index to select a suitable attribute for each node to be split, and the split points are selected from the average points of the feature values at which the class information changes.

2.3. Naïve Bayes (NB) Classifier

The Naïve Bayes (NB) classifier is one of the most popular methods in data science and machine learning, especially for classification problems. A new classifier named Hidden Naïve Bayesian (HNB) was introduced by the authors of [5] for detecting network intrusions; it greatly increased the attack detection accuracy for Denial of Service (DoS) attacks. The basic idea of HNB is to add another layer that represents a hidden parent for each individual feature. In terms of higher accuracy rate and lower error rate, the HNB multi-class classifier demonstrated comparatively better performance [20, 25]. In [7] a Robust Naïve Bayesian Classifier (RNBC) was suggested to address two fundamental limitations, over-fitting and underflow, in the classification of gene expression data sets. Instead of multiplying probabilities, RNBC uses an approximation method to alleviate over-fitting and logarithms of probabilities to deal with underflow. For the Naïve Bayesian classification of micro-array data, a new approach called Partition Conditional Independent Component Analysis (PC-ICA) was suggested by the authors of [11].
They built on a Class Conditional Independent Component Analysis (CC-ICA) approach. In order to perform Independent Component Analysis (ICA) within each partition, PC-ICA splits the small-sized data samples into various partitions; inside each partition, which may contain several classes, PC-ICA then performs ICA-based feature extraction. A newly proposed classification system for multi-class data classification called Extended Naïve Bayes (ENB) was suggested in [18]. The mixed data contained both categorical and numeric attributes. To compute the probabilities of categorical features, ENB used the classical Naïve Bayes (NB) approach, while numeric attributes were discretized into symbols using a mathematical principle based on the mean and variance of the numeric values.

3. Classification

In machine learning, classification derives a function from the training set. The training data set consists of input vectors (or scalars) and target output vectors (or scalars), i.e. class labels. The result of the function may be a constant value, or the target values of new input vectors may be predicted. Classification is sometimes also referred to as supervised learning since the class labels are known before the data are analysed.

3.1. Adaptive Boosting (AdaBoost) Classifier

AdaBoost is a commonly used boosting algorithm that trains a number of classifiers and combines their votes to classify an unknown or known instance.

• Weights are allocated to each training example at the time of boosting.
• A sequence of k classifiers is trained iteratively.
• After a classification model M_i is learned, the weights are modified to help the following classifier, M_{i+1}, give more emphasis to the instances incorrectly classified by M_i.

After following the above steps, Adaptive Boosting (AdaBoost) produces a final boosted classifier, M*, which combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy rate. Let the original training data set D be given as a set of d class-labelled instances (x_1, y_1), (x_2, y_2), ..., (x_d, y_d), where y_i is the class label of instance x_i. Initially, each training example is assigned a weight of 1/d. Generating the k classifiers of the ensemble requires k rounds of the rest of the algorithm. In each round, a training set D_i is sampled from D (it does not necessarily have to be of size d). Sampling with replacement is used, so the same example may be chosen more than once, and the probability of each instance being chosen depends on its weight. A classifier model M_i is then generated from the training instances in D_i, and its error is measured using D_i as a test set. The weights of the instances are modified according to how they were classified: if an instance is misclassified, its weight is increased, and if it is properly classified, its weight is reduced. An instance's weight thus reflects how difficult it is to classify, with higher weights indicating harder instances. The weights are used to select the instances that will participate in the next iteration for generating the next classifier.

Error Rate: To measure the error rate of the model M_i, we sum the weights of the individual instances in D_i that were not correctly classified by M_i:
\[ \mathrm{Error}(M_i) = \sum_{j=1}^{d} \mathrm{error}(x_j) \cdot w_j \]

Here, error(x_j) denotes the misclassification error of instance x_j: if the instance x_j is correctly classified by the classifier, then error(x_j) is 0; otherwise it is 1. A threshold value of 0.5 is used to judge the classifier's performance: if the error is greater than this threshold, the model is not accepted and the same process must be repeated.

Normalizing Weight: If an instance was correctly classified by the classifier in the i-th iteration, its weight is multiplied by

\[ \frac{\mathrm{error}(M_i)}{1 - \mathrm{error}(M_i)}. \]

According to [12], once the weights of the correctly classified instances have been updated, the weights of all instances are normalized: each weight is multiplied by the sum of the old weights divided by the sum of the new weights.

Algorithm 1: AdaBoost Classifier
Input: Training data D, number of iterations k, and a learning scheme.
Output: Ensemble model M*.
Method:
 1. Initialize the weight of each x_i ∈ D to 1/d.
 2. for i = 1 to k do
 3.   Sample D with replacement according to the instance weights to obtain D_i.
 4.   Use D_i and the learning scheme to derive a model M_i.
 5.   Compute error(M_i).
 6.   if error(M_i) ≥ 0.5 then
 7.     Go back to step 3 and try again.
 8.   end if
 9.   for each correctly classified x_i ∈ D do
10.     Multiply the weight of x_i by error(M_i)/(1 − error(M_i)).
11.   end for
12.   Normalize the weights of the instances.
13. end for

To use M* to classify a new instance, x_new:
• Initialize the weight of each class to zero.
• for i = 1 to k do
•   w_i = log((1 − error(M_i))/error(M_i))  // weight of the classifier's vote
•   c = M_i(x_new)  // class prediction by M_i
•   Add w_i to the weight of class c.
• end for
• Return the class with the largest weight.

Figure 1. Adaptive Boosting (AdaBoost) classifier.
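As a concrete reading of Algorithm 1, the following sketch implements the training loop and the weighted vote in Python, with scikit-learn decision stumps as the learning scheme. The stump depth, the number of rounds, the NumPy random generator and the choice to measure the weighted error on the whole training set D instead of on the sample D_i are our own simplifications, not the authors' exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_train(X, y, k=10, seed=0):
    """Sketch of Algorithm 1: returns the ensemble M* as (model, vote weight) pairs."""
    rng = np.random.default_rng(seed)
    d = len(X)
    w = np.full(d, 1.0 / d)                    # step 1: initialise every weight to 1/d
    ensemble = []
    for _ in range(k):
        while True:                            # steps 3-8: resample until error(Mi) < 0.5
            idx = rng.choice(d, size=d, replace=True, p=w)
            model = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
            miss = model.predict(X) != y
            err = float(np.sum(w[miss]))       # step 5: weighted error of the round
            if err < 0.5:
                break
        err = max(err, 1e-10)                  # numerical safeguard (not in the pseudocode)
        w[~miss] *= err / (1.0 - err)          # steps 9-11: shrink correctly classified weights
        w /= w.sum()                           # step 12: normalise the weights
        ensemble.append((model, np.log((1.0 - err) / err)))  # classifier's vote weight
    return ensemble


def adaboost_predict(ensemble, X):
    """Classify new instances by the weighted vote of the ensemble members."""
    classes = np.unique(np.concatenate([m.classes_ for m, _ in ensemble]))
    scores = np.zeros((len(X), len(classes)))
    for model, vote in ensemble:
        pred = model.predict(X)
        for ci, c in enumerate(classes):
            scores[pred == c, ci] += vote      # add the vote weight to the predicted class
    return classes[np.argmax(scores, axis=1)]
```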
3.2. Naïve Bayesian (NB) Classifier

The Naïve Bayesian (NB) classifier is a fundamental probabilistic classifier based on Bayes' theorem with a strong (naïve) assumption of independence between the features. When determining the membership probability of each class, the NB classifier handles attributes with missing values simply by omitting the corresponding probabilities. Class conditional independence is also assumed: the effect of an attribute value on a given class is independent of the values of the other attributes. The NB classifier assigns a training example X to class C_i if and only if P(C_i|X) > P(C_j|X) for all j ≠ i with 1 ≤ j ≤ m; that is, X is assigned to the class C_i with the highest posterior probability, so P(C_i|X) is maximized:

\[ P(C_i|X) = \frac{P(X|C_i) \times P(C_i)}{P(X)} \tag{3.1} \]

Since P(X) is constant for all classes in equation (3.1), only P(X|C_i) × P(C_i) has to be maximized. If the class prior probabilities are not specified, the classes are generally assumed to be equally likely, i.e. P(C_1) = P(C_2) = ... = P(C_m), and then only P(X|C_i) needs to be maximized; otherwise P(X|C_i) × P(C_i) is maximized. The class prior probabilities are determined by P(C_i) = |C_{i,D}|/|D|, where |C_{i,D}| is the number of training instances in D that belong to class C_i. Computing P(X|C_i) directly is computationally costly for data sets with many attributes; therefore, given the class labels, the attributes are assumed to be conditionally independent of each other. Equation (3.2) is then used to compute P(X|C_i):

\[ P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) \tag{3.2} \]

In equation (3.2), x_k denotes the value of attribute A_k for instance X. The data sets used for training may contain numeric (continuous) as well as nominal (categorical) attributes. Continuous attributes are usually assumed to follow a Gaussian distribution with mean μ and standard deviation σ, represented by the following equations:

\[ P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) \tag{3.3} \]

\[ g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

In equation (3.3), μ_{C_i} and σ_{C_i} denote the mean and standard deviation of attribute A_k over the training examples of class C_i. P(x_k|C_i) can then be calculated by substituting these values into g(x, μ, σ).

Algorithm 2: Naïve Bayesian Classifier
Input: Training data D = {x_1, x_2, ..., x_n}.
Output: A naïve Bayes model.
Method:
 1. for each class C_i ∈ D do
 2.   Find the prior probabilities P(C_i).
 3. end for
 4. for each attribute A_i ∈ D do
 5.   for each attribute value A_ij ∈ A_i do
 6.     Find the class conditional probabilities P(A_ij|C_i).
 7.   end for
 8. end for
 9. for each instance x_i ∈ D do
10.   Find the posterior probabilities P(C_i|x_i).
11. end for

Figure 2. Naïve Bayesian (NB) classifier.

3.3. Proposed Algorithm 1 (ADA+NB) Classifier

Our proposed ADA+NB classifier is a new, hybrid and very powerful classifier for classification problems. It is made up of two strong classifiers: a Boosting algorithm, namely Adaptive Boosting (AdaBoost) with decision trees as its learning scheme, and the Naïve Bayes (NB) classifier. It performs well when the data set is very noisy and high dimensional. We named this classifier ADA+NB because it uses the AdaBoost classifier to detect the noisy instances or missing values in the data, exploiting AdaBoost's remarkable ability to find the subset of attributes that are comparatively more important. At the beginning, equal weights are assigned to all training instances, and the instances misclassified by AdaBoost receive higher weights. We then delete the noisy (high-weight) instances from the original training data, use the remaining data for training, and let the Naïve Bayes (NB) classifier make the predictions. If AdaBoost itself were trained on the noisy training set, there would be a high probability of over-fitting during decision tree (DT) generation. A sketch of this procedure follows Figure 3.

Algorithm 3: ADA+NB Classifier
Input: Training data D = {x_1, x_2, ..., x_n}.
Output: An ADA+NB model.
Method:
1. Assign initial weights of 1/d to all instances in D.
2. Split the data set D into 70% DTrain for training and 30% DTest for testing.
3. Use DTrain to build an AdaBoost model using Algorithm 1 and test each instance xTest from DTest.
4. Increase the weights of the misclassified instances in the original data D as follows: weight = weight * 1/2 * (error/accuracy).
5. Delete the instances with higher weights from the original data set D.
6. Use the new cleaned data set DTrainNew to build a new NB model using Algorithm 2 and calculate its performance.

Figure 3. ADA+NB classifier.
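The sketch below mirrors Algorithm 3 under our own assumptions: scikit-learn's AdaBoostClassifier stands in for Algorithm 1, the features and labels are assumed to be NumPy arrays, and every training instance misclassified by AdaBoost is treated as a "high-weight" instance to delete, which is a cruder filter than the weight-update rule of step 4.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def ada_nb(X, y, random_state=0):
    """Sketch of the ADA+NB hybrid: AdaBoost flags noisy instances, NB makes the predictions."""
    # Step 2: split the original data 70% / 30%.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=random_state)
    # Step 3: build an AdaBoost model (its default weak learner is a depth-1 decision tree).
    ada = AdaBoostClassifier(n_estimators=50, random_state=random_state).fit(X_tr, y_tr)
    # Steps 4-5: treat the training instances AdaBoost still misclassifies as the noisy,
    # high-weight ones and delete them -- a simplification of the weight rule.
    keep = ada.predict(X_tr) == y_tr
    # Step 6: train Naive Bayes on the cleaned data and report its test performance.
    nb = GaussianNB().fit(X_tr[keep], y_tr[keep])
    return nb.score(X_te, y_te)
```

For example, calling ada_nb(X, y) on a data set loaded into NumPy arrays returns the hold-out accuracy of the NB model trained on the cleaned data.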
3.4. Proposed Algorithm 2 (NB+ADA) Classifier

Our proposed NB+ADA classifier is another new, hybrid and powerful classifier for classification problems. It is made up of two strong classifiers: the Naïve Bayes (NB) classifier and a powerful Boosting algorithm, namely the Adaptive Boosting (AdaBoost) classifier, in which decision trees are used as the weak learning scheme during AdaBoost induction. It performs well when the data set is very noisy and high dimensional. We named it NB+ADA because the NB classifier is used to find the noisy instances or missing values in the data; NB can handle a missing value by simply ignoring it or by using the mean value. Initially, all instances are given equal weights, and the instances misclassified by the NB classifier receive higher weights. We then delete the instances with higher weights, which the NB classifier cannot handle, from the original data, assuming that they are anomalies. The comparatively cleaner data is then used as the training set for AdaBoost to build its model and make the predictions. Because fewer anomalies remain in the newly created training set, AdaBoost has a lower chance of over-fitting. A sketch of this procedure follows Figure 4.

Algorithm 4: NB+ADA Classifier
Input: Training data D = {x_1, x_2, ..., x_n}.
Output: A NB+ADA model.
Method:
1. Assign initial weights of 1/d to all instances in D.
2. Split the data set D into 70% DTrain for training and 30% DTest for testing.
3. Use DTrain to build an NB model using Algorithm 2 and test each instance xTest from DTest.
4. Increase the weights of the misclassified instances in the original data D as follows: weight = weight * 1/2 * (error/accuracy).
5. Delete the instances with higher weights from the original data set D.
6. Use the new cleaned data set DTrainNew to build a new AdaBoost model using Algorithm 1 and calculate its performance.

Figure 4. NB+ADA classifier.
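A matching sketch of Algorithm 4 under the same assumptions as the previous one, with the roles of the two classifiers swapped: Gaussian NB filters the suspected noisy instances and AdaBoost builds the final model on the cleaned training data.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def nb_ada(X, y, random_state=0):
    """Sketch of the NB+ADA hybrid: NB flags noisy instances, AdaBoost makes the predictions."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=random_state)  # step 2: 70/30 split
    nb = GaussianNB().fit(X_tr, y_tr)                                      # step 3: NB filter model
    keep = nb.predict(X_tr) == y_tr                                        # steps 4-5: drop misclassified
    ada = AdaBoostClassifier(n_estimators=50, random_state=random_state)   # step 6: AdaBoost on the
    return ada.fit(X_tr[keep], y_tr[keep]).score(X_te, y_te)               # cleaned training data
```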
3.5. Architecture Comparison

In this section we compare the architectural views of the four classifiers shown in Figure 5. Figure 5(a) shows the architecture of the Adaptive Boosting (AdaBoost) classifier, which uses Algorithm 1 to generate the AdaBoost model. Figure 5(b) shows the architecture of the Naïve Bayesian (NB) classifier, which uses Algorithm 2 to build the NB model, while Figures 5(c) and 5(d) illustrate the architectural models of our proposed classifier 1 (ADA+NB) and proposed classifier 2 (NB+ADA), which use Algorithm 3 and Algorithm 4, respectively. All the classifiers used in this research follow the same split-validation process, with 70% of the data used for training and 30% for testing the model. It should be noted that the more data is used to train a classifier's model, the better the classification performance that can be expected from it.

Figure 5. Classifiers' architectural comparison: (a) AdaBoost, (b) Naïve Bayes, (c) ADA+NB, (d) NB+ADA.

3.6. Experiment

This section describes the data sets used for the experiments and presents the performance evaluation of the Adaptive Boosting (AdaBoost), Naïve Bayesian (NB), proposed ADA+NB and proposed NB+ADA classifiers.

Datasets: We used four real benchmark data sets collected from the freely available UCI machine learning repository and evaluated the classification performance of the above mentioned classifiers on them. Each data set consists of a two-dimensional (2D) Excel spreadsheet. The data sets are the following:

• Adult Database
• Covertype Data
• Dota2 Games Results Data
• Isolet Database

Table 1. Data set description.

Data set              No. of Att.   Att. Types             No. of Instances   No. of Classes
Adult                 14            Categorical, Integer   48842              2
Covertype             54            Categorical, Integer   581012             7
Dota2 Games Results   116           Integer, Real          102944             2
Isolet                617           Real                   7797               26

Experimental Setup: All experiments were conducted on a machine with a 2.30 GHz Intel Core i3 processor and 4 GB of RAM. We implemented the AdaBoost and Naïve Bayesian (NB) classifiers as well as our two proposed hybrid algorithms (ADA+NB and NB+ADA) in Python 3.6. Several Python packages were used: Scikit-learn, NumPy, Pandas, Seaborn, Math, Itertools, Catplot and Matplotlib, the latter for plotting graphs. The software used to run the programs was Anaconda/Spyder, and Microsoft Excel was used to record the performance results. The following symbols are used in measuring the classifiers' performance:

Table 2. Meanings of the performance evaluation symbols.

Symbol        Symbolic Meaning
Accuracy      Percentage of predictions that are correct
Precision     Percentage of positive ("+") predictions that are correct
F1-score      Harmonic mean of precision and sensitivity/recall
Sensitivity   Percentage of positive ("+") instances predicted as positive
Specificity   Percentage of negative ("−") instances predicted as negative

Results Evaluation: The performance of the proposed classifiers (ADA+NB and NB+ADA) was tested against the existing AdaBoost and NB classifiers using the classification accuracy on the four benchmark data sets. Tables 3–6 summarize the accuracy rates of the plain AdaBoost and NB classifiers and of our proposed hybrid algorithms for each of the four test data sets. The weighted average was used to record the precision, F1-score, sensitivity/recall and specificity.

The results in Table 3 indicate that, although all the algorithms perform well on the Adult data set in terms of accuracy rate, the specificity and sensitivity of the proposed algorithm 2 (NB+ADA) reach almost 94% and 99%, respectively.

Table 3. Classifiers' performance on the Adult data set.

Algorithm              Accuracy (%)   Error Rate (%)   Precision   F1-score   Sensitivity   Specificity
AdaBoost               87.65          12.35            0.87        0.88       0.68          0.94
Naïve Bayes            79.99          20.00            0.78        0.77       0.31          0.95
Proposed Algorithm 1   86.39          13.60            0.85        0.85       0.45          0.96
Proposed Algorithm 2   94.14          5.86             0.96        0.95       0.99          0.94

Table 4 shows the performance of all four classifiers on the Covertype data set. The accuracy rates are poor because the data set is imbalanced, but our proposed algorithm 1 (ADA+NB) is still able to provide 45.76% accuracy with 95% specificity.

Table 4. Classifiers' performance on the Covertype data set.

Algorithm              Accuracy (%)   Error Rate (%)   Precision   F1-score   Sensitivity   Specificity
AdaBoost               44.29          55.70            0.59        0.48       0.52          0.79
Naïve Bayes            45.31          54.69            0.64        0.41       0.21          0.96
Proposed Algorithm 1   45.76          54.24            0.65        0.42       0.22          0.95
Proposed Algorithm 2   44.29          55.70            0.59        0.48       0.52          0.79
Table 5. Classifiers' performance on the Dota2 Games Results data set.

Algorithm              Accuracy (%)   Error Rate (%)   Precision   F1-score   Sensitivity   Specificity
AdaBoost               58.82          41.18            0.59        0.59       0.66          0.51
Naïve Bayes            55.61          44.39            0.56        0.56       0.59          0.52
Proposed Algorithm 1   73.41          26.59            0.73        0.73       0.81          0.63
Proposed Algorithm 2   76.19          23.81            0.76        0.76       0.81          0.69

The results for the Dota2 Games Results data set in Table 5 show that our proposed algorithm 2 (NB+ADA) performs outstandingly, with an error rate of only about 23%, which demonstrates the strength of our proposed algorithms over the AdaBoost and NB classifiers on this data set.

Table 6. Classifiers' performance on the Isolet data set.

Algorithm              Accuracy (%)   Error Rate (%)   Precision   F1-score   Sensitivity   Specificity
AdaBoost               28.89          71.10            0.30        0.22       1.00          0.00
Naïve Bayes            81.46          18.54            0.85        0.81       0.88          1.00
Proposed Algorithm 1   83.55          16.45            0.87        0.83       0.88          1.00
Proposed Algorithm 2   31.19          68.81            0.33        0.24       1.00          0.00

The data in Table 6 indicate that, in terms of accuracy on the Isolet data set, the Naïve Bayes (NB) classifier and our proposed algorithm 1 (ADA+NB) perform well, with accuracy rates of 81.46% and 83.55%, respectively. Both reach a specificity of 100%, while the sensitivity is higher for NB. Comparing the error rates of the two, however, the proposed algorithm 1 (ADA+NB) works better than the classical NB classifier.

4. Graphical Comparison

In this section we plot the performance graphs in Figure 6 for AdaBoost, Naïve Bayes (NB), proposed algorithm 1 (ADA+NB) and proposed algorithm 2 (NB+ADA), and compare their performance based on the accuracy rate on the four real benchmark data sets discussed in the previous sections.

4.1. Classifiers' Performance Graphs

Figure 6(a) shows the performance of the AdaBoost classifier. Despite being a strong classifier, AdaBoost's performance on the Dota2 Games Results data set and the Isolet data set (the orange and red dots in the graph) is not satisfactory. Its performance on the Covertype data set (green dot) is average due to the imbalance of that data set, while the accuracy rate obtained on the Adult data set (blue dot) is around 90%. The NB classifier's performance is shown in Figure 6(b): for the Adult and Isolet data sets (blue and red dots) it performs extraordinarily well, with an accuracy rate of almost 90% for both, whereas the Covertype data set (orange dot) shows the worst accuracy rate due to the imbalanced data.

Figure 6. Classifiers' performance comparison graphs: (a) AdaBoost, (b) Naïve Bayes, (c) ADA+NB, (d) NB+ADA.

Figure 6(c) illustrates the performance of the proposed algorithm 1 (ADA+NB) classifier. It is clearly visible that for the Adult and Isolet data sets our proposed ADA+NB classifier provides an accuracy rate of over 80%. Unfortunately, for the Covertype data set (orange dot) it is unable to provide better classification accuracy, because that data set is very imbalanced as well as high dimensional, and the AdaBoost trees face over-fitting problems. Figure 6(d) visualizes the performance of our proposed algorithm 2 (NB+ADA) classifier over the chosen data sets. It can be seen from the graph that, except for the Isolet data set (red dot), this classifier's performance surpasses the other classifiers' performance on all the chosen data sets, because it uses NB to reduce the noise and then uses the AdaBoost method to generate decision trees (DTs) so that the trees do not fall into over-fitting.
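To indicate how such comparison graphs can be produced, the sketch below redraws the accuracy comparison from the values reported in Tables 3–6 using Matplotlib, one of the plotting packages listed in Section 3.6. The grouped-bar layout and styling are our own choices, not the figures actually used in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

# Accuracy rates (%) taken from Tables 3-6.
datasets = ["Adult", "Covertype", "Dota2", "Isolet"]
accuracy = {
    "AdaBoost":            [87.65, 44.29, 58.82, 28.89],
    "Naive Bayes":         [79.99, 45.31, 55.61, 81.46],
    "Proposed 1 (ADA+NB)": [86.39, 45.76, 73.41, 83.55],
    "Proposed 2 (NB+ADA)": [94.14, 44.29, 76.19, 31.19],
}

x = np.arange(len(datasets))
width = 0.2
fig, ax = plt.subplots(figsize=(8, 4))
for i, (name, acc) in enumerate(accuracy.items()):
    ax.bar(x + i * width, acc, width, label=name)  # one group of bars per data set
ax.set_xticks(x + 1.5 * width)
ax.set_xticklabels(datasets)
ax.set_ylabel("Accuracy (%)")
ax.set_title("Classifier accuracy on the four benchmark data sets")
ax.legend()
plt.tight_layout()
plt.show()
```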
4.2. Datasets vs. Classifiers' Performances

The performance of the previously mentioned classifiers on the chosen data sets is summarized in Figure 7. Based on the accuracy rate, for the Adult and Dota2 Games Results data sets our proposed classifier 2 (NB+ADA), shown in red, is the better choice. For the Covertype data set the accuracy rates are almost the same, so it is hard to single out a specific classifier for that data set, but our first hybrid classifier 1 (ADA+NB), shown in green, is still slightly better; the same classifier can also be used for the Isolet data set.

Figure 7. Classifiers' performance comparison.

5. Conclusion

This research focused on establishing our two newly proposed hybrid classifiers as improvements of the root classifiers, Naïve Bayes (NB) and Adaptive Boosting (AdaBoost). We showed that our proposed classifiers successfully increase the classification performance of AdaBoost and NB in terms of accuracy rate. Technically, AdaBoost is one of the strongest boosting algorithms, but in real life data sets are not well organised: most of the time they are full of noise, missing values or inconsistently behaving instances. Because of these obstacles in the data, AdaBoost often fails to deliver its full capability in terms of accuracy rate. Hence, both of our proposed hybrid methods use the NB classifier to help AdaBoost, protecting it from over-fitting. We used the NB classifier because of its low complexity; a single scan of the training data set is sufficient, so the computation is fast. Another reason for choosing NB is its simple way of handling missing values when it calculates the class membership probabilities. In our first hybrid (ADA+NB) classifier, AdaBoost is used to remove the outliers from the training set, and the comparatively clean data set is then passed on for the Naïve Bayes prediction. In our second hybrid (NB+ADA) classifier, the NB classifier is used to eradicate anomalies from the original data set, and the remaining training examples are passed on to generate the decision trees (DTs) of the AdaBoost model, thereby preventing over-fitting. The performance of two classical machine learning approaches, AdaBoost and NB, and of our two proposed hybrid classifiers was tested, and the experimental results of the four classifiers were compared. The experiments showed that our newly proposed classifiers exceed the performance of the two textbook approaches with higher accuracy rates and lower error rates. We strongly believe that our proposed algorithms can be used to solve different classification problems in real-life problem domains. In the future, we intend to extend our work by using other classification techniques, such as Naïve Bayes trees or K-means clustering, to remove the noise from the data before AdaBoost starts to generate its decision trees (DTs).

Acknowledgements. This research was supported by the Janos Bolyai Research Scholarship of the Hungarian Academy of Sciences and by project GINOP-2.2.1-18-2018-00012, supported by the European Union and co-financed by the European Social Fund. We feel privileged to have received this strong support.

References
[1] A. A. Afza, D. M. Farid, C. M. Rahman: A Hybrid Classifier using Boosting, Clustering, and Naïve Bayesian Classifier, World of Computer Science and Information Technology Journal (WCSIT) 1.3 (2011), pp. 105–109.
[2] S. Appavu alias Balamurugan, R. Rajaram: Effective solution for unhandled exception in decision tree induction algorithms, Expert Systems with Applications 36.10 (2009), pp. 12113–12119, doi: 10.1016/j.eswa.2009.03.072.
[3] B. Aviad, G. Roy: Classification by clustering decision tree-like classifier based on adjusted clusters, Expert Systems with Applications 38 (2011), pp. 8220–8228.
[4] L. Breiman: Arcing classifier (with discussion and a rejoinder by the author), Ann. Statist. 26.3 (June 1998), pp. 801–849, doi: 10.1214/aos/1024691079.
[5] T. Bujlow, M. Riaz, J. Pedersen: A method for classification of network traffic based on C5.0 Machine Learning Algorithm, in: 2012 International Conference on Computing, Networking and Communications (ICNC), Feb. 2012, pp. 237–241, doi: 10.1109/ICCNC.2012.6167418.
[6] K. M. A. Chai, H. L. Chieu, H. T. Ng: Bayesian Online Classifiers for Text Classification and Filtering, in: SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 97–104, doi: 10.1145/564376.564395.
[7] B. Chandra, M. Gupta: Robust approach for estimating probabilities in Naïve–Bayes Classifier for gene expression data, Expert Systems with Applications 38.3 (2011), pp. 1293–1298, doi: 10.1016/j.eswa.2010.06.076.
[8] B. Chandra, P. Paul Varghese: Fuzzifying Gini Index based decision trees, Expert Systems with Applications 36.4 (2009), pp. 8549–8559, doi: 10.1016/j.eswa.2008.10.053.
[9] T. G. Dietterich: An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization, Machine Learning 40.2 (2000), pp. 139–157, doi: 10.1023/a:1007607513941.
[10] H. Drucker, C. Cortes: Boosting Decision Trees, in: Advances in Neural Information Processing Systems, vol. 8, Jan. 1995, pp. 479–485.
[11] L. Fan, K.-L. Poh, P. Zhou: Partition-conditional ICA for Bayesian classification of microarray data, Expert Systems with Applications 37.12 (2010), pp. 8188–8192, doi: 10.1016/j.eswa.2010.05.068.
[12] D. M. Farid, G. M. Maruf, C. M. Rahman: A new approach of Boosting using decision tree classifier for classifying noisy data, in: 2013 International Conference on Informatics, Electronics and Vision (ICIEV), 2013, pp. 1–4, doi: 10.1109/ICIEV.2013.6572718.
[13] D. Farid, M. Al-Mamun, B. Manderick, A. Nowe: An adaptive rule-based classifier for mining big biological data, Expert Systems with Applications 64 (2016), pp. 305–316, doi: 10.1016/j.eswa.2016.08.008.
[14] D. M. Farid, L. Zhang, C. M. Rahman, M. Hossain, R. Strachan: Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks, Expert Systems with Applications 41.4, Part 2 (2014), pp. 1937–1946, doi: 10.1016/j.eswa.2013.08.089.
[15] A. Franco-Arcega, J. Carrasco-Ochoa, G. Sánchez-Díaz, J. Martínez-Trinidad: Decision tree induction using a fast splitting attribute selection for large datasets, Expert Systems with Applications 38.11 (2011), pp. 14290–14300, doi: 10.1016/j.eswa.2011.05.087.
[16] Y. Freund, R. E. Schapire: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences 55.1 (1997), pp. 119–139, doi: 10.1006/jcss.1997.1504.
[17] A. J. Grove, D. Schuurmans: Boosting in the Limit: Maximizing the Margin of Learned Ensembles, in: AAAI '98/IAAI '98: Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, 1998, pp. 692–699.
[18] C.-C. Hsu, Y.-P. Huang, K.-W. Chang: Extended Naive Bayes classifier for mixed data, Expert Systems with Applications 35.3 (2008), pp. 1080–1083, doi: 10.1016/j.eswa.2007.08.031.
[19] X. Li, L. Wang, E. Sung: AdaBoost with SVM-based component classifiers, Engineering Applications of Artificial Intelligence 21.5 (2008), pp. 785–795, doi: 10.1016/j.engappai.2007.07.001.
[20] J. McHugh: Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory, ACM Transactions on Information and System Security 3.4 (2000), pp. 262–294, doi: 10.1145/382912.382923.
[21] K. Polat, S. Güneş: A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems, Expert Systems with Applications 36.2, Part 1 (2009), pp. 1587–1592, doi: 10.1016/j.eswa.2007.11.051.
[22] J. R. Quinlan: Bagging, Boosting, and C4.5, in: Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press, 1996, pp. 725–730.
[23] R. E. Schapire: The Boosting Approach to Machine Learning: An Overview, in: D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, B. Yu (eds.), Nonlinear Estimation and Classification, Lecture Notes in Statistics, vol. 171, Springer, New York, NY, 2003, pp. 149–171, doi: 10.1007/978-0-387-21579-2_9.
[24] H. Schwenk, Y. Bengio: Boosting Neural Networks, Neural Computation 12.8 (2000), pp. 1869–1887, doi: 10.1162/089976600300015178.
[25] M. Tavallaee, E. Bagheri, W. Lu, A. Ghorbani: A detailed analysis of the KDD CUP 99 data set, in: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, 2009, pp. 1–6.