<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Survey on Data Mining Classification Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anupama Mishra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B.B.Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dragan Peraković</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco José García Peñalvo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Technology</institution>
          ,
          <addr-line>Kurukshetra, Haryana 136119</addr-line>
          ,
          <institution>India &amp; Asia University</institution>
          ,
          <addr-line>Taichung 413</addr-line>
          ,
          <institution>Taiwan &amp; Staffordshire University</institution>
          ,
          <addr-line>Stoke-on-Trent ST4 2DE</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Swami Rama Himalayan University</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Salamanca</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Zagreb</institution>
          ,
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this review article, we discuss a number of different classification algorithms used in data mining for various applications. There are many techniques for analysing data with continuous and discrete values; in this paper, we focus on the algorithms used for classification in data mining. Classification is a technique for categorising data into discrete categories based on given constraints. Genetic algorithms, C4.5, and the Naive Bayes algorithm are examples of classification algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>Bagging</kwd>
        <kwd>Naive Bayes</kwd>
        <kwd>SVM</kwd>
        <kwd>Random Forest</kwd>
        <kwd>data mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The practice of identifying previously unknown, valid patterns and relationships in large data sets
using advanced data analysis tools is known as data mining. These tools include
statistical models, mathematical algorithms, and machine learning methodologies. Data mining
techniques have a wide range of applications [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. As a result, data mining involves more than
just data collection and maintenance; it also includes analysis and prediction. The classification
technique, which can handle a wider range of data than regression, is gaining prominence [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Knowledge discovery from datasets is a part of data mining. Data mining tools and methods
are applied to extract patterns and features from large amounts of data [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which can then be
applied to other datasets [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ]. Classification is a process that assigns an object or event to one
of the predefined classes in a group, based on its characteristics, in order to
predict its future behaviour. Classification methods are used when the data set has already
been divided into groups before the classification process begins. The accuracy often depends on
the preprocessing of the data, which involves data cleaning (missing, null, and blank
values), data integration from multiple sources, data transformation, and discretization [15].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Classification Techniques in Data Mining</title>
      <p>
        Classification techniques are methods of data analysis that can be used to determine the
category of an individual based on their personal attributes [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ]. These techniques help
us better understand individuals by grouping them together according to their lifestyle, habits,
and traits. Figure 1 presents the classification algorithms generally used in data
mining applications. Classification is one of the most commonly used data mining techniques. It
can be applied to both categorical and numerical attributes. The goal is to predict the class labels
of new, unseen observations using training data consisting of labeled
examples. This method uses an algorithm to identify patterns in the training data that
are predictive of new observations [18, 25].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Decision Tree</title>
        <p>A decision tree is a class discriminator that recursively splits the training set until each partition
contains only, or predominantly, samples from a single class. Each non-leaf node of the tree holds a
split point: a test on one or more attributes that determines how the data is partitioned [13].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Naive Bayes</title>
        <p>
          Naive Bayes is a probabilistic model widely used in machine learning
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In this model, the probability of each class is computed for a given instance, and
the instance is assigned to the most probable class. Let y be the instance to be classified,
represented by a feature vector x = (x_1, x_2, ..., x_n), where the features x_i are assumed
conditionally independent given the class. For each of the k possible classes C_k, the model
assigns the posterior probability p(C_k | x_1, ..., x_n), computed via Bayes' theorem as
p(C_k | x) = p(C_k) p(x | C_k) / p(x).
        </p>
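        <p>As a minimal illustration (on a hypothetical toy dataset, not one from this paper), the posterior computation can be sketched in Python:</p>
        <preformat>
```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate the class priors p(C) and per-feature counts for p(x_i | C)."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (feature index, class) maps to value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return priors, cond

def classify(priors, cond, row):
    """Assign the class maximising p(C) times the product of p(x_i | C)."""
    total = sum(priors.values())
    scores = {}
    for c, n in priors.items():
        score = n / total  # the prior p(C)
        for i, v in enumerate(row):
            # Laplace-smoothed estimate of p(x_i | C)
            score *= (cond[(i, c)][v] + 1) / (n + 2)
        scores[c] = score
    return max(scores, key=scores.get)

# hypothetical weather data: (outlook, windy) with class "play" or "stay"
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"), ("rain", "no")]
labels = ["play", "play", "stay", "stay"]
priors, cond = train_naive_bayes(rows, labels)
print(classify(priors, cond, ("sunny", "no")))  # prints "play"
```
        </preformat>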
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Rule Based Classification</title>
        <p>Classification rules take the form of "if-then" rules, where the "if" part is a condition and
the "then" part assigns a class. The rules are ranked: rule-based ordering ranks individual rules by
their quality, while class-based ordering groups together rules that predict the same class. A good
rule should be accurate and cover as many cases as possible.</p>
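        <p>A minimal sketch of such an ordered rule list (the attributes and classes here are hypothetical, purely for illustration):</p>
        <preformat>
```python
# Ordered "if-then" rules: the first rule whose condition holds assigns the
# class; a default class covers records that no rule matches.
RULES = [
    (lambda r: r["temp"] == "high" and r["humidity"] == "high", "uncomfortable"),
    (lambda r: r["temp"] == "high", "warm"),
    (lambda r: r["temp"] == "low", "cold"),
]

def rule_classify(record, default="mild"):
    for condition, label in RULES:
        if condition(record):
            return label
    return default

print(rule_classify({"temp": "high", "humidity": "high"}))  # uncomfortable
print(rule_classify({"temp": "mid", "humidity": "low"}))    # mild
```
        </preformat>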
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Support Vector Machine</title>
        <p>The Support Vector Machine (SVM) [23] is a classification technique that can be used to build
both classifiers and non-parametric regression models. SVM works by finding an optimal
hyperplane that separates objects of different classes in the input space based on the training
samples, and it handles both linear and non-linear data. For non-linear data, it transforms the
original training data into a higher dimension using a non-linear mapping and then searches in
that space for the linear optimal separating hyperplane (the "decision boundary"). With a
suitable non-linear mapping to a sufficiently high dimension, data from two classes can always
be separated by a hyperplane. SVM discovers this hyperplane using support vectors ("important
training tuples") and margins (defined by the support vectors). SVM is used for classification
as well as prediction.</p>
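        <p>A short sketch using scikit-learn (the library choice is an assumption of this example; the paper does not prescribe one). The RBF kernel plays the role of the non-linear mapping to a higher dimension described above:</p>
        <preformat>
```python
from sklearn.svm import SVC

# toy 2-D points: class 1 clustered near the origin, class 0 further out,
# so the two classes are not linearly separable in the input space
X = [[0, 0], [0.1, 0.1], [-0.1, 0.1], [2, 2], [-2, 2], [2, -2]]
y = [1, 1, 1, 0, 0, 0]

clf = SVC(kernel="rbf", C=1.0)  # RBF kernel: implicit non-linear mapping
clf.fit(X, y)
print(clf.predict([[0, 0], [2, 2]]))  # an inner point and an outer point
print(len(clf.support_vectors_))      # the "important training tuples"
```
        </preformat>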
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Genetic Algorithms</title>
        <p>
          In GA, a technique called association rule mining is utilised to uncover indeterminate solutions
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. GA is implemented with a small collection of categorical data. Once implemented,
GA produces high-level prediction rules for selecting better attributes. The Michigan
approach provides a single prediction rule for every individual in the population,
lowering the cost [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The Pittsburgh approach [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] instead encodes a whole set of prediction rules in each individual.
When categorising, we evaluate the overall quality of the rule set rather than the quality of each
individual rule. Rules are generalised or specialised based on the facts, for example by combining
conditions with logical OR and logical AND operators.
        </p>
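        <p>The idea can be sketched as a tiny genetic search over attribute subsets encoded as bit-strings (a generic GA sketch, not the specific Michigan or Pittsburgh method; the fitness function is a hypothetical stand-in that rewards keeping attributes 0 and 2):</p>
        <preformat>
```python
import random

random.seed(0)
N_ATTRS = 5
INFORMATIVE = {0, 2}  # hypothetical "good" attributes for the toy fitness

def fitness(bits):
    kept = {i for i, b in enumerate(bits) if b == 1}
    # reward informative attributes, penalise superfluous ones
    return 2 * len(kept.intersection(INFORMATIVE)) - len(kept.difference(INFORMATIVE))

def crossover(a, b):
    point = random.randrange(1, N_ATTRS)  # one-point crossover
    return a[:point] + b[point:]

def mutate(bits):
    # flip each bit with probability 1/10
    return [1 - b if random.randrange(10) == 0 else b for b in bits]

pop = [[random.randrange(2) for _ in range(N_ATTRS)] for _ in range(8)]
for generation in range(20):
    pop = sorted(pop, key=fitness, reverse=True)
    parents = pop[:4]  # keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(4)]
    pop = parents + children

best = max(pop, key=fitness)
print(best, fitness(best))
```
        </preformat>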
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Model Evaluation and Selection</title>
      <p>The task of choosing a model is challenging because many models are often equivalent in
terms of accuracy but differ in computational complexity. The evaluation and selection
of the best model for a particular application depends on the cost-complexity trade-off. One
way to tackle this task is to carry out an exhaustive search over all possible models,
which may be costly in terms of computational time or storage space. Figure 2 depicts the
methods used for evaluation and selection of the model [16].</p>
      <sec id="sec-3-1">
        <title>3.1. Hold-Out</title>
        <p>Hold-out validation is a technique used to estimate classification accuracy. The available
data is split into two disjoint parts, one for training and one for testing; the model is fit on the
training part and evaluated only on the held-out test part. Evaluating on data never seen during
training guards against an over-optimistic estimate caused by over-fitting on the training set.</p>
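        <p>A minimal sketch of the hold-out split (the 70/30 ratio is an assumption for illustration):</p>
        <preformat>
```python
import random

def holdout_split(data, train_fraction=0.7, seed=42):
    """Shuffle, then cut the data into disjoint training and test parts."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train, test = holdout_split(data)
print(len(train), len(test))  # prints: 7 3
```
        </preformat>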
      </sec>
      <sec id="sec-3-2">
        <title>3.2. n-fold Cross Validation</title>
        <p>The available data is divided into n distinct subsets of equal size. In turn, each subset is held
out as the test set while the classifier is trained on the remaining n-1 subsets. The procedure is
repeated n times, and the reported accuracy is the average of the n accuracies. 10-fold and 5-fold
cross-validation are commonly used. This strategy is employed when the available data is small.</p>
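        <p>The procedure can be sketched as follows (assigning folds by index modulo n is one simple choice; "evaluate" stands in for training on the first index list and scoring on the second):</p>
        <preformat>
```python
def n_fold_indices(n_samples, n_folds):
    """Each fold is (train indices, test indices); every index is tested once."""
    folds = []
    for k in range(n_folds):
        test_idx = [i for i in range(n_samples) if i % n_folds == k]
        train_idx = [i for i in range(n_samples) if i % n_folds != k]
        folds.append((train_idx, test_idx))
    return folds

def cross_validate(n_samples, n_folds, evaluate):
    accuracies = [evaluate(tr, te) for tr, te in n_fold_indices(n_samples, n_folds)]
    return sum(accuracies) / len(accuracies)  # average over the n folds

# with a dummy evaluator that always reports 0.9, the average is 0.9
print(cross_validate(20, 5, lambda train, test: 0.9))
```
        </preformat>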
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Leave-one-out cross validation</title>
        <p>If the data volume is small, this method can be used. It is a special case of cross-validation
in which each fold contains only one test case, so nearly all of the data is used for training in
every fold [17]. When there are m examples in the original data set, this is referred to as m-fold
cross-validation.</p>
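        <p>Leave-one-out can be sketched as the special case in which each fold's test set holds exactly one example:</p>
        <preformat>
```python
def loo_indices(n_samples):
    """m folds for m examples: each fold trains on all but one example."""
    return [([j for j in range(n_samples) if j != i], [i])
            for i in range(n_samples)]

folds = loo_indices(4)
print(len(folds))  # prints: 4  (one fold per example)
print(folds[0])    # prints: ([1, 2, 3], [0])
```
        </preformat>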
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Validation Set</title>
        <p>A validation set is widely used in learning algorithms to estimate parameters. In such instances,
the final parameter values are those that provide the highest accuracy on the validation set.
Cross-validation can also be used to estimate parameters. The data may be divided into the
following three sets:
1. Training set
2. Validation set
3. Test set</p>
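        <p>A sketch of the three-way split above (the 60/20/20 ratios are assumptions for illustration):</p>
        <preformat>
```python
def three_way_split(data, train_pct=60, val_pct=20):
    """Cut the data into training, validation, and test sets, in that order."""
    n = len(data)
    a = n * train_pct // 100
    b = a + n * val_pct // 100
    return data[:a], data[a:b], data[b:]

data = list(range(10))
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))  # prints: 6 2 2
```
        </preformat>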
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Minimum Description Length (MDL)</title>
        <p>Missing values are treated by MDL as though they were missing at random; sparse numerical
and sparse categorical data are replaced with zero vectors [14]. Missing values in nested
columns are considered sparse. In MDL, both the model's size and the reduction in uncertainty
that using the model brings matter [22].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Techniques to Improve Classification Accuracy</title>
      <p>Classification accuracy describes how well a model can assign the correct class to a given input.
Improving classification accuracy is important for building models that are more accurate and
fair, as it reduces the risk of unjustified misclassifications and false alarms.</p>
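      <p>Concretely, classification accuracy is the fraction of inputs assigned their correct class:</p>
      <preformat>
```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true class labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# hypothetical predictions against true labels: 3 of 4 correct
print(accuracy(["spam", "ham", "spam", "ham"],
               ["spam", "ham", "ham", "ham"]))  # prints: 0.75
```
      </preformat>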
      <sec id="sec-4-1">
        <title>4.1. Bagging</title>
        <p>The most common technique for improving classification accuracy is bagging [21]. Bagging
resamples the training data with replacement to create multiple training sets (bootstrap
samples, each the same size as the original data) and trains a separate classifier on each.
During the training phase, each classifier therefore sees a different subset of the data. Once
the classifiers have been trained, new data is classified by combining their votes. Boosting,
another popular technique, instead applies larger weights to examples that are misclassified
by earlier classifiers. The result of bagging is an ensemble of classifiers that combine to
create a better model than any single component could be alone, and it can significantly
improve classification accuracy.</p>
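        <p>The bootstrapping-and-voting scheme can be sketched as follows (the base learner here is a deliberately trivial majority-class predictor so the example stays self-contained; in practice bagging is typically paired with decision trees):</p>
        <preformat>
```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample with replacement, same size as the original data."""
    return [rng.choice(data) for _ in data]

def train_majority(sample):
    """Trivial base learner: always predict the sample's majority class."""
    label = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    return lambda x: label

def bagging_predict(models, x):
    """Combine the ensemble by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(i, "pos") for i in range(6)] + [(i, "neg") for i in range(4)]
models = [train_majority(bootstrap(data, rng)) for _ in range(11)]
print(bagging_predict(models, 0))
```
        </preformat>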
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Boosting</title>
        <p>Boosting algorithms are well known as high-performing classifiers [19]. They are versatile
and provide good accuracy even when there is a large imbalance in the training data. However,
they can be computationally expensive for online or near real-time processing. Boosting
produces classifiers in succession: each classifier depends on the preceding one and
concentrates on its errors. Examples that previous classifiers predicted wrongly are selected
more frequently and have their weights increased accordingly, while correctly classified
examples have their weights reduced. Boosting is a machine learning technique that can be
used to improve the classification accuracy of a model. It can be applied in many different
scenarios, but is most often used when the goal is to identify which class (or label) an
observation belongs to [20].</p>
        <p>A big issue with boosting models is that they do not always converge well, which prevents
the algorithm from making accurate predictions. Many different techniques can mitigate this
issue, one of them being early stopping.</p>
        <p>Boosting models work by accumulating error terms, which are then used to adjust weights
on different parts of the model or training data. The more data there is for a given class, the
larger its weight will be in the model's prediction function, and vice versa for other classes.</p>
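        <p>The reweighting step described above can be sketched as follows (an AdaBoost-style update; the fixed error rate of 0.2 is an assumption for illustration):</p>
        <preformat>
```python
import math

def reweight(weights, correct, error=0.2):
    """Increase the weights of misclassified examples, decrease the rest."""
    alpha = 0.5 * math.log((1.0 - error) / error)
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]  # renormalise to sum to 1

weights = [0.25, 0.25, 0.25, 0.25]
# suppose the first classifier got examples 0 and 1 right and 2 and 3 wrong:
weights = reweight(weights, [True, True, False, False])
print(weights)  # examples 2 and 3 now carry four times the weight of 0 and 1
```
        </preformat>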
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Occam’s Razor</title>
        <p>Occam’s razor applies when choosing a theory that fits our data and classifies unfamiliar
objects. It says that if two or more models have the same generalisation error, the simpler model
should be preferred over the more complex one, since there is a greater probability that a
sophisticated model has been fitted accidentally to errors in the data [22].</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Random Forest</title>
        <p>WEKA is a general-purpose classification and regression tool [24]. Compared with gradient
boosting and support vector machines, random forest achieves very high accuracy. Random
forest is built from two ingredients: 1. regression and classification trees, and 2. bootstrap
samples, where a bootstrap sample is a sample drawn from the original dataset with replacement
that is the same size as the original dataset.</p>
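        <p>A short sketch with scikit-learn (the library choice is an assumption of this example): a random forest grows many trees, each on a bootstrap sample of the training data, and aggregates their votes:</p>
        <preformat>
```python
from sklearn.ensemble import RandomForestClassifier

# two well-separated hypothetical clusters
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # one point from each cluster
```
        </preformat>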
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This review discusses numerous data mining classification techniques; each technique has
its own set of advantages and disadvantages. Data mining is a broad term that encompasses
a variety of approaches for analysing vast amounts of data, drawing on technologies such as
machine learning and deep learning together with statistics. These fields provide a large
number of data mining algorithms for different data analysis tasks. Based on the behaviour
of the data, a suitable algorithm and evaluation method can be chosen to analyse it.</p>
      <p>[13] Breslow, L. A. &amp; Aha, D. W. (1997). Simplifying Decision Trees: A Survey. Knowledge Engineering Review 12: 1-40.</p>
      <p>[14] Jensen, F. (1996). An Introduction to Bayesian Networks. Springer.</p>
      <p>[15] Gupta, B. B., &amp; Sheng, Q. Z. (Eds.). (2019). Machine Learning for Computer and Cyber Security: Principles, Algorithms, and Practices. CRC Press.</p>
      <p>[16] Sahoo, S. R., &amp; Gupta, B. B. (2020). Classification of spammer and non-spammer content in online social network using genetic algorithm-based feature selection. Enterprise Information Systems, 14(5), 710-736.</p>
      <p>[17] Madden, M. (2003). The performance of Bayesian network classifiers constructed using different techniques. Proceedings of the European Conference on Machine Learning, Workshop on Probabilistic Graphical Models for Classification, pp. 59-70.</p>
      <p>[18] Kearns, M. &amp; Ron, D. Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation.</p>
      <p>[19] Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121: 256-285.</p>
      <p>[20] Jiang, W. (2000). Process consistency for AdaBoost. Tech. Report, Dept. of Statistics, Northwestern University.</p>
      <p>[21] Breiman, L. (1996). Bagging predictors. Machine Learning.</p>
      <p>[22] Alpaydin, E. (1997). Voting over multiple condensed nearest neighbors. Artificial Intelligence Review 11: 115-132. Kluwer Academic Publishers.</p>
      <p>[23] Cristianini, N. &amp; Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.</p>
      <p>[24] Breiman, L., Friedman, J. H., Olshen, R. A., &amp; Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont.</p>
      <p>[25] Sahoo, S. R., &amp; Gupta, B. B. (2021). Real-time detection of fake account in Twitter using machine-learning approach. In Advances in Computational Intelligence and Communication Technology (pp. 149-159). Springer, Singapore.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Mamta</surname>
          </string-name>
          (
          <year>2021</year>
          )
          <article-title>Quick Medical Data Access Using Edge Computing, Insights2Techinfo</article-title>
          , pp.
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sandeep</given-names>
            <surname>Kumar</surname>
          </string-name>
          (
          <year>2021</year>
          )
          <article-title>Artificial Intelligence and Machine learning for Smart and Secure Healthcare System</article-title>
          ,
          <source>Insights2Techinfo</source>
          , pp.
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Adil</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grigoriev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>B. B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rho</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Training an agent for fps doom game using visual reinforcement learning and vizdoom</article-title>
          .
          <source>International Journal of Advanced Computer Science and Applications</source>
          ,
          <volume>8</volume>
          (
          <issue>12</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>AlZu'bi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shehab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Ayyoub</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jararweh</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Parallel implementation for 3d medical volume fuzzy segmentation</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>130</volume>
          ,
          <fpage>312</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2005</year>
          ),
          <source>Data Mining: Practical Machine Learning Tools And Techniques, 2nd Edition</source>
          , Morgan Kaufmann, San Francisco,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Constructing X-Of-N Attributes For Decision Tree Learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>40</volume>
          :
          <fpage>35</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Al-Ayyoub</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , AlZu'bi,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Jararweh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Shehab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            , &amp;
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. B.</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Accelerating 3D medical volume segmentation using GPUs</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          ,
          <volume>77</volume>
          (
          <issue>4</issue>
          ),
          <fpage>4939</fpage>
          -
          <lpage>4958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geiger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Goldszmidt</surname>
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <source>Bayesian Network Classifiers.Machine Learning</source>
          <volume>29</volume>
          :
          <fpage>131</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piatetsky-Shapiro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>And Smyth P.</surname>
          </string-name>
          , “From Data Mining To Knowledge Discovery In Databases,” AI Magazine,
          <source>American Association For Artificial Intelligence</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Being Bayesian About Network Structure: A Bayesian Approach To Structure Discovery In Bayesian Networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>50</volume>
          (
          <issue>1</issue>
          ):
          <fpage>95</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <source>C4.5 - Programs For Machine Learning</source>
          . Morgan Kaufmann Publishers, San Francisco, CA,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Bianca</surname>
            <given-names>V. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philippe Boula De Mareüil And Martine Adda-Decker</surname>
          </string-name>
          ,
          <article-title>“Identification Of Foreign-Accented French Using Data Mining Techniques</article-title>
          ,
          <source>Computer Sciences Laboratory For Mechanics And Engineering Sciences (Limsi)”.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>