<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classical and Quantum Improvements of Generic Decision Tree Constructing Algorithm for Classification Problem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kamil Khadiev</string-name>
          <email>kamilhadi@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilnaz Mannapov</string-name>
          <email>ilnaztatar5@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liliya Safina</string-name>
          <email>liliasafina94@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan Federal University</institution>
          ,
          <addr-line>18 Kremlyovskaya street , Kazan, 420008</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Zavoisky Physical-Technical Institute, FRC Kazan Scientific Center of RAS</institution>
          ,
          <addr-line>10/7, Sibirsky tract, Kazan, 420029</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>83</fpage>
      <lpage>93</lpage>
      <abstract>
        <p>In this work, we focus on the complexity of the generic decision tree classifier constructing algorithm. In the classical case, the decision tree is constructed in $O(h \cdot M \cdot N(d + \log N))$ running time, where $d$ is the number of classes, $N$ is the input data size, $M$ is the number of attributes, and $h$ is the tree height. We offer two options for improving the classical version of the generic algorithm; the running times of the algorithm that uses these options are $O(h \cdot M \cdot N(\log d + \log N))$ (general case) and $O(N \cdot M)$ (for independent attributes). After that, we suggest a quantum improvement that uses quantum subroutines such as amplitude amplification and the Dürr-Høyer minimum search algorithm. The running time of the quantum algorithm is $O(h \cdot \sqrt{M} \cdot N(d + \log N))$, which is better than the complexity of the classical algorithm in the general case.</p>
      </abstract>
      <kwd-group>
        <kwd>Quantum machine learning</kwd>
        <kwd>quantum decision trees</kwd>
        <kwd>decision tree constructing</kwd>
        <kwd>classification problem</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The presented quantum algorithm uses a combination of known quantum algorithms and classical
computing. The Dürr-Høyer algorithm is used as a subroutine. Our quantum algorithms are
formulated in the query model, which is based on making queries to a black box that provides access
to the input. The running time is the number of queries to the black box [
        <xref ref-type="bibr" rid="ref2 ref4">3, 4, 6</xref>
        ].
      </p>
      <p>In Section 2, we present the preliminaries. Section 3 contains the description of the classical version of
the algorithm. We provide the classical improvements in Section 4 and the new quantum algorithm for
decision tree constructing in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>Let $T = \{X^1, \dots, X^N\}$ be a training data set and $Y = (y^1, \dots, y^N)$ be the set of corresponding
classes, where $y^i \in \{1, \dots, d\}$. One element from $T$ is a vector of attribute values $X = (x_1, \dots, x_M)$, where $M$ is the
number of attributes, $A = \{a_1, \dots, a_M\}$ is the set of attributes, and $N = |T|$ is the size of the training data set. Let
us consider some element $X \in T$. An attribute $a_j$ is a real-valued variable or a categorical variable.
Let $x_j \in \mathbb{R}$ if $a_j$ is real-valued; and let $x_j \in \{v_{j,1}, \dots, v_{j,T_j}\}$ if $a_j$ is a categorical attribute, i.e. $x_j$ takes one of $T_j$ values for some integer $T_j$. Let $v_{j,k}$ be
the value with index $k$ of attribute $a_j$. Let $y(X)$ be the index of the class of $X$, where $d$ is the number of classes.</p>
      <p>Let $T_P$ be a subset of the training set whose elements satisfy the restrictions
defined by a predicate $P$. For example, $T_{y=i}$ is the subset of $T$ that belongs to the class with
number $i$.</p>
      <p>The problem is to construct a function $F$ that is called a classifier. The
function classifies a new vector $X \notin T$. Let $CAT$ and $REAL$ be the notations for the sets of
indexes of categorical and real-valued attributes, respectively.</p>
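      <p>For concreteness, the following Python sketch shows this data model; the identifiers (Element, T, REAL, CAT) are our own illustration and are not fixed by the paper.</p>
      <preformat>
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Element:
    x: List[Union[float, int]]  # attribute values x_1, ..., x_M
    y: int                      # class index in {0, ..., d-1}

# Example: N = 3 elements, M = 2 attributes (a_1 real-valued, a_2 categorical), d = 2 classes.
T = [
    Element([1.5, 0], 0),
    Element([2.7, 1], 1),
    Element([0.3, 1], 0),
]
REAL, CAT = {0}, {1}  # sets of indexes of real-valued and categorical attributes
      </preformat>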
    </sec>
    <sec id="sec-3">
      <title>3. The Observation of the Generic Algorithm</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. The Tree Structure</title>
      <p>Decision tree constructing algorithms use the “divide and conquer” method to build a suitable tree. If
all vectors in the current subset $B$ belong to the same class $c$, then the process of
decision tree constructing is stopped, and the leaf is labeled by $c$. Otherwise, let $Q$ be some test (with
outcomes $o_1, \dots, o_t$) that creates a partition of $B$. A part $B_k$ is the set of vectors from $B$ that
corresponds to the outcome $o_k$.</p>
      <p>We consider tests of two types. If $a_j$ is a categorical attribute from $A$, then a test is $x_j = v_{j,k}$
with $T_j$ outcomes, one for each value from $\{v_{j,1}, \dots, v_{j,T_j}\}$.</p>
      <p>If $a_j$ is a real-valued attribute, then a test has two value options: $x_j \le \theta$ and $x_j > \theta$. Here
$\theta$ is a threshold value.</p>
      <p>To construct a tree classifier, we consider a node structure that has the next fields:
- a condition field which indicates whether the node is a leaf;
- an attribute index for non-leaf nodes and a class index for leaf nodes;
- an array of key-value pairs $(P, node)$, where $P$ is a predicate and
$node$ is the corresponding subtree. If an attribute is categorical, then the size of the array equals
the count of attribute values. For real-valued attributes, the array contains two items.</p>
      <p>Let the main recursive function of the decision tree constructing process be given as Algorithm 1.
On each call, the procedure creates a current node and checks whether it is necessary to stop the
construction process.</p>
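      <p>A minimal Python sketch of such a node structure and of classifying a vector with it is given below; all identifiers are our own illustration, since the paper specifies the structure only abstractly.</p>
      <preformat>
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Node:
    is_leaf: bool     # condition field: leaf or inner node
    index: int        # attribute index (inner node) or class index (leaf)
    # key-value pairs (predicate, subtree): two items for a real-valued split,
    # one item per attribute value for a categorical split
    children: List[Tuple[Callable[[list], bool], "Node"]] = field(default_factory=list)

def classify(node: "Node", x: list) -> int:
    """Follow the matching predicates down to a leaf and return its class index."""
    while not node.is_leaf:
        for predicate, child in node.children:
            if predicate(x):
                node = child
                break
    return node.index
      </preformat>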
    </sec>
    <sec id="sec-5">
      <title>3.2. Test Selection Procedure</title>
      <p>
        To maximize a heuristic splitting criterion, the generic decision tree constructing algorithm uses a
greedy search for selecting a candidate test. In most decision tree inducers, an internal node is split
according to the value of a single attribute. The inducer searches for the best attribute upon which to
perform the split. There are various univariate criteria that can be characterized in different ways,
such as: according to the origin of the measure (information theory, dependence, and distance);
or according to the measure structure (impurity-based criteria, normalized impurity-based criteria, and
binary criteria [
        <xref ref-type="bibr" rid="ref19">22</xref>
        ]). Let us briefly review them.
      </p>
      <p>Impurity-based Criteria. Let $x$ be a random variable with $d$ values, distributed according to
$P = (p_1, \dots, p_d)$. A function $\varphi$ is an impurity measure if it satisfies the following
conditions:
- $\varphi(P) \ge 0$;
- $\varphi(P)$ is the minimum if there exists $j$ such that the component $p_j = 1$;
- $\varphi(P)$ is the maximum if for all $j$, $1 \le j \le d$, the following condition is true: $p_j = 1/d$;
- $\varphi(P)$ is symmetric with respect to the components of $P$;
- $\varphi(P)$ is smooth (differentiable everywhere) in its range.</p>
      <p>The goodness-of-split due to a selected attribute $a_j$ is defined as a reduction in the impurity of the
target attribute after splitting $B$ according to the values $v_{j,k}$:
$$\Delta\varphi(a_j, B) = \varphi(P_B) - \sum_{k=1}^{T_j} \frac{|B_{x_j = v_{j,k}}|}{|B|}\,\varphi(P_{B_{x_j = v_{j,k}}}) \qquad (1)$$
The probability vector $P_B = (p_1, \dots, p_d)$ (from a given training set $B$) of the set of corresponding classes
is defined by $p_i = |B_{y=i}| / |B|$.</p>
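      <p>As an illustration, Formula 1 can be computed as follows for any impurity measure $\varphi$ passed in as a function; the helper names are ours.</p>
      <preformat>
from collections import Counter
from typing import Callable, List, Sequence

def probability_vector(labels: Sequence[int], d: int) -> List[float]:
    """P_B = (p_1, ..., p_d) with p_i = |B_{y=i}| / |B|."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[i] / n for i in range(d)]

def impurity_reduction(phi: Callable[[List[float]], float],
                       labels: Sequence[int],
                       parts: Sequence[Sequence[int]],
                       d: int) -> float:
    """Formula (1): phi(P_B) minus the size-weighted impurities of the parts."""
    n = len(labels)
    after = sum(len(p) / n * phi(probability_vector(p, d)) for p in parts)
    return phi(probability_vector(labels, d)) - after
      </preformat>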
      <p>Normalized Impurity-based Criteria are normalized variants of usual Impurity-based Criteria.
Sometimes it is useful to “normalize” the impurity-based measures. The famous decision tree
constructing algorithms such as ID3, C4.5, C5.0, CART use impurity based-criteria, and normalized
impurity-based criteria.</p>
      <p>Binary Criteria are used to build binary decision trees. These measures are based on a division of
the input attribute domain into two subdomains.</p>
      <p>In this work, we consider impurity-based and normalized impurity-based criteria.</p>
      <p>Let us provide some information about usage of impurity based criteria and applying our
improvements in practical cases. The decision tree model with impurity-based and normalized
impurity-based criteria is used in the industry. Such algorithms as ID3, C4.5, C5.0, CART are used in
well-known data processing frameworks, PC programs, and other machine learning algorithms as
subroutines.</p>
      <p>ID3 and CART use impurity-based criteria; C4.5 and C5.0 use normalized impurity-based criteria.
The criteria of ID3, C4.5, and C5.0 are based on the Entropy criterion. CART uses the Gini Index as the
impurity criterion. $\varphi(B)$ for CART is defined by the formula:
$$Gini(B) = 1 - \sum_{i=1}^{d} \left(\frac{|B_{y=i}|}{|B|}\right)^2$$</p>
      <p>For ID3, C4.5, and C5.0, $\varphi(B) = Info(B)$ is defined by the next formula:
$$Info(B) = -\sum_{i=1}^{d} \frac{|B_{y=i}|}{|B|} \log_2 \frac{|B_{y=i}|}{|B|}$$</p>
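      <p>A direct Python transcription of these two criteria (a sketch; the function names are ours):</p>
      <preformat>
from math import log2
from typing import Sequence

def gini(labels: Sequence[int], d: int) -> float:
    """Gini(B) = 1 - sum_i (|B_{y=i}| / |B|)^2, the CART impurity."""
    n = len(labels)
    return 1.0 - sum((labels.count(i) / n) ** 2 for i in range(d))

def info(labels: Sequence[int], d: int) -> float:
    """Info(B) = -sum_i p_i log2 p_i, the entropy used by ID3, C4.5, and C5.0."""
    n = len(labels)
    ps = [labels.count(i) / n for i in range(d)]
    return -sum(p * log2(p) for p in ps if p > 0)
      </preformat>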
    </sec>
    <sec id="sec-6">
      <title>3.3. Attributes Processing</title>
      <p>
        This subsection is based on the open-sourced code of C5.0 [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ].
      </p>
      <p>The function uses an abstract impurity-based criterion constructed by Formula 1. It
calculates the reduction of impurity after a split. Categorical and real-valued attributes are processed
differently: the function checks the kind of the input attribute and calls the procedure for categorical
attributes (Algorithm 4) or for real-valued attributes (Algorithm 3).</p>
      <p>The arguments of the attribute processing procedure are the index $j$ of the processed attribute and a
training subset $B$. It returns a triple: the maximal impurity reduction
value, the partition of $B$ produced by the selected attribute, and the selected
threshold value. The threshold is used only for real-valued attributes.</p>
      <p>For considering this process, we have to describe an abstract impurity-based function. It can be
described with the next formula:
$$\varphi(B) = \sum_{i=1}^{d} \tilde{\varphi}(|B_{y=i}|, |B|) \qquad (2)$$</p>
      <p>Note that $\tilde{\varphi}$ is some function with $O(1)$ running time. It is specific for each impurity-based
criterion. Formula 2 is needed for analyzing the running time of this algorithm.</p>
      <p>Let us provide some detailed information about the processing of a real-valued attribute $a_j$. Firstly, the
algorithm sorts the subset $B$ by the values of $x_j$. It is done by the sorting procedure. Note that the
indexes in the resulting sorted order are $(i_1, \dots, i_{|B|})$. Now we can split the vectors by a
threshold $\theta$. After that there are two sets
$B_{x_j \le \theta} = \{X^{i_1}, \dots, X^{i_u}\}$ and
$B_{x_j > \theta} = \{X^{i_{u+1}}, \dots, X^{i_{|B|}}\}$, for
$x_j^{i_u} \le \theta$ and $x_j^{i_{u+1}} > \theta$.</p>
      <p>The second step is computing the number of elements corresponding to each class.</p>
      <p>Let $cnt^{\le}[u][i]$, for $u \in \{1, \dots, |B|\}$ and $i \in \{1, \dots, d\}$, be a sequence of counters calculated for
the sorted order: $cnt^{\le}[u][i]$ is the number of vectors among $X^{i_1}, \dots, X^{i_u}$
whose class index is $i$. Let $cnt^{>}[u][i]$ be a sequence of counters calculated for the
reversed order: $cnt^{>}[u][i]$ is the number of vectors among $X^{i_{u+1}}, \dots, X^{i_{|B|}}$
whose class index is $i$. Let $imp^{\le}(u)$ and $imp^{>}(u)$ be the impurities of the two parts. The
value $imp^{\le}$ is used for pre-counting of an impurity for any threshold; $imp^{>}$ is used for
pre-counting of an impurity for any threshold from the back side of the training set. These values
are calculated using Formula 2: the value $imp^{\le}$ is a prefix and $imp^{>}$ is a suffix of the same
computation.</p>
      <p>It is made by these formulas:
$$imp^{\le}(u) = \sum_{i=1}^{d} \tilde{\varphi}(cnt^{\le}[u][i], u), \qquad
imp^{>}(u) = \sum_{i=1}^{d} \tilde{\varphi}(cnt^{>}[u][i], |B| - u).$$</p>
      <p>The last step is choosing a maximum of the impurity reduction
$$\Delta\varphi(u) = \varphi(B) - \frac{u}{|B|}\, imp^{\le}(u) - \frac{|B| - u}{|B|}\, imp^{>}(u)$$
over all positions $u$. As a result, we get the maximal reduction value and the corresponding
threshold $\theta = (x_j^{i_u} + x_j^{i_{u+1}})/2$.</p>
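      <p>The following sketch implements this threshold scan for one real-valued attribute. It assumes $\tilde{\varphi}$ is given as a two-argument function (for the Gini index one can take $\tilde{\varphi}(u, w) = 1/d - (u/w)^2$); all other names are our own illustration.</p>
      <preformat>
from typing import Callable, Sequence, Tuple

def best_threshold(values: Sequence[float], labels: Sequence[int], d: int,
                   phi_tilde: Callable[[int, int], float]) -> Tuple[float, float]:
    """Scan all split positions in the sorted order of one real-valued attribute.

    phi_tilde(u, w) is the O(1) summand of Formula (2); the class counters of
    the prefix and suffix parts are maintained incrementally.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    cnt_le = [0] * d                              # class counts in the prefix part
    cnt_g = [labels.count(i) for i in range(d)]   # class counts in the suffix part
    imp_b = sum(phi_tilde(cnt_g[i], n) for i in range(d))  # impurity of the whole B
    best = (float("-inf"), 0.0)   # stays at -inf if the attribute is constant on B
    for u in range(1, n):         # prefix holds the first u sorted elements
        c = labels[order[u - 1]]
        cnt_le[c] += 1
        cnt_g[c] -= 1
        if values[order[u - 1]] == values[order[u]]:
            continue              # equal values cannot be separated by a threshold
        imp_le = sum(phi_tilde(cnt_le[i], u) for i in range(d))
        imp_g = sum(phi_tilde(cnt_g[i], n - u) for i in range(d))
        reduction = imp_b - (u / n) * imp_le - ((n - u) / n) * imp_g
        theta = (values[order[u - 1]] + values[order[u]]) / 2
        best = max(best, (reduction, theta))
    return best  # (maximal impurity reduction, threshold)
      </preformat>
      <p>Recomputing the two impurities from the counters costs $O(d)$ per threshold, which matches the $O(|B| \cdot d)$ term in the running time analysis of Appendix A.</p>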
      <p>Let us describe the processing of a categorical attribute $a_j$ from $A$. We split all elements of $B$ according
to the attribute value. After that we can compute the value of the objective function. So
$B_k = B_{x_j = v_{j,k}}$, for $k \in \{1, \dots, T_j\}$. All vectors of $B$ are processed one by one.</p>
      <p>Let us consider the processing of the current $u$-th vector $X$ such that $x_j = v_{j,k}$ and $y(X) = c$. Let us
describe the variables used in the processing of categorical attributes. Let $cnt_k$ be the size of $B_k$;
$cnt_k[i] = |B_{k,y=i}|$ be the count of elements from $B_k$ that belong to the class $i$; $cntC[i] = |B_{y=i}|$
be the number of vectors from $B$ that belong to the $i$-th class; $imp_k$ be the notation of the impurity value
$\varphi(B_k)$; and $imp$ be the impurity of $B$.</p>
      <p>These variables contain their values after processing the $u$-th vector, and before that they contain the
values obtained for the first $u-1$ vectors. The final values of the variables are reached after processing all
$|B|$ vectors. We recalculate each variable according to Formula 2; only the summands that depend
on the changed counters are updated:
$$imp_k \leftarrow imp_k + \tilde{\varphi}(cnt_k[c] + 1, cnt_k + 1) - \tilde{\varphi}(cnt_k[c], cnt_k),$$
$$imp \leftarrow imp + \tilde{\varphi}(cntC[c] + 1, |B|) - \tilde{\varphi}(cntC[c], |B|).$$
In the end, the procedure computes an impurity reduction by Formula 1. Finally, we obtain the
procedure from Algorithm 4.</p>
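      <p>A one-pass sketch of the categorical case under the same assumptions ($\tilde{\varphi}$ passed in as a function; the names are ours):</p>
      <preformat>
from collections import defaultdict
from typing import Callable, Sequence

def process_categorical(values: Sequence[int], labels: Sequence[int], d: int,
                        phi_tilde: Callable[[int, int], float]) -> float:
    """One pass over B: maintain per-part class counters, then apply Formula (1)."""
    cnt = defaultdict(lambda: [0] * d)   # cnt[v][i] = |B_{x_j = v, y = i}|
    size = defaultdict(int)              # size[v] = |B_{x_j = v}|
    for v, c in zip(values, labels):
        cnt[v][c] += 1
        size[v] += 1
    n = len(values)
    imp_b = sum(phi_tilde(labels.count(i), n) for i in range(d))
    imp_after = sum(
        (size[v] / n) * sum(phi_tilde(cnt[v][i], size[v]) for i in range(d))
        for v in size
    )
    return imp_b - imp_after  # impurity reduction of Formula (1)
      </preformat>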
    </sec>
    <sec id="sec-7">
      <title>3.4. Running Time of the Generic Tree Constructing Algorithm</title>
      <p>Recall that $REAL$ is the set of indexes of numeric attributes (real-valued attributes) and $CAT$ is the set of
indexes of categorical attributes.</p>
      <sec id="sec-7-1">
        <title>Theorem 1</title>
        <p>The running time of the generic tree constructing algorithm is $O(h \cdot M \cdot N(d + \log N))$. (See
Appendix A.)</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4. Improvement of the Classical Algorithm</title>
      <p>Let us discuss an approach used for a classical improvement.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1. A Fast Tree-Growing Algorithm</title>
      <p>
        We consider a Fast Tree-Growing Algorithm [
        <xref ref-type="bibr" rid="ref22">25</xref>
        ] for the classical generic decision tree
constructing algorithm. It is based on the attribute independence assumption. This approach cannot be
applied to all cases of classification problems. On the other hand, many practical cases can be solved
faster because of the assumption of attribute independence. Recall that the key point of decision
tree constructing algorithms with impurity-based criteria is the information gain calculation. It is
evaluated by Formula 1.
      </p>
      <p>Let us consider $\varphi$ for CART and the ID3 family, defined by the formulas
$$Gini(B) = 1 - \sum_{i=1}^{d} (P(c_i))^2 \quad \text{and} \quad Info(B) = -\sum_{i=1}^{d} P(c_i) \log_2 P(c_i),$$
where $P(c_i) = |B_{y=i}| / |B|$. For a training set partition $\{B_1, \dots, B_t\}$,
we can define the same values for each part $B_k$ using the probabilities $P(c_i) = |B_{k,y=i}| / |B_k|$.</p>
      <p>The tree-growing process is a recursive process of splitting of the training data. Let $B$ be the
training data associated with the considered node. Let us take another view of the problem. The
value $P(c_i)$ actually can be replaced by the conditional probability $P(c_i \mid x_p)$ on the input training data,
where $x_p$ is an assignment of values to the variables in $X_p$,
and $X_p$ is the set of attributes along the path from the current node to the root, called path attributes.
Similarly, $P(c_i)$ on $B$ is $P(c_i \mid x_p)$ on
the entire training data.</p>
      <p>In the process of tree-growing, each candidate attribute $a_j$ (an attribute not in $X_p$) is evaluated using
Equation 1, and the one with the highest information gain is selected as the attribute for splitting. The
most time-consuming part of this process is evaluating $P(c_i \mid x_p, a_j)$ for computing the gain. It
must pass through each instance in $B$, for each of which it iterates through each candidate
attribute $a_j$. This results in a running time of $O(|B| \cdot M)$ per node. The union of the subsets on each level of
the tree is the input data set that has a size equal to $N$, and the running time for each level is $O(N \cdot M)$.
Therefore, the classical decision-tree learning algorithm has a running time of $O(N \cdot M \cdot h)$,
where $h$ is the height of the tree, i.e., the count of levels.</p>
      <p>The key observation is the ability to skip passing through $B$ for each candidate attribute to
estimate $P(c_i \mid x_p, a_j)$. According to probability theory, we have
$$P(c_i \mid x_p, a_j) = \frac{P(c_i \mid x_p)\, P(a_j \mid x_p, c_i)}{P(a_j \mid x_p)}, \qquad
P(a_j \mid x_p) = \sum_i P(c_i \mid x_p)\, P(a_j \mid x_p, c_i).$$</p>
      <p>Suppose that each candidate attribute is independent of the path attribute assignment given the
class, i.e., $P(a_j \mid x_p, c_i) = P(a_j \mid c_i)$. Then we have
$$P(c_i \mid x_p, a_j) = \frac{P(c_i \mid x_p)\, P(a_j \mid c_i)}{P(a_j \mid x_p)}. \qquad (3)$$</p>
      <p>
        According to the paper [
        <xref ref-type="bibr" rid="ref22">25</xref>
        ], the information gain calculated by Equations 3 and 1 is called the
independent information gain. Note that in Equation 3, $P(a_j \mid c_i)$ is the percentage of instances
with the given attribute value among the instances of class $c_i$ on the entire training data; it can be
precomputed and stored with a running time of $O(N \cdot M)$ before the tree-growing process, with an
additional space increase of $O(d \cdot \sum_j T_j)$. The value $P(c_i \mid x_p)$ is the percentage of instances
belonging to class $c_i$ in $B$; it can be computed by passing
through $B$ once, taking $O(|B|)$. Thus, at each level, the running time for computing $P(c_i \mid x_p, a_j)$
using Equation 3 is $O(N)$.
      </p>
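      <p>A sketch of the two estimates used by Equation 3 for one (discretized) attribute; the function names are ours, and the probabilities are represented as plain Python dictionaries and lists.</p>
      <preformat>
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def precompute_p_a_given_c(values: Sequence[int],
                           labels: Sequence[int]) -> Dict[Tuple[int, int], float]:
    """P(a = v | c): share of class-c instances with attribute value v,
    computed once on the entire training data before tree-growing."""
    pair = Counter(zip(values, labels))
    per_class = Counter(labels)
    return {(v, c): pair[(v, c)] / per_class[c] for (v, c) in pair}

def p_c_given_path_and_a(p_c_path: List[float],
                         p_a_c: Dict[Tuple[int, int], float],
                         v: int, d: int) -> List[float]:
    """Equation (3): P(c | x_p, a = v) proportional to P(c | x_p) * P(a = v | c),
    normalized by P(a = v | x_p) = sum_i P(a = v | c_i) P(c_i | x_p)."""
    joint = [p_c_path[c] * p_a_c.get((v, c), 0.0) for c in range(d)]
    z = sum(joint)
    return [w / z for w in joint] if z > 0 else p_c_path
      </preformat>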
      <p>The value $P(a_j \mid x_p)$ in Equation 1 should also be computed for computing the independent
information gain. If we examined the partition for each candidate attribute $a_j$, the corresponding
running time would be $O(N \cdot M)$. Fortunately, $P(a_j \mid x_p)$ can be approximated by
$\sum_i P(a_j \mid c_i)\, P(c_i \mid x_p)$, taking $O(d)$.</p>
      <p>The running time for selecting the splitting attribute using the independent information gain is similar to using information gain
in C4.5. The gain should be computed for each candidate attribute; it takes $O(M)$ for each node.
The total running time for splitting attribute selection on the entire tree is $O(K \cdot M)$, where $K$ is the
number of internal nodes of the tree. Note that $K$ depends on $h$ (the height of the tree), and it is a
parameter of the algorithm. Note that $K$ can be bounded by $N$, because the number of rules in the tree cannot
be more than the size of the training set. Thus, the total running time is $O(N \cdot M)$.</p>
      <p>The total time for tree-growing is the sum of the time for probability estimation, partition, and
splitting attribute selection. As a result, the running time for tree-growing using the independent information gain is $O(N \cdot M)$.</p>
      <p>Note that in Algorithm 5, we do not cope with real-valued attributes for simplicity; we process
real-valued attributes in the following way. In preprocessing, all real-valued attributes are discretized
by $k$-bin discretization, where $k = \lfloor \sqrt{N} \rfloor$.</p>
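      <p>A sketch of this preprocessing step, assuming equal-frequency binning (the paper does not fix the kind of $k$-bin discretization):</p>
      <preformat>
from math import isqrt
from typing import List, Sequence

def k_bin_discretize(values: Sequence[float]) -> List[int]:
    """Equal-frequency k-bin discretization with k = floor(sqrt(N))."""
    n = len(values)
    k = max(1, isqrt(n))
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // n, k - 1)  # bin index in {0, ..., k-1}
    return bins
      </preformat>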
      <p>Note that in the splitting attribute selection process, real-valued attributes are treated the same as
categorical attributes.</p>
      <p>Once a real-valued attribute is chosen, a splitting point is found in the same way as in C4.5.
Note that a real-valued attribute can be chosen again in the attribute selection on descendant nodes.
For processing real-valued attributes, this algorithm uses the same additional time as the classical version of the
generic decision tree constructing algorithm (see Algorithm 3). In particular, we need to sort the data set
for selecting thresholds and to split the data by the selected real-valued attribute.</p>
    </sec>
    <sec id="sec-10">
      <title>4.2. Using a Self-balancing Binary Search Tree</title>
      <p>
        We use a self-balancing binary search tree as the data structure for storing the class counters. As a
self-balancing binary search tree, we can use the Red-Black tree [12] or the AVL tree [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ]. A
self-balancing binary search tree contains only the indexes with a non-zero value; all other values are treated as zero.
The running time of adding a new index (key) to the data structure is $O(\log z)$, where $z$ is the number
of indexes with non-zero values. The running time of removing and searching is the same. The running
time of removing all indexes from the data structure is $O(z)$.
      </p>
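      <p>Python has no built-in self-balancing binary search tree, so the sketch below uses a dictionary as a stand-in that keeps the same invariant (only non-zero counters are stored); a Red-Black or AVL tree would give the $O(\log z)$ guarantees used in Theorem 2, while the dictionary gives expected $O(1)$ access.</p>
      <preformat>
from collections import defaultdict

class NonZeroCounts:
    """Class counters that store only non-zero values (dictionary stand-in
    for the self-balancing binary search tree described above)."""

    def __init__(self):
        self._cnt = defaultdict(int)

    def add(self, class_index: int, delta: int = 1) -> None:
        self._cnt[class_index] += delta
        if self._cnt[class_index] == 0:
            del self._cnt[class_index]   # keep only non-zero values

    def items(self):
        """Iterate over the non-zero (class, count) pairs only; summing or
        clearing costs O(z) for z non-zero entries instead of O(d)."""
        return self._cnt.items()
      </preformat>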
      <sec id="sec-10-1">
        <title>Theorem 2</title>
        <p>The running time of the generic decision tree constructing algorithm that uses a self-balancing binary
search tree is $O(h \cdot M \cdot N(\log d + \log N))$. (See Appendix B.)</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5. Quantum Improvement</title>
      <p>
        We use the Dürr-Høyer algorithm for maximum search [
        <xref ref-type="bibr" rid="ref8">10</xref>
        ] and a modification of Grover's search
algorithm [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ]. These quantum algorithms help us to speed up the decision tree building process.
      </p>
      <sec id="sec-11-1">
        <title>Lemma 1</title>
        <p>Let $f: \{1, \dots, M\} \to \mathbb{R}$ be a function such that the running time of computing $f(j)$ is $O(T)$. Then a
quantum algorithm can be constructed that finds an argument of the maximal $f(j)$; the expected running
time of the algorithm is $O(\sqrt{M} \cdot T)$ and the success probability is at least $\frac{1}{2}$.</p>
        <p>Using this lemma, we can replace the classical maximum search over attributes in the test selection function and use
the attribute processing procedure as the function $f$. For reducing the error probability, we
repeat the maximum finding process $k$ times. After that, we choose the best solution. The
procedure is below (Algorithm 6).</p>
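        <p>The effect of the repetitions on the success probability can be checked directly; the numbers below only illustrate the bound $(1 - 2^{-k})^K$ and are not results from the paper.</p>
        <preformat>
# Boosting the Durr-Hoyer success probability by repetition: one run succeeds
# with probability at least 1/2, so k independent runs (keeping the best of
# the k answers) fail only if all runs fail.
def tree_success_probability(k: int, inner_nodes: int) -> float:
    per_node = 1 - 0.5 ** k              # at least 1 - 2^{-k} per inner node
    return per_node ** inner_nodes       # all inner nodes must succeed

# Example: with K = 100 inner nodes, k = 12 repetitions already give
# (1 - 2^{-12})^100, i.e. about 0.976.
print(tree_success_probability(12, 100))
        </preformat>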
      </sec>
      <sec id="sec-11-2">
        <title>Theorem 3</title>
        <p>The running time of the quantum algorithm is $O(h \cdot \sqrt{M} \cdot N(d + \log N) \cdot k)$, where $k$ is the number of
repetitions of the maximum search. The success probability
of the quantum algorithm is $(1 - 2^{-k})^K$, where $K$ is the number of inner nodes of the tree. (See Appendix
C.)</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>6. Conclusion</title>
      <p>We suggest a version of the generic decision tree constructing algorithm with a self-balancing tree
that works faster than the known classical algorithm. After that, we have presented the quantum
version of the generic decision tree constructing algorithm for the classification problem. Our algorithm
works in $O(h \cdot \sqrt{M} \cdot N(d + \log N))$ versus $O(h \cdot M \cdot N(d + \log N))$ in the classical generic case.</p>
    </sec>
    <sec id="sec-13">
      <title>7. Acknowledgements</title>
      <p>A part of the reported study was funded by RFBR according to the research project
No. 20-37-70080. The research in Section 4.1 was funded by the subsidy allocated to Kazan Federal University
for the state assignment in the sphere of scientific activities, project No. 0671-2020-0065.</p>
    </sec>
    <sec id="sec-14">
      <title>9. Appendix</title>
    </sec>
    <sec id="sec-15">
      <title>A The Proof of Theorem 1</title>
      <sec id="sec-15-1">
        <title>Theorem 1</title>
        <p>The running time of the generic tree constructing algorithm is $O(h \cdot M \cdot N(d + \log N))$.</p>
      </sec>
      <sec id="sec-15-2">
        <title>Proof</title>
        <p>The attribute processing subroutine takes the main time. That is why we focus on analyzing this
procedure.</p>
        <p>The running time for computing element counts by classes for real-valued attributes is $O(|B| \cdot d)$.
The running time for the sorting subroutine is $O(|B| \log |B|)$. The running time of computing the
best reduction for one threshold is $O(d)$. The running time of calculating the best reduction for all
thresholds is $O(|B| \cdot d)$. Additionally, we should initialize the arrays of counters, which takes $O(d)$. The total
complexity of processing a real-valued attribute is $O(|B| \cdot d + |B| \log |B|)$.</p>
        <p>Let us consider a discrete-valued attribute. The running time of processing all vectors is $O(|B| \cdot d)$. An
impurity reduction $\Delta\varphi$ for some discrete attribute is calculated with $O(T_j \cdot d)$ running time,
where $T_j$ is the number of attribute values. An impurity before the cut is calculated with
$O(d)$ running time; an impurity after the cut is calculated in $O(T_j \cdot d)$. Since $T_j \le |B|$, the running time of
processing one discrete-valued attribute is $O(|B| \cdot d)$.</p>
        <p>Note that if we consider all sets of one level of the decision tree, then we collect all elements of
$T$. Therefore, the total complexity for one level and all $M$ attributes is $O(M \cdot N(d + \log N))$, and the total complexity for
the whole tree is $O(h \cdot M \cdot N(d + \log N))$.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>B The Proof of Theorem 2</title>
      <sec id="sec-16-1">
        <title>Theorem 2</title>
        <p>The running time of the generic decision tree constructing algorithm which is based on a self-balancing
binary search tree is $O(h \cdot M \cdot N(\log d + \log N))$.</p>
      </sec>
      <sec id="sec-16-2">
        <title>Proof</title>
        <p>The proof follows from the proof of Theorem 1. When calculating the values of the class counters,
the algorithm of Theorem 1 should reassign the unchanged values for every class on each new
object processing; this procedure takes $O(|B| \cdot d)$ steps. With this improvement, we can skip these
reassigning operations, and the running time for processing a real-valued attribute becomes
$O(|B| \log |B| + |B| \log d) = O(|B|(\log |B| + \log d))$; for a discrete-valued attribute, it is $O(|B| \log d)$, because we
process each vector one by one and recompute variables, which takes only $O(\log d)$ steps for updating
the values of the counters and $O(1)$ steps for other actions. Therefore, the total complexity is $O(h \cdot M \cdot N(\log d + \log N))$.</p>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>C The Proof of Theorem 3</title>
      <sec id="sec-17-1">
        <title>Theorem 3</title>
        <p>The running time of the quantum algorithm is $O(h \cdot \sqrt{M} \cdot N(d + \log N) \cdot k)$, where $k$ is the
number of repetitions of the maximum search. The success
probability of the quantum algorithm is $(1 - 2^{-k})^K$, where $K$
is the number of inner nodes (not leaves).</p>
      </sec>
      <sec id="sec-17-2">
        <title>Proof</title>
        <p>The running time of the attribute processing procedure on a subset $B$ is $O(|B|(d + \log |B|))$. So, by Lemma 1,
the running time of the quantum maximum search over the $M$ attributes is $O(\sqrt{M} \cdot |B|(d + \log |B|))$.
With repeating the algorithm $k$ times, the running time is $O(k \cdot \sqrt{M} \cdot |B|(d + \log |B|))$. If we sum
the running time for all nodes, then we obtain $O(h \cdot \sqrt{M} \cdot N(d + \log N) \cdot k)$.</p>
        <p>The success probability of the Dürr-Høyer algorithm is at least $\frac{1}{2}$. We call it $k$ times and
choose a maximum among the $k$ obtained values of gain ratios. Then, we find a correct attribute for one
node with a success probability of at least $1 - \left(\frac{1}{2}\right)^k$. We should find correct attributes for
all nodes except leaves. Thus, the success probability for the whole tree is at least
$\left(1 - 2^{-k}\right)^K$, where $K$
is the number of internal nodes (not leaves).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[2]. C5</source>
          .
          <article-title>0: An informal tutorial (</article-title>
          <year>2019</year>
          ), url=https://www.rulequest.com/see5-unix.
          <source>html [3]</source>
          <string-name>
            <given-names>. F.</given-names>
            <surname>Ablayev</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ablayev</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>J</given-names>
          </string-name>
          .,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khadiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Salikhova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>On quantum methods for machine learning problems part i: Quantum tools</article-title>
          .
          <source>Big Data Mining and Analytics</source>
          pp.
          <fpage>41</fpage>
          -
          <lpage>55</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4
          <string-name>
            <given-names>]. F.</given-names>
            <surname>Ablayev</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ablayev</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>J</given-names>
          </string-name>
          .,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khadiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Salikhova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>On quantum methods for machine learning problems part ii: Quantum classification algorithm</article-title>
          .
          <source>Big Data Mining and Analytics</source>
          pp.
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]. G. M.
          <article-title>Adel'son-Vel'skii and</article-title>
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Landis</surname>
          </string-name>
          .
          <article-title>An algorithm for organization of information</article-title>
          .
          <source>In Doklady Akademii Nauk</source>
          , volume
          <volume>146</volume>
          , pages
          <fpage>263</fpage>
          -
          <lpage>266</lpage>
          . Russian Academy of Sciences,
          <year>1962</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>. A.</given-names>
            <surname>Ambainis</surname>
          </string-name>
          .
          <article-title>Understanding quantum algorithms via query complexity</article-title>
          .
          <source>arXiv:1712.06349</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]. G. Brassard,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mosca</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Tapp</surname>
          </string-name>
          .
          <article-title>Quantum amplitude amplification and estimation</article-title>
          .
          <source>Contemporary Mathematics</source>
          ,
          <volume>305</volume>
          :
          <fpage>53</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]. T. H Cormen,
          <string-name>
            <given-names>C. E</given-names>
            <surname>Leiserson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L</given-names>
            <surname>Rivest</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Stein</surname>
          </string-name>
          . Introduction to Algorithms. McGrawHill,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]. Ronald De Wolf.
          <article-title>Quantum computing and communication complexity</article-title>
          .
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10].
          <string-name>
            <given-names>C.</given-names>
            <surname>Durr</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Hoyer</surname>
          </string-name>
          .
          <article-title>A quantum algorithm for finding the minimum</article-title>
          .
          <source>arXiv:quant-ph/9607014</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [11].L.
          <string-name>
            <surname>Grover</surname>
          </string-name>
          ,
          <article-title>A fast quantum mechanical algorithm for database search</article-title>
          .
          <source>In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing</source>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>219</lpage>
          (
          <year>1996</year>
          ) [
          <volume>12</volume>
          ]
          <string-name>
            <surname>.L. J Guibas</surname>
            and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Sedgewick</surname>
          </string-name>
          ,
          <article-title>A dichromatic framework for balanced trees</article-title>
          .
          <source>In Proceedings of SFCS 1978</source>
          , pages
          <fpage>8</fpage>
          -
          <lpage>21</lpage>
          . IEEE,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [13].
          <string-name>
            <given-names>S.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Bounded error quantum algorithms zoo</article-title>
          . https://math.nist.gov/quantum/zoo.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [14].
          <string-name>
            <given-names>K.</given-names>
            <surname>Khadiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kravchenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Serov</surname>
          </string-name>
          ,
          <article-title>On the quantum and classical complexity of solving subtraction games</article-title>
          .
          <source>In Proceedings of CSR</source>
          <year>2019</year>
          , volume
          <volume>11532</volume>
          <source>of LNCS</source>
          , pages
          <fpage>228</fpage>
          -
          <lpage>236</lpage>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [15].
          <string-name>
            <given-names>K.</given-names>
            <surname>Khadiev</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Safina</surname>
          </string-name>
          ,
          <article-title>Quantum algorithm for dynamic programming approach for dags. applications for Zhegalkin polynomial evaluation and some problems on DAGs</article-title>
          .
          <source>In Proceedings of UCNC</source>
          <year>2019</year>
          , volume
          <volume>4362</volume>
          <source>of LNCS</source>
          , pages
          <fpage>150</fpage>
          -
          <lpage>163</lpage>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [16
          <string-name>
            <given-names>].R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          ,
          <article-title>Data mining tasks and methods: Classification: decision-tree discovery. Handbook of data mining and knowledge discovery</article-title>
          . Oxford University Press,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17].
          <string-name>
            <surname>D. Kopczyk,</surname>
          </string-name>
          <article-title>Quantum machine learning for data scientists</article-title>
          .
          <source>arXiv preprint arXiv:1804.10068</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18].L.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          <string-name>
            <surname>Olshen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>J and Stone, Classification and regression trees</article-title>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <surname>.M. A Nielsen</surname>
            and
            <given-names>I. L</given-names>
          </string-name>
          <string-name>
            <surname>Chuang</surname>
          </string-name>
          .
          <article-title>Quantum computation and quantum information</article-title>
          . Cambridge univ. press,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [20].
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          ,
          <article-title>Induction of decision trees</article-title>
          .
          <source>Machine learning</source>
          , pages
          <fpage>81</fpage>
          -
          <lpage>106</lpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [21].
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          .
          <article-title>Improved use of continuous attributes in c4.5</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          , pages
          <fpage>77</fpage>
          -
          <lpage>90</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [22].
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Maimon</surname>
          </string-name>
          ,
          <article-title>Data mining with decision trees: theory and applications</article-title>
          ,
          <source>World Scientific Publishing Co. Pte. Ltd</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [23].
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuld</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sinayskiy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Petruccione</surname>
          </string-name>
          ,
          <article-title>The quest for a quantum neural network</article-title>
          .
          <source>Quantum Information Processing</source>
          ,
          <volume>13</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2567</fpage>
          -
          <lpage>2586</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [24].
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuld</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sinayskiy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Petruccione</surname>
          </string-name>
          ,
          <article-title>An introduction to quantum machine learning</article-title>
          .
          <source>Contemporary Physics</source>
          ,
          <volume>56</volume>
          (
          <issue>2</issue>
          ),
          <fpage>172</fpage>
          -
          <lpage>185</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [25].
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A fast decision tree learning algorithm</article-title>
          , volume
          <volume>1</volume>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>