<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Decision Table Decomposition Using Dynamic Programming Classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Mankowski</string-name>
          <email>m.mankowski@stud.elka.pw.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tadeusz Luba</string-name>
          <email>luba@tele.elka.pw.edu.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cezary Jankowski</string-name>
          <email>c.jankowski@stud.elka.pw.edu.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Warsaw University of Technology, Institute of Radioelectronics</institution>
          ,
          <addr-line>Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Warsaw University of Technology, Institute of Telecommunications</institution>
          ,
          <addr-line>Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <fpage>34</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>Decision table decomposition is a method that decomposes a given decision table into an equivalent set of decision tables. Decomposition can enhance the quality of knowledge discovered from databases by simplifying the data mining task. This paper describes the decision table decomposition method and evaluates it for data classification. Additionally, a novel method of obtaining attribute sets for decomposition is introduced. Experimental results demonstrate that decomposition can reduce memory requirements while preserving classification accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Mining</kwd>
        <kwd>Decomposition</kwd>
        <kwd>Classification</kwd>
        <kwd>Dynamic Programming</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The increasing amount of data requires the development of new data analysis methods. A common
approach in data mining is to make predictions based on decision tables.
Decomposition of a decision table into smaller subtables can be obtained by the divide-and-conquer
strategy. This idea comes from logic synthesis and functional decomposition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The fundamentals of decision systems and logic synthesis are different, but there are
many similarities between them. A decision system is usually described by a
decision table, and the combinational logic of a digital system by a truth table. Input variables
of digital systems correspond to conditional attributes. Therefore, many concepts from
logic synthesis can be extended to data mining. Functional decomposition can be
used to build a hierarchical decision system.</p>
      <p>
        Functional decomposition was first used in the logic synthesis of digital systems.
There, decomposition involves breaking a large logic function, which is
difficult to implement, into several smaller ones, which are easier to implement. A
similar problem in machine learning relies on disassembling a decision table into
subtables in such a way that the original decision table can be recreated through a
series of operations corresponding to hierarchical decision making. Most
importantly, we can induce noticeably simpler decision rules and trees for the
resulting components that finally make the same decision as the original decision table.
[
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
        ]
      </p>
      <p>
        For the evaluation of decomposition, decision tree and decision rule classifiers based on
extensions of dynamic programming [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] were used. Decision trees and rules were
sequentially optimized for different cost functions (for example, the number of
misclassifications and the depth of decision trees). For decision trees and rules, this
approach allows describing a set of trees or rules by a directed acyclic graph (DAG).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Basic Concepts</title>
      <sec id="sec-2-1">
        <title>Preliminary Notions</title>
        <p>An information system is a pair A = (U, A), where U is a non-empty, finite set of objects
called the universe, and A is a non-empty, finite set of attributes, i.e., each element a ∈ A
is a function from U into Va, where Va is the domain of a, called the value set of a. Then,
the function ρ maps the product of U and A into the set of all values. The value of an
attribute for a specific object is denoted by ρ(ut, ai), with ut ∈ U, ai ∈ A.</p>
        <p>One or more distinguished attributes from the set A of an information system may indicate
a decision based on the rest of the attributes. Such an information system is called a decision system.
Formally, a decision system is an information system denoted by A = (U, A ∪ D), where
A ∩ D = ∅. Attributes in set A are referred to as conditional attributes, while attributes
in set D are referred to as decision attributes. When the function ρ
maps U × (A ∪ D) into the set of all attribute values, such a system is called a decision
table.</p>
        <p>Let A = (U, A) be an information system. For each subset B ⊆ A we define
the B-indiscernibility relation INDA(B):</p>
        <p>INDA(B) = {(up, uq) ∈ U² : ∀ai ∈ B, ρ(up, ai) = ρ(uq, ai)}   (1)
The attribute values of ai, i.e. ρpi = ρ(up, ai) and ρqi = ρ(uq, ai), are compatible (ρpi ∼
ρqi) if and only if ρpi = ρqi or ρpi = ∗ or ρqi = ∗, where "∗" represents the attribute
value "do not care". Otherwise, ρpi and ρqi are not compatible (ρpi ≁ ρqi).</p>
        <p>The consequence of this definition is compatibility relation COMA(B) associated
with every B ⊆ A :</p>
        <p>
          COMA(B) = {(up, uq) ∈ U² : ∀ai ∈ B, ρ(up, ai) ∼ ρ(uq, ai)}   (2)
COMA(B) classifies objects by grouping them into compatibility classes, i.e.
U/COMA(B), where B ⊆ A. The collection of subsets U/COMA(B) is called an r-partition
on U and is denoted by ΠA(B). An r-partition on a set U may be viewed as a collection of
non-disjoint subsets of U whose union is equal to U. All symbols and operations
of partition algebra [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] are applicable to r-partitions. The r-partition generated by a set
B is the product of the r-partitions generated by the attributes ai ∈ B:
ΠA(B) = ∩i ΠA({ai})   (3)
If B = {ai1, ..., aik}, the product can be expressed as: Π(B) = Π(ai1) · ... · Π(aik).
We will often write · instead of ∩.
        </p>
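        <p>As an illustration, the compatibility relation (2) and the grouping of objects into non-disjoint blocks can be sketched in a few lines of Python (the toy table and attribute values below are hypothetical, not from the paper; each block is simply the set of objects compatible with one fixed object, a straightforward way to enumerate candidate classes):</p>
        <p>
```python
# Hypothetical toy table: rows are objects, columns are conditional
# attributes; "*" stands for the "do not care" value.
U = [
    ("0", "1", "*"),
    ("0", "1", "0"),
    ("1", "*", "0"),
    ("1", "0", "1"),
]

def compatible(row_p, row_q, attrs):
    """Relation (2): values agree on every chosen attribute, up to '*'."""
    return all(row_p[i] == row_q[i] or row_p[i] == "*" or row_q[i] == "*"
               for i in attrs)

def r_partition(rows, attrs):
    """For each object, collect the set of objects compatible with it.

    The resulting blocks may overlap, matching the non-disjoint
    character of an r-partition.
    """
    blocks = []
    for p, row in enumerate(rows):
        block = frozenset(q for q, other in enumerate(rows)
                          if compatible(row, other, attrs))
        if block not in blocks:
            blocks.append(block)
    return blocks
```
        </p>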
      </sec>
      <sec id="sec-2-2">
        <title>Hierarchical Decomposition</title>
        <p>To compress data and accelerate computations, hierarchical decomposition can be
applied. The goal is to break down a decision table into two smaller subtables.</p>
        <p>Let F be a functional dependency D = F(A) for a consistent decision system A =
(U, A∪D), where A is a set of conditional attributes and D is a set of decision attributes.
Let B1, B2 be subsets of A such that A = B1 ∪ B2 and B1 ∩ B2 = ∅. A simple
hierarchical decomposition relative to B1, B2 exists for F(A) if and only if:
F(A) = H(B1, G(B2)) = H(B1, δ)
(4)
where G and H represent the following functional dependencies: G(B2) = δ and
H(B1, δ) = D, where δ is an intermediate attribute. The outputs of functions F(A) and
H are exactly the same. In other words, we try to find a function H depending on the
variables of the set B1 as well as on the output δ of a function G depending on the set B2.</p>
        <p>In Theorem 1, the r-partition ΠG represents component G, and the product of
r-partitions Π(B1) and ΠG corresponds to H. The decision tables of the resulting
components can be easily obtained from these r-partitions.</p>
        <p>According to Theorem 1, the main problem is to find a partition ΠG. To solve
this problem, it is appropriate to consider a subset B2 of the original attributes and the m-block
partition Π(B2) = {K1, K2, ..., Km} generated by that subset. Two blocks
Ki, Kj of partition Π(B2) are compatible if and only if the partition Π′ obtained from
Π(B2) by joining the blocks Ki and Kj into a single block Kij (without changing the
other ones) satisfies equation (5), i.e. iff Π(B1) · Π′ ≤ Π(D). Otherwise, Ki, Kj are
incompatible.</p>
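        <p>For ordinary (disjoint) partitions over object indices, this compatibility test can be sketched as follows (a hypothetical helper assuming the decomposition condition Π(B1) · ΠG ≤ Π(D); partitions are represented as sets of frozensets of object indices, and "do not care" values are ignored for simplicity):</p>
        <p>
```python
def product(p1, p2):
    """Block-wise intersection of two partitions."""
    return {b1 & b2 for b1 in p1 for b2 in p2 if b1 & b2}

def refines(p1, p2):
    """p1 <= p2: every block of p1 lies inside some block of p2."""
    return all(any(b1 <= b2 for b2 in p2) for b1 in p1)

def mergeable(Ki, Kj, pi_B2, pi_B1, pi_D):
    """Ki, Kj are compatible iff, after merging them into one block,
    the product of Pi(B1) and the merged partition still refines Pi(D)."""
    merged = (pi_B2 - {Ki, Kj}) | {Ki | Kj}
    return refines(product(pi_B1, merged), pi_D)
```
        </p>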
        <p>For the decision table from Table 1 and sets of attributes B1 = {a0, a5}, B2 =
{a1, a2, a3, a4}, the following set of incompatible pairs can be found:
E = {(K1, K8), (K2, K4), (K2, K8), (K3, K7), (K4, K5)}. A subset of n partition
blocks, Π(B2) = {Ki1, Ki2, ..., Kin}, where Kij ∈ Π(B2), is a compatible class
of Π(B2) partition blocks iff all blocks of that subset are pairwise compatible. A
compatibility class is referred to as a maximal compatibility class (MCC) iff it is not
contained in any other compatibility class of the partition concerned.</p>
        <p>The decomposition process may be interpreted in terms of an incompatibility graph
(Fig. 1). The edges represent the incompatible pairs of partition Π(B2): (K1, K8),
(K2, K4), (K2, K8), (K3, K7), (K4, K5). It is clearly visible that a proper coloring
of the graph specifies the compatible classes: {K1, K2, K3, K5}, {K4, K6, K7, K8}
and, as a consequence, the partition ΠG = {0, 1, 2, 4, 5, 7, 9; 3, 6, 8}.</p>
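        <p>The coloring step can be sketched with a greedy algorithm over the incompatible pairs from the example (blocks with no conflicts, such as K6, may land in either class, so the grouping can differ slightly from the one given in the text):</p>
        <p>
```python
# Incompatible pairs from the example; blocks K1..K8 are graph vertices.
edges = [(1, 8), (2, 4), (2, 8), (3, 7), (4, 5)]
vertices = range(1, 9)

# Build the adjacency structure of the incompatibility graph.
adj = {v: set() for v in vertices}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Greedy proper coloring: give each vertex the smallest color
# not already used by one of its neighbors.
color = {}
for v in vertices:
    used = {color[n] for n in adj[v] if n in color}
    color[v] = next(c for c in range(len(vertices)) if c not in used)

# Blocks sharing a color are pairwise compatible.
classes = {}
for v, c in color.items():
    classes.setdefault(c, set()).add(v)
```
With this vertex order the two classes come out as {1, 2, 3, 5, 6} and {4, 7, 8}; any proper two-coloring yields a valid pair of compatible classes.
        </p>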
        <p>
          Another approach to building an incompatibility graph is to create a labeled
partition matrix [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ] (Table 2). It should be noted that the columns represent all possible
combinations of the attribute values in B2. Each column thus denotes the behavior of the
decision table when the attributes in the B1 set are constant. Therefore, each column can be
treated as an object from the decision table. To build the incompatibility graph, it is necessary to
apply equation (2) to each pair of columns. When the compatibility relation is met,
the pair is compatible; otherwise, it is incompatible.
Simple hierarchical decomposition requires dividing the set of conditional attributes A
into two disjoint subsets B1 and B2. The proposed idea of obtaining these sets is based on an
attribute relationship called attribute dependency from Rough Set theory [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Let C and B be sets of attributes. B depends entirely on a set of attributes
C, denoted C ⇒ B, if all values of attributes from B are uniquely determined by the
values of attributes from C. B depends on C in degree k, 0 ≤ k ≤ 1, where
k = γ(C, B) = |POSC(B)| / |U|   (6)
and
POSC(B) = ∪X∈U/B C∗(X)   (7)
called the positive region of the partition U/B with respect to C, is the set of all
elements of U that can be uniquely classified to blocks of the partition U/B by means
of C.</p>
        <p>The proposed method allows us to measure the dependency between all possible pairs of
conditional attributes and the decision attribute. The related dependency of one conditional
attribute can be generated from a given information system A = (U, A ∪ {d}), where
A = {a0, ..., ak} is the set of conditional attributes and d is the decision attribute:
r(x) = (Σi=0..k γ({x, ai}, {d})) / |A|,   x ∈ A   (8)</p>
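        <p>A small sketch of the dependency degree γ(C, {d}) and the related dependency r(x), assuming a toy table of dictionaries (illustrative data, not from the paper):</p>
        <p>
```python
from collections import defaultdict

def gamma(rows, C, D):
    """|POS_C(D)| / |U|: fraction of objects whose C-values determine D."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in C)].add(tuple(row[a] for a in D))
    consistent = sum(1 for row in rows
                     if len(groups[tuple(row[a] for a in C)]) == 1)
    return consistent / len(rows)

def related_dependency(rows, x, attrs, d):
    """r(x): mean of gamma({x, ai}, {d}) over all conditional attributes."""
    return sum(gamma(rows, [x, ai], [d]) for ai in attrs) / len(attrs)

# Hypothetical table: a0 alone determines d for half the objects only.
rows = [{"a0": 0, "a1": 0, "d": 0}, {"a0": 0, "a1": 1, "d": 1},
        {"a0": 1, "a1": 0, "d": 1}, {"a0": 1, "a1": 0, "d": 1}]
```
        </p>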
        <p>The above function of related dependency is used for the comparison of attributes. This
function is calculated for each attribute, and the results are then sorted by the
value of function r. The most dependent attributes are put in set B1, which corresponds
to the final decision table H.
Example 1. For Table 1, the first step of the algorithm is to build a matrix of attribute
dependency between each pair of conditional attributes and the decision attribute. Then the mean
of the partial results is calculated, which is represented by the related dependency r(x) in
Table 3. These results can be sorted by value and divided into two equinumerous sets.
If the number of attributes is odd, then |B1| = |B2| + 1. An example of sorting
and assigning attributes is presented in Table 4. The calculation of the related
dependency r(x) allows formulating an accurate method for determining sets B1 and B2, i.e.
B1 = {a1, a3, a5} and B2 = {a0, a2, a4}. Therefore the decomposition is as follows:
F(A) = H({a1, a3, a5} , G({a0, a2, a4})).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Classification Schema</title>
      <p>
        Hierarchical Decision Making.
Due to the decomposition of decision tables, a hierarchical decision
system is needed to evaluate this method for the purpose of classification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This method is
based on disassembling the decision table into subtables. The most important
advantage is the possibility of inducing a simpler classification model, for example shorter
decision rules or a smaller decision tree, for the resulting components that finally make the
same decision as the original decision table. Following the process of
decomposition, we propose to make decisions hierarchically. For the attributes in B2, a
prediction model calculating the intermediate decision was built. This intermediate
decision was then used together with the attributes from B1 to build the final
classification model. On the basis of both, i.e., these attributes and the intermediate decision
δ, the final decision was taken (Fig. 2).
For decision prediction, the approach based on an extension of dynamic programming
was used. These methods were developed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. They allow sequential optimization
of decision trees and rules relative to different cost functions, in particular between the
number of misclassifications and the depth of decision trees or the length of decision
rules. The proposed algorithm constructs a directed acyclic graph (DAG) that represents
the structure of subtables of the initial table. For a decision table A, separable subtables of A,
described by systems of equalities of the kind ai = b, where ai is an attribute and b is one of
its values, are considered as subproblems. Classification and optimization of decision
trees and rules are discussed in detail in [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ].
      </p>
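      <p>The hierarchical scheme, where component G predicts the intermediate decision δ from B2 and component H combines B1 with δ, can be sketched with simple majority-vote lookup tables standing in for the dynamic-programming trees and rules (a hypothetical simplification; in particular, training G directly against the decision d sidesteps the choice of the partition ΠG):</p>
      <p>
```python
from collections import Counter, defaultdict

def fit_table(rows, in_cols, out_col):
    """Majority decision for every combination of the input columns."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[c] for c in in_cols)][row[out_col]] += 1
    return {key: cnt.most_common(1)[0][0] for key, cnt in votes.items()}

def fit_hierarchical(rows, B1, B2, d):
    # Stage 1: G maps the B2 attributes to the intermediate decision
    # delta (trained directly against d here, a deliberate shortcut).
    G = fit_table(rows, B2, d)
    staged = [dict(row, delta=G[tuple(row[c] for c in B2)]) for row in rows]
    # Stage 2: H maps B1 plus delta to the final decision.
    H = fit_table(staged, list(B1) + ["delta"], d)
    return G, H

def predict(G, H, B1, B2, row):
    delta = G.get(tuple(row[c] for c in B2))
    return H.get(tuple([row[c] for c in B1] + [delta]))
```
        </p>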
      <p>
        In the applied approach to the optimization of decision trees, a directed acyclic graph
(DAG) represents a set of CART-like decision trees [
        <xref ref-type="bibr" rid="ref11">11</xref>
          ]. A set of Pareto optimal points
for the bi-criteria optimization problem is constructed. Two types of decision tree pruning
have been compared. The first is multi-pruning: using a validation subtable
(part of the training subtable), a decision tree with the minimum number of
misclassifications is found for each Pareto optimal point. The second, an improvement of multi-pruning, is
to use only the best split over a small number of attributes in each node of the DAG,
instead of the best split over all attributes. This pruning is called restricted
multi-pruning.
      </p>
      <p>
        A system of decision rules as a prediction model was also considered. As in the case
of decision trees, we used a dynamic programming algorithm to create and optimize
decision rules [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        To evaluate the proposed decomposition algorithm and the hierarchical decision making
idea, the Dagger software system created at King Abdullah University of Science and
Technology was used. The proposed algorithm has been tested on categorical datasets from
the UCI ML Repository [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The data sets were preprocessed. Duplicate rows were removed.
There were some inconsistencies, i.e., instances with the same values of
conditional attributes but different decisions. Each such set was replaced
with a single row carrying the most common decision. Results were obtained using
twofold cross-validation repeated 100 times, each time with a different randomly
selected testing subset. From the training part, 70% of the rows were used to generate decision
trees and the remaining part was reserved for validation.
      </p>
      <p>data set / rows / attributes / compression (SD/S):
flags 196 27 0.801
house 281 17 0.395
kr-vs-kp 3198 37 0.209
breast cancer 268 10 0.754
cars 1730 7 0.261
spect-test 169 23 0.751
dermatology 366 35 0.352</p>
      <p>The advantage of decomposition is due to the fact that two components (i.e. tables
G and H) require less memory than the original decision table. Let us express the size
of a table as S = n · Σi bi, where n is the number of objects and bi = ⌈log2 |Vai|⌉
is the number of bits required to represent attribute ai. Then, after decomposition,
we may compare the size of the two components with that of the original table (prior to
decomposition). Results of compression are presented in Table 5.</p>
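      <p>The size measure and the compression ratio SD/S can be illustrated as follows (all table dimensions below are hypothetical, not taken from Table 5):</p>
      <p>
```python
import math

def table_size(n_rows, value_set_sizes):
    """S = n * sum_i b_i, with b_i = ceil(log2 |V_ai|) bits per attribute."""
    return n_rows * sum(math.ceil(math.log2(v)) for v in value_set_sizes)

# Hypothetical original table: 100 objects, six 4-valued conditional
# attributes plus a binary decision.
S = table_size(100, [4] * 6 + [2])

# After decomposition: G covers three attributes plus the binary
# intermediate decision delta; H covers the other three plus delta
# and the decision. Assume G needs only 16 distinct rows.
S_G = table_size(16, [4] * 3 + [2])
S_H = table_size(100, [4] * 3 + [2, 2])
ratio = (S_G + S_H) / S   # SD/S below 1 means the split saves memory
```
        </p>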
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        For most measurements, accuracy remains at a comparable level or is slightly
better, with the biggest improvement occurring when dynamic programming rules were used.
Effective data aggregation algorithms have been sought for a long time due to the
increasing complexity of databases used in practice. Recently, it has been suggested
that decomposition algorithms, previously used mainly in the logic synthesis
of digital systems, may be applied for that purpose [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This approach is indeed very
relevant as decision systems and logic circuits are very similar. Bearing this in mind,
this paper demonstrates that a typical algorithm for the decomposition of binary data
tables (representing Boolean functions) may be applied to the decomposition of data
represented by multi-valued attributes used in decision systems.
      </p>
      <p>The paper indicates the advantages and possibilities of decomposition algorithms
for the purpose of classification. Results of experiments performed with the proposed
decomposition algorithm and the Dagger system have been presented. A new attribute
selection criterion describing partitions for decomposition has been introduced and used in
the experiments. The proposed method is particularly efficient in data compression. It
allows building a simple classification model and saving memory while keeping the
accuracy. To achieve better accuracy, data set decomposition requires further
research, particularly on attribute selection criteria. Also, there is a need to extend
the decomposition to deal with continuous attributes and noise in data.</p>
      <p>Acknowledgments. The authors would like to thank professor Mikhail Moshkov and
his team for their support while writing this paper. This research has been supported by
King Abdullah University of Science and Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Perkowski</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luba</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grygiel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burkey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iliev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolsteren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lisanke</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malvi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , et al.:
          <article-title>Unified approach to functional decompositions of switching functions</article-title>
          .
          <source>PSU Electr. Engn. Dept. Report</source>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Łuba</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lasocki</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rybnik</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>An implementation of decomposition algorithm and its application in information systems analysis and logic synthesis</article-title>
          . In Ziarko, W., ed.: Rough Sets, Fuzzy Sets and Knowledge Discovery
          . Workshops in Computing. Springer London (
          <year>1994</year>
          )
          <fpage>458</fpage>
          -
          <lpage>465</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Luba</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lasocki</surname>
          </string-name>
          , R.:
          <article-title>On unknown attribute values in functional dependencies</article-title>
          .
          <source>In: Proceedings of the International Workshop on Rough Sets and Soft Computing</source>
          , San Jose, CA. (
          <year>1994</year>
          )
          <fpage>490</fpage>
          -
          <lpage>497</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maimon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Data mining using decomposition methods</article-title>
          . In Maimon,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Rokach</surname>
          </string-name>
          , L., eds.
          <source>: Data Mining and Knowledge Discovery Handbook</source>
          . Springer US (
          <year>2010</year>
          )
          <fpage>981</fpage>
          -
          <lpage>998</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Alkhalid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chikalov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hussain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moshkov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zielosko</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Optimization and analysis of decision trees and rules: dynamic programming approach</article-title>
          .
          <source>International Journal of General Systems</source>
          <volume>42</volume>
          (
          <issue>6</issue>
          ) (
          <year>2013</year>
          )
          <fpage>614</fpage>
          -
          <lpage>634</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Borowik</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Data Mining Approach for Decision and Classification Systems Using Logic Synthesis Algorithms</article-title>
          . Volume 6. Springer International Publishing (
          <year>2014</year>
          )
          <fpage>3</fpage>
          -
          <lpage>23</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zupan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prof</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Bratko,
          <string-name>
            <surname>D.I.:</surname>
          </string-name>
          <article-title>Machine learning based on function decomposition</article-title>
          .
          <source>Technical report</source>
          , University of Ljubljana (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zupan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bohanec</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Experimental evaluation of three partition selection criteria for decision table decomposition</article-title>
          .
          <source>Informatica (Slovenia)</source>
          <volume>22</volume>
          (
          <issue>2</issue>
          ) (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pawlak</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Rough sets</article-title>
          .
          <source>International Journal of Computer &amp; Information Sciences</source>
          <volume>11</volume>
          (
          <issue>5</issue>
          ) (
          <year>1982</year>
          )
          <fpage>341</fpage>
          -
          <lpage>356</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Amin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chikalov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moshkov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zielosko</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Dynamic programming approach to optimization of approximate decision rules</article-title>
          .
          <source>Information Sciences</source>
          <volume>221</volume>
          (
          <issue>0</issue>
          ) (
          <year>2013</year>
          )
          <fpage>403</fpage>
          -
          <lpage>418</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stone</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olshen</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Classification and Regression Trees</article-title>
          . CRC press (
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lichman</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>UCI machine learning repository</article-title>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Maimon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Decomposition methodology for knowledge discovery and data mining</article-title>
          . In Maimon,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Rokach</surname>
          </string-name>
          , L., eds.
          <source>: Data Mining and Knowledge Discovery Handbook</source>
          . Springer US (
          <year>2005</year>
          )
          <fpage>981</fpage>
          -
          <lpage>1003</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>