Evaluation of Decision Table Decomposition Using Dynamic Programming Classifiers

Michal Mankowski¹, Tadeusz Luba², and Cezary Jankowski²

¹ Warsaw University of Technology, Institute of Radioelectronics, Warsaw, Poland
m.mankowski@stud.elka.pw.edu.pl
² Warsaw University of Technology, Institute of Telecommunications, Warsaw, Poland
luba@tele.elka.pw.edu.pl, c.jankowski@stud.elka.pw.edu.pl

Abstract. Decision table decomposition is a method that decomposes a given decision table into an equivalent set of decision tables. Decomposition can enhance the quality of knowledge discovered from databases by simplifying the data mining task. The paper contains a description of a decision table decomposition method and its evaluation for data classification. Additionally, a novel method of obtaining attribute sets for decomposition is introduced. Experimental results demonstrate that decomposition can reduce memory requirements while preserving the accuracy of classification.

Key words: Data Mining, Decomposition, Classification, Dynamic Programming

1 Introduction

The increasing amount of data requires the development of new data analysis methods. A common approach in data mining is to make predictions based on decision tables. Decomposition of a decision table into smaller subtables can be obtained by a divide and conquer strategy. This idea comes from logic synthesis and functional decomposition [1].

The foundations of decision systems and of logic synthesis are different, but there are many similarities between them. A decision system is usually described by a decision table, and the combinational logic of a digital system by a truth table. Input variables of digital systems correspond to conditional attributes. Therefore, many terms from logic synthesis may be carried over to data mining. In particular, functional decomposition can be used to build a hierarchical decision system.

Functional decomposition was first used in the logic synthesis of digital systems. There, decomposition involves breaking large logic functions, which are difficult to implement, into several smaller ones, which are easy to implement. A similar problem in machine learning relies on disassembling the decision table into subsystems in such a way that the original decision table can be recreated through a series of operations corresponding to hierarchical decision making. Most importantly, we can induce noticeably simpler decision rules and trees for the resulting components that finally make the same decision as the original decision table [2-4].

For the evaluation of decomposition, decision tree and decision rule classifiers based on extensions of dynamic programming [5] were used. Decision trees and rules were sequentially optimized for different cost functions (for example, the number of misclassifications and the depth of decision trees). For both trees and rules, this approach allows a set of trees or rules to be described by a directed acyclic graph (DAG).

2 Basic Concepts

2.1 Preliminary Notions

An information system is a pair A = (U, A), where U is a non-empty, finite set of objects called the universe, and A is a non-empty, finite set of attributes, i.e. each element a ∈ A is a function from U into V_a, where V_a is the domain of a, called the value set of a. The function ρ thus maps the product of U and A into the set of all attribute values; the value of an attribute for a specific object is denoted by ρ(u_t, a_i), with u_t ∈ U, a_i ∈ A.

One or more distinguished attributes of an information system may indicate a decision inferred from the rest of the attributes. Such an information system is called a decision system. Formally, a decision system is an information system denoted by A = (U, A ∪ D), where A ∩ D = ∅. Attributes in the set A are referred to as conditional attributes, while attributes in the set D are referred to as decision attributes. When the function ρ maps U × (A ∪ D) into the set of all attribute values, such a system is called a decision table.
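To make these notions concrete, the following minimal Python sketch (ours, not part of the paper; the class and helper names are illustrative) represents a decision table as a mapping ρ from (object, attribute) pairs to values, populated with two objects of the example table discussed below (Table 1).

```python
# A minimal sketch of a decision table A = (U, A ∪ D); names are illustrative.
from typing import Dict, List, Tuple

class DecisionTable:
    def __init__(self, objects: List[str], cond_attrs: List[str],
                 dec_attrs: List[str], values: Dict[Tuple[str, str], int]):
        self.U = objects      # universe of objects
        self.A = cond_attrs   # conditional attributes
        self.D = dec_attrs    # decision attributes, A ∩ D = ∅
        self._rho = values    # rho: U × (A ∪ D) -> attribute values

    def rho(self, u: str, a: str) -> int:
        """Value of attribute a on object u, i.e. rho(u, a)."""
        return self._rho[(u, a)]

# Two objects of the example table (Table 1), with decision attribute d.
dt = DecisionTable(
    objects=["u0", "u1"],
    cond_attrs=["a0", "a1", "a2", "a3", "a4", "a5"],
    dec_attrs=["d"],
    values={("u0", "a0"): 0, ("u0", "a1"): 0, ("u0", "a2"): 0, ("u0", "a3"): 0,
            ("u0", "a4"): 0, ("u0", "a5"): 0, ("u0", "d"): 1,
            ("u1", "a0"): 0, ("u1", "a1"): 0, ("u1", "a2"): 1, ("u1", "a3"): 1,
            ("u1", "a4"): 0, ("u1", "a5"): 0, ("u1", "d"): 1},
)
print(dt.rho("u1", "a2"))  # -> 1
```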
Let A = (U, A) be an information system. For each subset B ⊆ A we define the B-indiscernibility relation IND_A(B):

IND_A(B) = {(u_p, u_q) ∈ U² : ∀ a_i ∈ B, ρ(u_p, a_i) = ρ(u_q, a_i)}    (1)

The values of an attribute a_i, i.e. ρ_pi = ρ(u_p, a_i) and ρ_qi = ρ(u_q, a_i), are compatible (ρ_pi ∼ ρ_qi) if and only if ρ_pi = ρ_qi or ρ_pi = ∗ or ρ_qi = ∗, where "∗" represents the attribute value "do not care". Otherwise, ρ_pi and ρ_qi are not compatible (ρ_pi ≁ ρ_qi). The consequence of this definition is the compatibility relation COM_A(B) associated with every B ⊆ A:

COM_A(B) = {(u_p, u_q) ∈ U² : ∀ a_i ∈ B, ρ(u_p, a_i) ∼ ρ(u_q, a_i)}    (2)

COM_A(B) classifies objects by grouping them into compatibility classes, i.e. U/COM_A(B), where B ⊆ A. The collection of subsets U/COM_A(B) is called an r-partition on U and denoted by Π_A(B). An r-partition on a set U may be viewed as a collection of non-disjoint subsets of U whose union equals U. All symbols and operations of partition algebra [6] are applicable to r-partitions. The r-partition generated by a set B is the product of the r-partitions generated by the attributes a_i ∈ B:

Π_A(B) = ∩_{a_i ∈ B} Π_A({a_i})    (3)

If B = {a_i1, ..., a_ik}, the product can be expressed as Π(B) = Π(a_i1) · ... · Π(a_ik). We will often write · instead of ∩.

2.2 Hierarchical Decomposition

To compress data and accelerate computations, hierarchical decomposition can be applied. The goal is to break down a decision table into two smaller subtables. Let F be a functional dependency D = F(A) for a consistent decision system A = (U, A ∪ D), where A is a set of conditional attributes and D is a set of decision attributes. Let B1, B2 be subsets of A such that A = B1 ∪ B2 and B1 ∩ B2 = ∅. A simple hierarchical decomposition relative to B1, B2 exists for F(A) if and only if:

F(A) = H(B1, G(B2)) = H(B1, δ)    (4)

where G and H represent the following functional dependencies: G(B2) = δ and H(B1, δ) = D, with δ being an intermediate attribute. The outputs of the functions F(A) and H are exactly the same. In other words, we look for a function H depending on the variables of the set B1 as well as on the output δ of a function G depending on the set B2.

Table 1. An example decision table A = (U, A ∪ D)

 u | a0 a1 a2 a3 a4 a5 | d
 0 |  0  0  0  0  0  0 | 1
 1 |  0  0  1  1  0  0 | 1
 2 |  1  2  2  0  1  1 | 2
 3 |  0  1  1  0  0  1 | 2
 4 |  0  1  0  2  0  1 | 3
 5 |  1  2  2  3  2  0 | 2
 6 |  1  2  2  2  0  1 | 1
 7 |  0  0  1  1  0  1 | 3
 8 |  0  1  0  3  2  0 | 4
 9 |  2  2  2  3  2  0 | 4

A partition-based representation of data tables can be used to describe decomposition algorithms [6, 3].

Theorem 1. ([3]) Functional dependencies G and H represent a hierarchical decomposition of the function F(A) = H(B1, G(B2)) if there exists an r-partition Π_G ≥ Π(B2) such that:

Π(B1) · Π_G ≤ Π(D)    (5)

where all the r-partitions are over the set of objects, and the number of values of component G is equal to L(Π_G), where L(Π) denotes the number of blocks of the r-partition Π.

In Theorem 1, the r-partition Π_G represents component G, and the product of the r-partitions Π(B1) and Π_G corresponds to H. The decision tables of the resulting components can easily be obtained from these r-partitions.
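As a concrete illustration of Theorem 1, the following Python sketch (ours; it assumes a consistent table without "do not care" values, so r-partitions reduce to ordinary partitions) encodes Table 1 and checks condition (5) for the partition Π_G that will be derived below by graph coloring.

```python
# Sketch: checking condition (5) of Theorem 1 on the data of Table 1.
# Assumes no "do not care" values, so r-partitions are ordinary partitions.
from itertools import product

ROWS = [  # columns: a0..a5, d (Table 1)
    (0,0,0,0,0,0, 1), (0,0,1,1,0,0, 1), (1,2,2,0,1,1, 2), (0,1,1,0,0,1, 2),
    (0,1,0,2,0,1, 3), (1,2,2,3,2,0, 2), (1,2,2,2,0,1, 1), (0,0,1,1,0,1, 3),
    (0,1,0,3,2,0, 4), (2,2,2,3,2,0, 4),
]

def partition(cols):
    """Blocks of Π(B): objects grouped by their values on the given columns."""
    blocks = {}
    for u, row in enumerate(ROWS):
        blocks.setdefault(tuple(row[c] for c in cols), set()).add(u)
    return list(blocks.values())

def prod(p1, p2):
    """Partition product Π1 · Π2: non-empty pairwise block intersections."""
    return [b1 & b2 for b1, b2 in product(p1, p2) if b1 & b2]

def le_decision(p):
    """Π ≤ Π(D): every block carries a single decision value."""
    return all(len({ROWS[u][6] for u in block}) == 1 for block in p)

pi_B1 = partition([0, 5])              # Π(B1) for B1 = {a0, a5}
pi_G  = [{0,1,2,4,7}, {3,5,6,8,9}]     # a two-block merging of Π(B2) blocks
print(le_decision(prod(pi_B1, pi_G)))  # -> True, so condition (5) holds
```

Since Π_G has L(Π_G) = 2 blocks, the intermediate attribute δ produced by component G needs only two values.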
According to Theorem 1, the main problem is to find a partition Π_G. To solve it, it is appropriate to consider a subset B2 of the original attributes and the m-block partition Π(B2) = {K1, K2, ..., Km} generated by that subset. Two blocks Ki, Kj of the partition Π(B2) are compatible if and only if the partition Π′ obtained from Π(B2) by joining the blocks Ki and Kj into a single block Kij (without changing the other ones) still satisfies equation (5), i.e. iff Π(B1) · Π′ ≤ Π(D). Otherwise, Ki and Kj are incompatible. For the decision table from Table 1 and the sets of attributes B1 = {a0, a5}, B2 = {a1, a2, a3, a4}, the following set of incompatible pairs is found: E = {(K1, K8), (K2, K4), (K2, K8), (K3, K7), (K4, K5)}.

A subset of n partition blocks {K_i1, K_i2, ..., K_in}, where K_ij ∈ Π(B2), is a compatibility class of the blocks of Π(B2) iff all blocks of that subset are pairwise compatible. A compatibility class is referred to as a maximal compatibility class (MCC) iff it is not contained in any other compatibility class of the partition concerned.

Fig. 1. Incompatibility graph and its coloring (vertices K1, ..., K8; figure omitted)

The decomposition process may be interpreted in terms of an incompatibility graph (Fig. 1). The edges represent the incompatible pairs of the partition Π(B2): (K1, K8), (K2, K4), (K2, K8), (K3, K7), (K4, K5). A proper coloring of the graph specifies the compatibility classes {K1, K2, K3, K5} and {K4, K6, K7, K8} and, as a consequence, the partition Π_G = {0, 1, 2, 4, 7; 3, 5, 6, 8, 9}.

Another approach to building the incompatibility graph is to create a labeled partition matrix [7, 8] (Table 2). Its columns represent the combinations of the values of the attributes in B2, i.e. the blocks of Π(B2), while its rows correspond to the value combinations of the attributes in B1. Each column thus describes the behavior of the decision table when the values of the attributes in B2 are fixed, so each column can be treated as an object of a decision table. To build the incompatibility graph, it suffices to apply equation (2) to each pair of columns: when the compatibility relation is met, the pair is compatible; otherwise it is incompatible.

Table 2. Partition matrix for the example set from Table 1 and the sets of attributes B1 = {a0, a5} and B2 = {a1, a2, a3, a4}

          K1  K2  K3  K4  K5  K6  K7  K8
   a1      0   0   2   1   1   2   2   1
   a2      0   1   2   1   0   2   2   0
   a3      0   1   0   0   2   3   2   3
   a4      0   0   1   0   0   2   0   2
 a0 a5
  0  0     1   1   ∗   ∗   ∗   ∗   ∗   4
  1  1     ∗   ∗   2   ∗   ∗   ∗   1   ∗
  0  1     ∗   3   ∗   2   3   ∗   ∗   ∗
  1  0     ∗   ∗   ∗   ∗   ∗   2   ∗   ∗
  2  0     ∗   ∗   ∗   ∗   ∗   4   ∗   ∗

2.3 Attribute Selection Criteria for Further Decomposition

Simple hierarchical decomposition requires dividing the set of conditional attributes A into two disjoint subsets B1 and B2. The proposed way of obtaining these sets is based on the attribute relationship called attribute dependency in rough set theory [9]. Let C and B be sets of attributes. B depends entirely on a set of attributes C, denoted C ⇒ B, if all values of attributes from B are uniquely determined by the values of attributes from C. If B depends on C in degree k, 0 ≤ k ≤ 1, then:

k = γ(C, B) = |POS_C(B)| / |U|    (6)

where

POS_C(B) = ∪_{X ∈ U/B} C_∗(X)    (7)

with C_∗(X) denoting the C-lower approximation of X, is called the positive region of the partition U/B with respect to C: the set of all elements of U that can be uniquely classified into blocks of the partition U/B by means of C.
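A minimal Python sketch of the dependency degree γ(C, B) of equations (6) and (7) (ours; it reuses the ROWS list from the earlier sketch and assumes consistent, fully specified data):

```python
# Sketch: dependency degree gamma(C, B) from equations (6)-(7).
# Reuses ROWS from the earlier sketch; column 6 holds the decision d.

def ind_classes(cols):
    """Blocks of the indiscernibility relation IND(cols) over ROWS."""
    blocks = {}
    for u, row in enumerate(ROWS):
        blocks.setdefault(tuple(row[c] for c in cols), set()).add(u)
    return list(blocks.values())

def gamma(C, B):
    """gamma(C, B) = |POS_C(B)| / |U| (equation (6))."""
    u_mod_b = ind_classes(B)
    pos = set()
    for c_block in ind_classes(C):
        # A block of IND(C) lies in the lower approximation of some X in U/B
        # iff it is wholly contained in a single block of U/B.
        if any(c_block <= x for x in u_mod_b):
            pos |= c_block
    return len(pos) / len(ROWS)

# Dependency of the decision d on the attribute pair {a0, a3}:
print(gamma(C=[0, 3], B=[6]))  # -> 0.6, the (a0, a3) entry of Table 3
```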
The proposed method measures the dependency between each possible pair of conditional attributes and the decision attribute. The related dependency of a single conditional attribute can be computed for a given information system A = (U, A ∪ {d}), where A = {a0, ..., ak} is the set of conditional attributes and d is the decision attribute:

r(x) = ( Σ_{i=0}^{k} γ({x, a_i}, {d}) ) / |A|,  x ∈ A    (8)

The related dependency function is used for the comparison of attributes. It is calculated for each attribute, and the results are sorted by the value of r. The most dependent attributes are put into the set B1, which corresponds to the final decision table H.

Example 1. For Table 1, the first step of the algorithm is to build a matrix of attribute dependency between each pair of conditional attributes and the decision attribute. Then the mean of the partial results is calculated; it is represented by the related dependency r(x) in Table 3. These results can be sorted by value and divided into two equinumerous sets. If the number of attributes is odd, then |B1| = |B2| + 1. An example of sorting and assigning the attributes is presented in Table 4. The calculation of the related dependency r(x) yields the sets B1 = {a1, a3, a5} and B2 = {a0, a2, a4}. Therefore the decomposition is as follows: F(A) = H({a1, a3, a5}, G({a0, a2, a4})).

Table 3. Matrix of attribute dependency

        a0     a1     a2     a3     a4     a5
 a0     0      0.1    0.1    0.6    0.5    0.2
 a1     0.1    0      0.2    0.6    0.3    0.4
 a2     0.1    0.2    0      0.6    0.3    0.2
 a3     0.6    0.6    0.6    0      0.1    0.5
 a4     0.5    0.3    0.3    0.1    0      0.3
 a5     0.2    0.4    0.2    0.5    0.3    0
 r(x)   0.250  0.267  0.233  0.400  0.250  0.267

Table 4. Attributes for partitions G(B2) and H(B1)

 a2   0.233   B2
 a0   0.250   B2
 a4   0.250   B2
 a5   0.267   B1
 a1   0.267   B1
 a3   0.400   B1

3 Classification Schema

3.1 Hierarchical Decision Making

Owing to the decomposition of decision tables, a hierarchical decision system is needed to evaluate the method for the purpose of classification [6]. The method is based on disassembling the decision table into subtables. Its most important advantage is the possibility of inducing a simpler classification model, for example shorter decision rules or a smaller decision tree, for the resulting components, which finally make the same decision as the original decision table. Following the process of decomposition, we propose to take decisions hierarchically. For the attributes in B2, a prediction model calculating the intermediate decision is built. This intermediate decision is then used together with the attributes from B1 to build the final classification model. On the basis of both, i.e. these attributes and the intermediate decision δ, the final decision is taken (Fig. 2).

Fig. 2. Two-step implementation of the decision table (component G maps the attributes B2 to the intermediate decision, which component H combines with the attributes B1 to produce the final decision; figure omitted)

3.2 Dynamic Programming Classifiers

For decision prediction, an approach based on extensions of dynamic programming was used. These methods were developed in [5]. They allow sequential optimization of decision trees and rules relative to different cost functions, in particular the number of misclassifications, the depth of decision trees, and the length of decision rules. The algorithm constructs a directed acyclic graph (DAG) that represents the structure of subtables of the initial table. For a decision table A, separable subtables of A, described by systems of equalities of the kind a_i = b, where a_i is an attribute and b is its value, are considered as subproblems.
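To give a feel for this construction, here is a heavily simplified Python sketch (ours; the real Dagger system is far more elaborate) that enumerates the separable subtables of the example table. A subtable is identified by the set of objects it contains, so a subproblem reached along different chains of equalities a_i = b is stored only once, which is what makes the structure a DAG rather than a tree.

```python
# Sketch: the DAG of separable subtables (simplified illustration).
# Reuses ROWS from the earlier sketches; column 6 holds the decision d.

def build_dag(n_attrs=6, dec_col=6):
    nodes, edges = set(), []
    stack = [frozenset(range(len(ROWS)))]       # start from the full table
    while stack:
        node = stack.pop()
        if node in nodes or len({ROWS[u][dec_col] for u in node}) <= 1:
            continue                 # already expanded, or constant decision
        nodes.add(node)
        for a in range(n_attrs):     # branch on each equality a_i = b
            for b in {ROWS[u][a] for u in node}:
                child = frozenset(u for u in node if ROWS[u][a] == b)
                if child != node:    # skip splits that separate nothing
                    edges.append((node, (a, b), child))
                    stack.append(child)
    return nodes, edges

nodes, edges = build_dag()
print(len(nodes), len(edges))  # distinct non-degenerate subtables and edges
```

Dynamic programming then works bottom-up over this shared structure, so each subtable is optimized once even though it may occur in many trees or rules.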
Classification and optimization of decision trees and rules are discussed in detail in [5, 10]. In the applied approach to the optimization of decision trees, the directed acyclic graph (DAG) represents a set of CART-like decision trees [11], and the set of Pareto optimal points for the bi-criteria optimization problem is constructed. Two types of decision tree pruning have been compared. The first is multi-pruning, in which, using a validation subtable (part of the training subtable), a decision tree with the minimum number of misclassifications is found among the Pareto optimal points. The second, an improvement of multi-pruning, uses only the best split over a small number of attributes in each node of the DAG, instead of the best split over all attributes; this pruning is called restricted multi-pruning.

A system of decision rules was also considered as a prediction model. As in the case of decision trees, we used a dynamic programming algorithm to create and optimize the decision rules [10].

4 Experiments

To evaluate the proposed decomposition algorithm and the hierarchical decision making idea, the Dagger software system created at King Abdullah University of Science and Technology was used. The proposed algorithm was tested on categorical data sets from the UCI ML Repository [12]. The data sets were preprocessed. Duplicate rows were removed. Some data sets contained inconsistencies, i.e., instances with the same values of conditional attributes but different decisions; each such set of instances was replaced with a single row carrying the most common decision. Results were obtained using two-fold cross-validation repeated 100 times, each time with a different randomly selected testing subset. From the training part, 70% of the rows were used to generate decision trees and the remaining part was reserved for validation.

Table 5. Compression results, where S is the size of the original decision table and S_D is the sum of the sizes of the tables after decomposition (H and G)

 data set       rows  attributes  compression (S_D/S)
 flags           196      27            0.801
 house           281      17            0.395
 kr-vs-kp       3198      37            0.209
 breast cancer   268      10            0.754
 cars           1730       7            0.261
 spect-test      169      23            0.751
 dermatology     366      35            0.352

The advantage of decomposition is due to the fact that the two components (i.e. the tables G and H) require less memory than the original decision table. Let us express the size of a table as S = n Σ_i b_i, where n is the number of objects and b_i = ⌈log₂ |V_{a_i}|⌉ is the number of bits required to represent attribute a_i. After the decomposition, we may then compare the total size of the components with that of the original table (prior to decomposition). The results of compression are presented in Table 5.
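A one-function Python sketch of this size measure (ours; applied here to the example table rather than the UCI data sets, and reusing ROWS from the earlier sketches):

```python
# Sketch: table size S = n * sum_i ceil(log2 |V_{a_i}|), in bits.
from math import ceil, log2

def table_size(rows, cols):
    """Size in bits of the table restricted to the given columns."""
    bits = sum(ceil(log2(len({row[c] for row in rows}))) for c in cols)
    return len(rows) * bits

# Size of the example table (conditional attributes a0..a5 plus d):
print(table_size(ROWS, cols=range(7)))  # -> 130 bits
```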
Table 6. Comparison of accuracy error with and without decomposition

 data set       rows  attributes  decomp.  dp rules         dp tree MP       dp tree RMP
 flags           196      27      no       0.402 ± 0.041    0.409 ± 0.047    0.399 ± 0.041
                                  yes      0.402 ± 0.041    0.404 ± 0.038    0.417 ± 0.045
 house           281      17      no       0.092 ± 0.013    0.064 ± 0.008    0.067 ± 0.009
                                  yes      0.076 ± 0.045    0.080 ± 0.047    0.073 ± 0.031
 kr-vs-kp       3198      37      no       0.019 ± 0.003    0.015 ± 0.004    0.062 ± 0.003
                                  yes      0.061 ± 0.015    0.064 ± 0.016    0.011 ± 0.015
 breast cancer   268      10      no       0.287 ± 0.022    0.302 ± 0.023    0.304 ± 0.026
                                  yes      0.274 ± 0.024    0.295 ± 0.020    0.297 ± 0.029
 cars           1730       7      no       0.070 ± 0.009    0.062 ± 0.010    0.068 ± 0.004
                                  yes      0.122 ± 0.013    0.159 ± 0.022    0.112 ± 0.018
 spect-test      169      23      no       0.048 ± 0.025    0.048 ± 0.019    0.047 ± 0.000
                                  yes      0.056 ± 0.022    0.051 ± 0.016    0.049 ± 0.005
 dermatology     366      35      no       0.229 ± 0.028    0.219 ± 0.028    0.221 ± 0.022
                                  yes      0.212 ± 0.030    0.226 ± 0.030    0.228 ± 0.027

Table 6 presents the experimental results of classification using the different dynamic programming algorithms. For each data set and method, the accuracy error and its standard deviation were calculated. As can be seen, decomposition influences accuracy. For two data sets, kr-vs-kp and cars, the accuracy error increased noticeably after decomposition; note that the compression for those data sets was the most efficient. However, for most of the measurements, accuracy remains at a comparable level or improves slightly, with the biggest improvements occurring when dynamic programming rules were used.

5 Conclusion

Effective data aggregation algorithms have long been sought after due to the increasing complexity of the databases used in practice. Recently, it has been suggested that decomposition algorithms, previously used mainly in the logic synthesis of digital systems, may be applied for that purpose [13]. This approach is indeed very relevant, as decision systems and logic circuits are very similar. Bearing this in mind, this paper demonstrates that a typical algorithm for the decomposition of binary data tables (representing Boolean functions) may be applied to the decomposition of data represented by the multi-valued attributes used in decision systems.

The paper indicates the advantages and possibilities of decomposition algorithms for the purpose of classification. Results of experiments performed with the proposed decomposition algorithm and the Dagger system have been presented. A new attribute selection criterion describing partitions for decomposition has been introduced and used in the experiments. The proposed method is particularly efficient in data compression: it allows building a simple classification model and saving memory while keeping the accuracy. To achieve better accuracy, decision table decomposition requires further research, particularly on the attribute selection criteria. There is also a need to extend the decomposition to deal with continuous attributes and with noise in the data.

Acknowledgments. The authors would like to thank Professor Mikhail Moshkov and his team for their support while writing this paper. This research has been supported by King Abdullah University of Science and Technology.

References

1. Perkowski, M.A., Luba, T., Grygiel, S., Burkey, P., Burns, M., Iliev, N., Kolsteren, M., Lisanke, R., Malvi, R., Wang, Z., et al.: Unified approach to functional decompositions of switching functions. PSU Electr. Engn. Dept. Report (1995)
2. Łuba, T., Lasocki, R., Rybnik, J.: An implementation of decomposition algorithm and its application in information systems analysis and logic synthesis. In Ziarko, W., ed.: Rough Sets, Fuzzy Sets and Knowledge Discovery. Workshops in Computing. Springer London (1994) 458–465
3. Luba, T., Lasocki, R.: On unknown attribute values in functional dependencies. In: Proceedings of the International Workshop on Rough Sets and Soft Computing, San Jose, CA (1994) 490–497
4. Rokach, L., Maimon, O.: Data mining using decomposition methods. In Maimon, O., Rokach, L., eds.: Data Mining and Knowledge Discovery Handbook. Springer US (2010) 981–998
5. Alkhalid, A., Amin, T., Chikalov, I., Hussain, S., Moshkov, M., Zielosko, B.: Optimization and analysis of decision trees and rules: dynamic programming approach. International Journal of General Systems 42(6) (2013) 614–634
6. Borowik, G.: Data mining approach for decision and classification systems using logic synthesis algorithms. Volume 6. Springer International Publishing (2014) 3–23
7. Zupan, B.: Machine learning based on function decomposition. Technical report, University of Ljubljana (1997)
8. Zupan, B., Bohanec, M.: Experimental evaluation of three partition selection criteria for decision table decomposition. Informatica (Slovenia) 22(2) (1998)
9. Pawlak, Z.: Rough sets. International Journal of Computer & Information Sciences 11(5) (1982) 341–356
10. Amin, T., Chikalov, I., Moshkov, M., Zielosko, B.: Dynamic programming approach to optimization of approximate decision rules. Information Sciences 221 (2013) 403–418
11. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press (1984)
12. Lichman, M.: UCI Machine Learning Repository (2013)
13. Maimon, O., Rokach, L.: Decomposition methodology for knowledge discovery and data mining. In Maimon, O., Rokach, L., eds.: Data Mining and Knowledge Discovery Handbook. Springer US (2005) 981–1003