Reduct Calculation and Discretization of Numeric Attributes in Sparse Decision Systems

Wojciech Swieboda and Hung Son Nguyen
Institute of Mathematics, The University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

Abstract. In this paper we discuss three problems in Data Mining of Sparse Decision Systems: the problem of short reduct calculation, discretization of numerical attributes, and rule induction. We present algorithms that provide approximate solutions to these problems and analyze the complexity of these algorithms.

1 Introduction

In this paper we discuss algorithms for Data Mining [3] of Sparse Decision Tables. We first review basic notions of Information Systems, Decision Systems and Rough Set Theory [9]. We then introduce a convenient representation for sparse decision tables and finally discuss algorithms for short reduct calculation, discretization and rule induction.

2 Rough Set Preliminaries

An information system is a pair I = (U, A), where U denotes the universe of objects and A is the set of attributes. An attribute a ∈ A is a mapping a : U → Va. The co-domain Va of attribute a is often also called the value set of attribute a.

A decision system is a pair D = (U, A ∪ {dec}), i.e. an information system with a distinguished attribute dec : U → {1, . . . , d} called the decision attribute. Attributes in A are called conditions or conditional attributes and may be either nominal or numeric (i.e. with Va ⊆ R). Throughout this paper n denotes the number of objects in a decision system and k denotes the number of conditional attributes.

3 Sparse Data Sets and Decision Systems

In many situations a convenient way to represent a data set is the Entity-Attribute-Value (EAV) model [11], which encodes observations in terms of triples. For an information system I = (U, A), the set of triples is {(u, a, v) : a(u) = v}. This representation is especially handy for information systems with numerous attributes or with missing or default values.
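As an illustration, the triple encoding can be sketched as follows (a minimal sketch in Python; the dictionary-based layout, the function names and the `DEFAULTS` mapping are ours, not taken from the paper):

```python
# Sketch of an EAV (Entity-Attribute-Value) encoding of a decision system.
# Data layout and names are illustrative only; the sample rows follow Table 1.

DEFAULTS = {"Diploma": "MBA", "Experience": "Low",
            "French": "Yes", "Reference": "Excellent"}

# Dense decision table: entity -> (conditional attributes, decision).
TABLE = {
    "x1": ({"Diploma": "MBA", "Experience": "Medium", "French": "Yes",
            "Reference": "Excellent"}, "Accept"),
    "x2": ({"Diploma": "MBA", "Experience": "Low", "French": "Yes",
            "Reference": "Neutral"}, "Reject"),
}

def to_eav(table, defaults):
    """Encode conditions as (entity, attribute, value) triples, omitting
    values equal to the attribute's default; decisions (assumed never
    missing) are kept in a separate vector."""
    triples, decisions = [], {}
    for entity, (conditions, dec) in table.items():
        decisions[entity] = dec
        for attr, value in conditions.items():
            if value != defaults[attr]:
                triples.append((entity, attr, value))
    return triples, decisions

def value_of(triples, defaults, entity, attr):
    """Recover a(u): look up the stored triple, else fall back to the default."""
    for e, a, v in triples:
        if e == entity and a == attr:
            return v
    return defaults[attr]

triples, decisions = to_eav(TABLE, DEFAULTS)
# Only the non-default entries survive: x1 stores Experience=Medium,
# x2 stores Reference=Neutral -- the compression discussed below.
```

A practical implementation would index the triples by entity and attribute rather than scan them linearly; the scan is kept here only for brevity.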
Instances with missing and default values are not included in the EAV representation, which results in compression of the data set. In this paper we deal only with default values; their interpretation/semantics is the same as that of any other attribute value. In practice we store triples corresponding to numeric attributes and to symbolic attributes in two separate tables, and store the decisions of objects (which we assume are never missing) in a separate vector.

Table 1. A typical decision system with symbolic attributes represented as a table. Attributes Diploma, Experience, French and Reference are conditions, whereas Decision is the decision attribute. All conditional attributes in this decision system are nominal.

         Diploma  Experience  French  Reference  Decision
    x1   MBA      Medium      Yes     Excellent  Accept
    x2   MBA      Low         Yes     Neutral    Reject
    x3   MCE      Low         Yes     Good       Reject
    x4   MSc      High        Yes     Neutral    Accept
    x5   MSc      Medium      Yes     Neutral    Reject
    x6   MSc      High        Yes     Excellent  Accept
    x7   MBA      High        No      Good       Accept
    x8   MCE      Low         No      Excellent  Reject

Table 2. A decision system in which all conditional attributes are numeric.

         a1   a2   a3   Decision
    x1   0    1.3  0    F
    x2   3.3  0.9  0    F
    x3   0    1.5  0    F
    x4   0    1.2  2.5  F
    x5   0    1.3  3.6  F
    x6   3.7  2.7  2.4  T
    x7   4.1  1.0  2.8  T

Another related representation, more general than the EAV model, is Subject-Predicate-Object (SPO), which is used e.g. in the Resource Description Framework (RDF) model and implemented in several triplestore databases.

4 Problems for Sparse Decision Systems

In our paper we address the following problems for Sparse Decision Systems:

1. Finding a short reduct or a superreduct [1]. A reduct is a subset of attributes R ⊆ A which guarantees discernibility of objects belonging to different decision classes.
2. Discretization of numerical attributes [6]. Discretization of a decision system means determining a set of cuts on numerical attributes so that the induced partitions (i.e. intervals between cut points) guarantee discernibility of objects belonging to different decision classes.
3.
Generating a set of rules or dynamic rules [1].

Table 3. EAV representation of the decision system in Table 1. The default values (omitted in this representation) for consecutive attributes are 'MBA', 'Low', 'Yes' and 'Excellent'.

    Entity  Attribute  Value           Entity  Decision
    x1      a2         Medium          x1      Accept
    x2      a4         Neutral         x2      Reject
    x3      a1         MCE             x3      Reject
    x3      a4         Good            x4      Accept
    x4      a1         MSc             x5      Reject
    x4      a2         High            x6      Accept
    x4      a4         Neutral         x7      Accept
    x5      a1         MSc             x8      Reject
    x5      a2         Medium
    x5      a4         Neutral
    x6      a1         MSc
    x6      a2         High
    x7      a2         High
    x7      a3         No
    x7      a4         Good
    x8      a1         MCE
    x8      a3         No

Table 4. EAV representation of the decision system in Table 2. The default value (omitted in this representation) for each attribute is 0.

    Entity  Attribute  Value           Entity  Decision
    x1      a2         1.3             x1      F
    x2      a1         3.3             x2      F
    x2      a2         0.9             x3      F
    x3      a2         1.5             x4      F
    x4      a2         1.2             x5      F
    x4      a3         2.5             x6      T
    x5      a2         1.3             x7      T
    x5      a3         3.6
    x6      a1         3.7
    x6      a2         2.7
    x6      a3         2.4
    x7      a1         4.1
    x7      a2         1.0
    x7      a3         2.8

Table 5. A discretized version of the decision system presented in Table 2.

         a1          a2           a3          Decision
    x1   (−∞, +∞)    (1.25, +∞)   (−∞, 1.2]   F
    x2   (−∞, +∞)    (−∞, 1.1]    (−∞, 1.2]   F
    x3   (−∞, +∞)    (1.25, +∞)   (−∞, 1.2]   F
    x4   (−∞, +∞)    (1.1, 1.25]  (1.2, +∞)   F
    x5   (−∞, +∞)    (1.25, +∞)   (1.2, +∞)   F
    x6   (−∞, +∞)    (1.25, +∞)   (1.2, +∞)   T
    x7   (−∞, +∞)    (−∞, 1.1]    (1.2, +∞)   T

References

1. Bazan, J.G., Nguyen, H.S., Nguyen, S.H., Synak, P., Wróblewski, J.: Rough set algorithms in classification problem, pp. 49–88 (2000)
2. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley, New York (2001)
3. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press (2001), http://mitpress.mit.edu/026208290X
4. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York (2001)
5. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial (1998)
6. Nguyen, H.S.: Discretization problem for rough sets methods. In: Polkowski and Skowron [10], pp.
545–552, http://dx.doi.org/10.1007/3-540-69115-4_75
7. Nguyen, H.S.: Approximate Boolean reasoning: Foundations and applications in data mining (2006)
8. Pawlak, Z.: Rough sets. International Journal of Information and Computer Sciences 11(5), 341–356 (1982)
9. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston, Dordrecht, London (1991)
10. Polkowski, L., Skowron, A. (eds.): Rough Sets and Current Trends in Computing, First International Conference, RSCTC'98, Warsaw, Poland, June 22–26, 1998, Proceedings. Lecture Notes in Computer Science, vol. 1424. Springer (1998)
11. Stead, W.W., Hammond, W.E., Straube, M.J.: A chartless record – is it adequate? Journal of Medical Systems 7, 103–109 (1983), http://dx.doi.org/10.1007/BF00995117