Reduct Calculation and Discretization of Numeric Attributes in Sparse Decision Systems

Wojciech Swieboda and Hung Son Nguyen
Institute of Mathematics, The University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

Abstract. In this paper we discuss three problems in Data Mining of Sparse Decision Systems: the problem of short reduct calculation, discretization of numerical attributes, and rule induction. We present algorithms that provide approximate solutions to these problems and analyze the complexity of these algorithms.

1 Introduction

In this paper we discuss algorithms for Data Mining [3] of Sparse Decision Tables. We first review basic notions of Information Systems, Decision Systems and Rough Set Theory [9]. We then introduce a convenient representation for sparse decision tables and finally discuss algorithms for short reduct calculation, discretization and rule induction.

2 Rough Set Preliminaries

An information system is a pair I = (U, A), where U denotes the universe of objects and A is the set of attributes. An attribute a ∈ A is a mapping a : U → Va. The co-domain Va of attribute a is often also called the value set of attribute a.

A decision system is a pair D = (U, A ∪ {dec}), i.e. an information system with a distinguished attribute dec : U → {1, . . . , d} called the decision attribute. Attributes in A are called conditions or conditional attributes and may be either nominal or numeric (i.e. with Va ⊆ R). Throughout this paper n denotes the number of objects in a decision system and k denotes the number of conditional attributes.

3 Sparse Data Sets and Decision Systems

In many situations a convenient way to represent a data set is the Entity-Attribute-Value (EAV) model [11], which encodes observations in terms of triples. For an information system I = (U, A), the set of triples is {(u, a, v) : a(u) = v}. This representation is especially handy for information systems with numerous attributes or with missing or default values.
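As an illustration, the triple encoding can be sketched as follows (a minimal sketch in Python; the dictionary-based layout, the function names and the `DEFAULTS` mapping are ours, not taken from the paper):

```python
# Sketch of an EAV (Entity-Attribute-Value) encoding of a decision system.
# Data layout and names are illustrative only; the sample rows follow Table 1.

DEFAULTS = {"Diploma": "MBA", "Experience": "Low",
            "French": "Yes", "Reference": "Excellent"}

# Dense decision table: entity -> (conditional attributes, decision).
TABLE = {
    "x1": ({"Diploma": "MBA", "Experience": "Medium", "French": "Yes",
            "Reference": "Excellent"}, "Accept"),
    "x2": ({"Diploma": "MBA", "Experience": "Low", "French": "Yes",
            "Reference": "Neutral"}, "Reject"),
}

def to_eav(table, defaults):
    """Encode conditions as (entity, attribute, value) triples, omitting
    values equal to the attribute's default; decisions (assumed never
    missing) are kept in a separate vector."""
    triples, decisions = [], {}
    for entity, (conditions, dec) in table.items():
        decisions[entity] = dec
        for attr, value in conditions.items():
            if value != defaults[attr]:
                triples.append((entity, attr, value))
    return triples, decisions

def value_of(triples, defaults, entity, attr):
    """Recover a(u): look up the stored triple, else fall back to the default."""
    for e, a, v in triples:
        if e == entity and a == attr:
            return v
    return defaults[attr]

triples, decisions = to_eav(TABLE, DEFAULTS)
# Only the non-default entries survive: x1 stores Experience=Medium,
# x2 stores Reference=Neutral -- the compression discussed below.
```

A practical implementation would index the triples by entity and attribute rather than scan them linearly; the scan is kept here only for brevity.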
Instances with missing and default values are not included in the EAV representation, which results in compression of the data set. In this paper we deal only with default values; their interpretation/semantics is the same as that of any other attribute value. In practice we store triples corresponding to numeric attributes and to symbolic attributes in two separate tables, and store the decisions of objects (which we assume are never missing) in a separate vector.

Table 1. A typical decision system with symbolic attributes represented as a table. Attributes Diploma, Experience, French and Reference are conditions, whereas Decision is the decision attribute. All conditional attributes in this decision system are nominal.

         Diploma  Experience  French  Reference  Decision
    x1   MBA      Medium      Yes     Excellent  Accept
    x2   MBA      Low         Yes     Neutral    Reject
    x3   MCE      Low         Yes     Good       Reject
    x4   MSc      High        Yes     Neutral    Accept
    x5   MSc      Medium      Yes     Neutral    Reject
    x6   MSc      High        Yes     Excellent  Accept
    x7   MBA      High        No      Good       Accept
    x8   MCE      Low         No      Excellent  Reject

Table 2. A decision system in which all conditional attributes are numeric.

         a1   a2   a3   Decision
    x1   0    1.3  0    F
    x2   3.3  0.9  0    F
    x3   0    1.5  0    F
    x4   0    1.2  2.5  F
    x5   0    1.3  3.6  F
    x6   3.7  2.7  2.4  T
    x7   4.1  1.0  2.8  T

Another related representation, more general than the EAV model, is Subject-Predicate-Object (SPO), which is used e.g. in the Resource Description Framework (RDF) model and implemented in several triplestore databases.

4 Problems for Sparse Decision Systems

In our paper we address the following problems for Sparse Decision Systems:

1. Finding a short reduct or a superreduct [1]. A reduct is a subset of attributes R ⊆ A which guarantees discernibility of objects belonging to different decision classes.
2. Discretization of numerical attributes [6]. Discretization of a decision system means determining a set of cuts on numerical attributes so that the induced partitions (i.e. intervals between cut points) guarantee discernibility of objects belonging to different decision classes.
3.
Generating a set of rules or dynamic rules [1].

Table 3. EAV representation of the decision system in Table 1. The default values (omitted in this representation) for consecutive attributes are 'MBA', 'Low', 'Yes' and 'Excellent'.

    Entity  Attribute  Value           Entity  Decision
    x1      a2         Medium          x1      Accept
    x2      a4         Neutral         x2      Reject
    x3      a1         MCE             x3      Reject
    x3      a4         Good            x4      Accept
    x4      a1         MSc             x5      Reject
    x4      a2         High            x6      Accept
    x4      a4         Neutral         x7      Accept
    x5      a1         MSc             x8      Reject
    x5      a2         Medium
    x5      a4         Neutral
    x6      a1         MSc
    x6      a2         High
    x7      a2         High
    x7      a3         No
    x7      a4         Good
    x8      a1         MCE
    x8      a3         No

Table 4. EAV representation of the decision system in Table 2. The default value (omitted in this representation) for each attribute is 0.

    Entity  Attribute  Value           Entity  Decision
    x1      a2         1.3             x1      F
    x2      a1         3.3             x2      F
    x2      a2         0.9             x3      F
    x3      a2         1.5             x4      F
    x4      a2         1.2             x5      F
    x4      a3         2.5             x6      T
    x5      a2         1.3             x7      T
    x5      a3         3.6
    x6      a1         3.7
    x6      a2         2.7
    x6      a3         2.4
    x7      a1         4.1
    x7      a2         1.0
    x7      a3         2.8

Table 5. A discretized version of the decision system presented in Table 2.

         a1          a2           a3          Decision
    x1   (−∞, +∞)    (1.25, +∞)   (−∞, 1.2]   F
    x2   (−∞, +∞)    (−∞, 1.1]    (−∞, 1.2]   F
    x3   (−∞, +∞)    (1.25, +∞)   (−∞, 1.2]   F
    x4   (−∞, +∞)    (1.1, 1.25]  (1.2, +∞)   F
    x5   (−∞, +∞)    (1.25, +∞)   (1.2, +∞)   F
    x6   (−∞, +∞)    (1.25, +∞)   (1.2, +∞)   T
    x7   (−∞, +∞)    (−∞, 1.1]    (1.2, +∞)   T

References

1. Bazan, J.G., Nguyen, H.S., Nguyen, S.H., Synak, P., Wróblewski, J.: Rough set algorithms in classification problem, pp. 49–88 (2000)
2. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley, New York (2001)
3. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press (2001), http://mitpress.mit.edu/026208290X
4. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York (2001)
5. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial (1998)
6. Nguyen, H.S.: Discretization problem for rough sets methods. In: Polkowski and Skowron [10], pp.
545–552, http://dx.doi.org/10.1007/3-540-69115-4_75
7. Nguyen, H.S.: Approximate Boolean reasoning: Foundations and applications in data mining (2006)
8. Pawlak, Z.: Rough sets. International Journal of Information and Computer Sciences 11(5), 341–356 (1982)
9. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston, Dordrecht, London (1991)
10. Polkowski, L., Skowron, A. (eds.): Rough Sets and Current Trends in Computing, First International Conference, RSCTC'98, Warsaw, Poland, June 22–26, 1998, Proceedings. Lecture Notes in Computer Science, vol. 1424. Springer (1998)
11. Stead, W.W., Hammond, W.E., Straube, M.J.: A chartless record – is it adequate? Journal of Medical Systems 7, 103–109 (1983), http://dx.doi.org/10.1007/BF00995117