Reduct Calculation and Discretization of Numeric Attributes in Sparse Decision Systems

Wojciech Swieboda, Institute of Mathematics, The University of Warsaw

Banacha 2, 02-097 Warsaw, Poland

Hung Son Nguyen, Institute of Mathematics, The University of Warsaw

Banacha 2, 02-097 Warsaw, Poland


In this paper we discuss three problems in mining sparse decision systems: short reduct calculation, discretization of numerical attributes, and rule induction. We present algorithms that provide approximate solutions to these problems and analyze their complexity.

Introduction

In this paper we discuss algorithms for data mining [3] in sparse decision tables. We first review basic notions of information systems, decision systems and rough set theory [9]. We then introduce a convenient representation for sparse decision tables, and finally discuss algorithms for short reduct calculation, discretization and rule induction.

Rough Set Preliminaries

An information system is a pair I = (U, A), where U denotes the universe of objects and A is the set of attributes. An attribute a ∈ A is a mapping a : U → V_a. The co-domain V_a of attribute a is often also called the value set of attribute a.

A decision system is a pair D = (U, A ∪ {dec}), which is an information system with a distinguished attribute dec : U → {1, ..., d} called a decision attribute. Attributes in A are called conditions or conditional attributes and may be either nominal or numeric (i.e. with V_a ⊆ ℝ).

Throughout this paper n will denote the number of objects in a decision system and k will denote the number of conditional attributes.
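To make these definitions concrete, the following minimal sketch encodes the nominal decision system of Table 1 with plain Python dictionaries. The names `ROWS`, `CONDITION_NAMES`, `conditions`, `dec` and `observed_values` are illustrative assumptions and do not come from the paper; `observed_values` returns only the values an attribute actually takes on U, which is a subset of its value set V_a.

```python
# Rows of Table 1: object -> (Diploma, Experience, French, Reference, Decision).
ROWS = {
    "x1": ("MBA", "Medium", "Yes", "Excellent", "Accept"),
    "x2": ("MBA", "Low", "Yes", "Neutral", "Reject"),
    "x3": ("MCE", "Low", "Yes", "Good", "Reject"),
    "x4": ("MSc", "High", "Yes", "Neutral", "Accept"),
    "x5": ("MSc", "Medium", "Yes", "Neutral", "Reject"),
    "x6": ("MSc", "High", "Yes", "Excellent", "Accept"),
    "x7": ("MBA", "High", "No", "Good", "Accept"),
    "x8": ("MCE", "Low", "No", "Excellent", "Reject"),
}
CONDITION_NAMES = ("Diploma", "Experience", "French", "Reference")

# The universe U of objects.
universe = list(ROWS)

# Each conditional attribute a is stored as a mapping a : U -> V_a.
conditions = {
    name: {u: ROWS[u][i] for u in universe}
    for i, name in enumerate(CONDITION_NAMES)
}

# The decision attribute dec : U -> {Accept, Reject}.
dec = {u: ROWS[u][-1] for u in universe}


def observed_values(attribute):
    """Return the set of values the attribute takes on the universe."""
    return set(attribute.values())


n = len(universe)    # number of objects
k = len(conditions)  # number of conditional attributes

print(n, k)
print(sorted(observed_values(conditions["Diploma"])))
```

Storing every attribute as a full mapping is the dense view of the data; it is convenient for small tables but wasteful when most entries share a default value.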

Sparse Data Sets and Decision Systems

In many situations a convenient way to represent a data set is the Entity-Attribute-Value (EAV) model [11], which encodes observations as triples. For an information system I = (U, A), the set of triples is {(u, a, v) : a(u) = v}. This representation is especially handy for information systems with numerous attributes and with missing or default values. Triples with missing or default values are not included in the EAV representation, which compresses the data set. In this paper we deal only with default values. Their interpretation (semantics) is the same as that of any other attribute value. In practice we store triples corresponding to numeric attributes and to

Problems for Sparse Decision Systems

In our paper we address the following problems for Sparse Decision Systems:

1. Finding a short reduct or a superreduct [1]. A reduct is a subset of attributes R ⊆ A which guarantees discernibility of objects belonging to different decision classes.

2. Discretization of numerical attributes [6]. Discretization of a decision system means determining a set of cuts on numerical attributes so that the induced partitions (i.e. intervals between cutpoints) guarantee discernibility of objects belonging to different decision classes.

3. Generating a set of rules or dynamic rules [1].
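The discernibility condition in the first problem can be checked directly on a small table. The sketch below is a brute-force test over all pairs of objects, applied to the data of Table 1; it is only an illustration of the definition, not the algorithm proposed in this paper, and the function names `is_superreduct` and `is_reduct` are assumptions introduced here. It treats a superreduct as any attribute subset that discerns objects from different decision classes, and a reduct as a superreduct that stops being one when any single attribute is dropped.

```python
from itertools import combinations

ATTRIBUTES = ("Diploma", "Experience", "French", "Reference")

# Rows of Table 1; the last field of each row is the decision.
ROWS = {
    "x1": ("MBA", "Medium", "Yes", "Excellent", "Accept"),
    "x2": ("MBA", "Low", "Yes", "Neutral", "Reject"),
    "x3": ("MCE", "Low", "Yes", "Good", "Reject"),
    "x4": ("MSc", "High", "Yes", "Neutral", "Accept"),
    "x5": ("MSc", "Medium", "Yes", "Neutral", "Reject"),
    "x6": ("MSc", "High", "Yes", "Excellent", "Accept"),
    "x7": ("MBA", "High", "No", "Good", "Accept"),
    "x8": ("MCE", "Low", "No", "Excellent", "Reject"),
}


def is_superreduct(subset):
    """True if every pair of objects with different decisions
    differs on at least one attribute of `subset`."""
    idx = [ATTRIBUTES.index(a) for a in subset]
    for row_u, row_v in combinations(ROWS.values(), 2):
        different_decision = row_u[-1] != row_v[-1]
        same_on_subset = all(row_u[i] == row_v[i] for i in idx)
        if different_decision and same_on_subset:
            return False
    return True


def is_reduct(subset):
    """True if `subset` is a superreduct and dropping any single
    attribute breaks discernibility. Checking single removals is
    enough because every superset of a superreduct is a superreduct."""
    subset = tuple(subset)
    if not is_superreduct(subset):
        return False
    return not any(
        is_superreduct(subset[:i] + subset[i + 1:])
        for i in range(len(subset))
    )


print(is_superreduct(ATTRIBUTES))
print(is_reduct(("Experience", "Reference")))
```

The pairwise loop costs on the order of n² comparisons per tested subset, which is exactly the kind of cost that becomes prohibitive on large sparse tables.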

       a_1         a_2            a_3          Decision
x_1    (−∞, +∞)    (1.25, +∞)     (−∞, 1.2]    F
x_2    (−∞, +∞)    (−∞, 1.1]      (−∞, 1.2]    F
x_3    (−∞, +∞)    (1.25, +∞)     (−∞, 1.2]    F
x_4    (−∞, +∞)    (1.1, 1.25]    (1.2, +∞)    F
x_5    (−∞, +∞)    (1.25, +∞)     (1.2, +∞)    F
x_6    (−∞, +∞)    (1.25, +∞)     (1.2, +∞)    T
x_7    (−∞, +∞)    (−∞, 1.1]      (1.2, +∞)    T

Table 1. A typical decision system with symbolic attributes represented as a table. Attributes Diploma, Experience, French and Reference are conditions, whereas Decision is the decision attribute. All conditional attributes in this decision system are nominal.

       Diploma    Experience    French    Reference    Decision
x_1    MBA        Medium        Yes       Excellent    Accept
x_2    MBA        Low           Yes       Neutral      Reject
x_3    MCE        Low           Yes       Good         Reject
x_4    MSc        High          Yes       Neutral      Accept
x_5    MSc        Medium        Yes       Neutral      Reject
x_6    MSc        High          Yes       Excellent    Accept
x_7    MBA        High          No        Good         Accept
x_8    MCE        Low           No        Excellent    Reject

Table 2. A decision system in which all conditional attributes are numeric.

       a_1    a_2    a_3    Decision
x_1    0      1.3    0      F
x_2    3.3    0.9    0      F
x_3    0      1.5    0      F
x_4    0      1.2    2.5    F
x_5    0      1.3    3.6    F
x_6    3.7    2.7    2.4    T
x_7    4.1    1.0    2.8    T

symbolic attributes in two separate tables, and store decisions (which we assume are never missing) of objects in a separate vector.

Another related representation, more general than the EAV model, is Subject-Predicate-Object (SPO), which is used e.g. in the Resource Description Framework (RDF) model and implemented in several triplestore databases.

Table 3. EAV representation of the decision system in Table 1. The default values (omitted in this representation) for consecutive attributes are 'MBA', 'Low', 'Yes' and 'Excellent'.

Entity    Attribute    Value
x_1       a_2          Medium
x_2       a_4          Neutral
x_3       a_1          MCE
x_3       a_4          Good
x_4       a_1          MSc
x_4       a_2          High
x_4       a_4          Neutral
x_5       a_1          MSc
x_5       a_2          Medium
x_5       a_4          Neutral
x_6       a_1          MSc
x_6       a_2          High
x_7       a_2          High
x_7       a_3          No
x_7       a_4          Good
x_8       a_1          MCE
x_8       a_3          No

Entity    Decision
x_1       Accept
x_2       Reject
x_3       Reject
x_4       Accept
x_5       Reject
x_6       Accept
x_7       Accept
x_8       Reject

Table 4. EAV representation of the decision system in Table 2. The default value (omitted in this representation) for each attribute is 0.

Entity    Attribute    Value
x_1       a_2          1.3
x_2       a_1          3.3
x_2       a_2          0.9
x_3       a_2          1.5
x_4       a_2          1.2
x_4       a_3          2.5
x_5       a_2          1.3
x_5       a_3          3.6
x_6       a_1          3.7
x_6       a_2          2.7
x_6       a_3          2.4
x_7       a_1          4.1
x_7       a_2          1.0
x_7       a_3          2.8

Entity    Decision
x_1       F
x_2       F
x_3       F
x_4       F
x_5       F
x_6       T
x_7       T
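The conversion from a dense numeric table to the EAV form can be sketched as follows, using the values of Table 2 and the default value 0. The helper names `to_eav` and `from_eav` are hypothetical and chosen for this illustration only; decisions are kept in a separate vector, as described in the text.

```python
ATTRIBUTES = ("a1", "a2", "a3")
DEFAULT = 0

# Conditional attribute values of Table 2, one row per object.
ROWS = {
    "x1": (0, 1.3, 0),
    "x2": (3.3, 0.9, 0),
    "x3": (0, 1.5, 0),
    "x4": (0, 1.2, 2.5),
    "x5": (0, 1.3, 3.6),
    "x6": (3.7, 2.7, 2.4),
    "x7": (4.1, 1.0, 2.8),
}

# Decisions are never missing, so they are stored as a separate vector.
DECISIONS = ["F", "F", "F", "F", "F", "T", "T"]


def to_eav(rows, attributes, default):
    """Return (entity, attribute, value) triples, skipping default values."""
    return [
        (u, a, v)
        for u, row in rows.items()
        for a, v in zip(attributes, row)
        if v != default
    ]


def from_eav(triples, entities, attributes, default):
    """Rebuild the dense rows; entries without a triple get the default."""
    stored = {(u, a): v for u, a, v in triples}
    return {
        u: tuple(stored.get((u, a), default) for a in attributes)
        for u in entities
    }


triples = to_eav(ROWS, ATTRIBUTES, DEFAULT)
print(len(triples), "triples for", len(ROWS) * len(ATTRIBUTES), "table cells")

# The list of entities must be passed explicitly: an object whose values
# are all defaults would not appear in any triple.
restored = from_eav(triples, list(ROWS), ATTRIBUTES, DEFAULT)
```

For this small table the saving is modest, but the number of stored triples grows with the number of non-default entries rather than with n · k.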

Table 5. A discretized version of the decision system presented in Table 2.
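Given a set of cuts, mapping each numeric value to its interval is a binary search. The sketch below applies the cuts that can be read off Table 5 (none on a_1; 1.1 and 1.25 on a_2; 1.2 on a_3) to the values of Table 2. It only illustrates how cuts induce intervals and does not show how the cuts are chosen; the names `CUTS`, `interval` and `discretize` are assumptions made for this example.

```python
from bisect import bisect_left
import math

ATTRIBUTES = ("a1", "a2", "a3")

# Conditional attribute values of Table 2.
ROWS = {
    "x1": (0, 1.3, 0),
    "x2": (3.3, 0.9, 0),
    "x3": (0, 1.5, 0),
    "x4": (0, 1.2, 2.5),
    "x5": (0, 1.3, 3.6),
    "x6": (3.7, 2.7, 2.4),
    "x7": (4.1, 1.0, 2.8),
}

# Sorted cut points per attribute, as read off Table 5.
CUTS = {"a1": [], "a2": [1.1, 1.25], "a3": [1.2]}


def interval(value, cuts):
    """Return (low, high) with low < value <= high.

    `cuts` must be sorted in increasing order. Intervals are closed on
    the right, matching the notation (1.1, 1.25] used in Table 5.
    """
    bounds = [-math.inf, *cuts, math.inf]
    i = bisect_left(cuts, value)
    return bounds[i], bounds[i + 1]


def discretize(rows, attributes, cuts):
    """Replace every numeric value by the interval that contains it."""
    return {
        u: tuple(interval(v, cuts[a]) for a, v in zip(attributes, row))
        for u, row in rows.items()
    }


discretized = discretize(ROWS, ATTRIBUTES, CUTS)
for u, row in discretized.items():
    print(u, row)
```

`bisect_left` is used rather than `bisect_right` so that a value equal to a cut point falls into the interval ending at that cut.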

References

1. Bazan, J.G., Nguyen, H.S., Nguyen, S.H., Synak, P., Wróblewski, J.: Rough set algorithms in classification problem (2000)

2. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. 2. edn. Wiley, New York (2001)

3. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press (2001)

4. Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction: with 200 full-color illustrations. Springer-Verlag, New York (2001)

5. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial (1998)

6. Nguyen, H.S.: Discretization problem for rough sets methods. In: Polkowski and Skowron [10]. doi:10.1007/3-540-69115-4_75

7. Nguyen, H.S.: Approximate boolean reasoning: Foundations and applications in data mining (2006)

8. Pawlak, Z.: Rough sets. International Journal of Information and Computer Sciences 11(5) (1982)

9. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston, Dordrecht, London (1991)

10. Polkowski, L., Skowron, A. (eds.): Rough Sets and Current Trends in Computing, First International Conference, RSCTC'98, Warsaw, Poland, June 22-26, 1998, Proceedings. Lecture Notes in Computer Science, vol. 1424. Springer (1998)

11. Stead, W.W., Hammond, W.E., Straube, M.J.: A chartless record - is it adequate? Journal of Medical Systems 7 (1983). doi:10.1007/BF00995117