=Paper=
{{Paper
|id=Vol-1492/Paper_45
|storemode=property
|title=Reduct Calculation and Discretization of Numeric Attributes in Sparse Decision Systems
|pdfUrl=https://ceur-ws.org/Vol-1492/Paper_45.pdf
|volume=Vol-1492
|dblpUrl=https://dblp.org/rec/conf/csp/SwiebodaN15
}}
==Reduct Calculation and Discretization of Numeric Attributes in Sparse Decision Systems==
Wojciech Swieboda and Hung Son Nguyen
Institute of Mathematics, The University of Warsaw,
Banacha 2, 02-097 Warsaw, Poland
Abstract. In this paper we discuss three problems in Data Mining of Sparse Decision Systems: short reduct calculation, discretization of numerical attributes, and rule induction. We present algorithms that provide approximate solutions to these problems and analyze their complexity.
1 Introduction
In this paper we discuss algorithms for Data Mining [3] of Sparse Decision Tables. We first review the basic notions of Information Systems, Decision Systems and Rough Set Theory [9]. We then introduce a convenient representation for sparse decision tables and finally discuss algorithms for short reduct calculation, discretization and rule induction.
2 Rough Set Preliminaries
An information system is a pair I = (U, A) where U denotes the universe of objects and
A is the set of attributes. An attribute a ∈ A is a mapping a : U → Va . The co-domain
Va of attribute a is often also called the value set of attribute a.
A decision system is a pair D = (U, A∪{dec}) which is an information system with
a distinguished attribute dec : U → {1, . . . , d} called a decision attribute. Attributes in
A are called conditions or conditional attributes and may be either nominal or numeric
(i.e. with Va ⊆ R).
Throughout this paper n will denote the number of objects in a decision system and
k will denote the number of conditional attributes.
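To fix notation, the following fragment (a minimal sketch of our own, not code from the paper) encodes the small decision system of Table 2 below as plain Python mappings; the names U, A and dec mirror the symbols introduced above.

```python
# A dense representation of the decision system from Table 2:
# each attribute a in A is a mapping a: U -> V_a, and dec is the decision attribute.
U = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
A = {
    "a1": {"x1": 0.0, "x2": 3.3, "x3": 0.0, "x4": 0.0, "x5": 0.0, "x6": 3.7, "x7": 4.1},
    "a2": {"x1": 1.3, "x2": 0.9, "x3": 1.5, "x4": 1.2, "x5": 1.3, "x6": 2.7, "x7": 1.0},
    "a3": {"x1": 0.0, "x2": 0.0, "x3": 0.0, "x4": 2.5, "x5": 3.6, "x6": 2.4, "x7": 2.8},
}
dec = {"x1": "F", "x2": "F", "x3": "F", "x4": "F", "x5": "F", "x6": "T", "x7": "T"}
```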
3 Sparse Data Sets and Decision Systems
In many situations a convenient way to represent the data set is the Entity-Attribute-Value (EAV) model [11], which encodes observations as triples. For an information system I = (U, A), the set of triples is {(u, a, v) : a(u) = v}. This representation is especially handy for information systems with numerous attributes and with missing or default values. Instances with missing or default values are not included in the EAV representation, which compresses the data set. In this paper we deal only with default values; their interpretation is the same as that of any other attribute value. In practice we store the triples corresponding to numeric attributes and those corresponding to symbolic attributes in two separate tables, and store the decisions of objects (which we assume are never missing) in a separate vector.
Table 1. A typical decision system with symbolic attributes represented as a table. Attributes Diploma, Experience, French and Reference are conditions, whereas Decision is the decision attribute. All conditional attributes in this decision system are nominal
Diploma Experience French Reference Decision
x1 MBA Medium Yes Excellent Accept
x2 MBA Low Yes Neutral Reject
x3 MCE Low Yes Good Reject
x4 MSc High Yes Neutral Accept
x5 MSc Medium Yes Neutral Reject
x6 MSc High Yes Excellent Accept
x7 MBA High No Good Accept
x8 MCE Low No Excellent Reject
Table 2. A decision system in which all conditional attributes are numeric
a1 a2 a3 Decision
x1 0 1.3 0 F
x2 3.3 0.9 0 F
x3 0 1.5 0 F
x4 0 1.2 2.5 F
x5 0 1.3 3.6 F
x6 3.7 2.7 2.4 T
x7 4.1 1.0 2.8 T
Another related representation, more general than the EAV model, is Subject-Predicate-Object (SPO), which is used e.g. in the Resource Description Framework (RDF) model and is implemented in several triplestore databases.
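As an illustration (our own sketch, not code from the paper), the decision system of Table 2 can be stored sparsely as a list of entity-attribute-value triples together with a dense decision vector, exactly as in Table 4 below; a small helper recovers a(u) by falling back to the default value.

```python
# Sparse EAV storage of the decision system from Table 2 (cf. Table 4):
# only non-default (non-zero) entries are stored; decisions are never missing.
DEFAULT = 0.0
triples = [
    ("x1", "a2", 1.3),
    ("x2", "a1", 3.3), ("x2", "a2", 0.9),
    ("x3", "a2", 1.5),
    ("x4", "a2", 1.2), ("x4", "a3", 2.5),
    ("x5", "a2", 1.3), ("x5", "a3", 3.6),
    ("x6", "a1", 3.7), ("x6", "a2", 2.7), ("x6", "a3", 2.4),
    ("x7", "a1", 4.1), ("x7", "a2", 1.0), ("x7", "a3", 2.8),
]
dec = {"x1": "F", "x2": "F", "x3": "F", "x4": "F", "x5": "F", "x6": "T", "x7": "T"}

def value(triples, entity, attribute, default=DEFAULT):
    """Return a(u) from the sparse representation, using the default when absent."""
    for e, a, v in triples:
        if e == entity and a == attribute:
            return v
    return default
```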
4 Problems for Sparse Decision Systems
In our paper we address the following problems for Sparse Decision Systems:
1. Finding a short reduct or a superreduct [1].
A reduct is a subset of attributes R ⊆ A which guarantees discernibility of objects
belonging to different decision classes (a greedy sketch is given after this list).
2. Discretization of numerical attributes [6].
Discretization of a decision system means determining a set of cuts on numerical
attributes so that the induced partitions (i.e. intervals between cut points) guarantee
discernibility of objects belonging to different decision classes (see the discretization
sketch after this list).
3. Generating a set of rules or dynamic rules [1].
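For problem 1, a standard greedy heuristic (sketched below in Python under our own naming; this is a generic sketch, not the paper's sparse algorithm) repeatedly adds the attribute that discerns the largest number of still-indiscernible pairs of objects from different decision classes; when no conflicting pair remains, the selected set is a superreduct.

```python
from itertools import combinations

def greedy_superreduct(objects, attributes, value, dec):
    """Greedy superreduct on a decision system; 'value(u, a)' returns a(u).
    A generic illustrative sketch, not the sparse algorithm of this paper."""
    # Pairs of objects from different decision classes that must be discerned.
    pairs = [(u, v) for u, v in combinations(objects, 2) if dec[u] != dec[v]]
    reduct, remaining = set(), set(attributes)
    while pairs and remaining:
        # Choose the attribute that discerns the most remaining conflicting pairs.
        best = max(remaining,
                   key=lambda a: sum(value(u, a) != value(v, a) for u, v in pairs))
        reduct.add(best)
        remaining.remove(best)
        pairs = [(u, v) for u, v in pairs if value(u, best) == value(v, best)]
    return reduct
```

With the dense mappings from the earlier sketch one could call, for instance, greedy_superreduct(U, list(A), lambda u, a: A[a][u], dec).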
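For problem 2, the greedy maximal-discernibility style heuristic for choosing cuts [6] can be sketched in the same spirit (again our own illustrative code, not the paper's sparse variant): candidate cuts are midpoints between consecutive attribute values, and the cut separating the most still-conflicting pairs is chosen repeatedly. Applied to the decision system of Table 2, this kind of procedure yields cuts like those shown in Table 5.

```python
from itertools import combinations

def greedy_cuts(objects, attributes, value, dec):
    """Greedy selection of cuts on numeric attributes; returns (attribute, cut) pairs.
    An illustrative sketch of an MD-style heuristic, not the paper's algorithm."""
    # Candidate cuts: midpoints between consecutive distinct values of each attribute.
    candidates = []
    for a in attributes:
        vals = sorted({value(u, a) for u in objects})
        candidates += [(a, (x + y) / 2.0) for x, y in zip(vals, vals[1:])]
    pairs = [(u, v) for u, v in combinations(objects, 2) if dec[u] != dec[v]]

    def separates(cut, u, v):
        a, c = cut
        return (value(u, a) <= c) != (value(v, a) <= c)

    cuts = []
    while pairs and candidates:
        best = max(candidates, key=lambda cut: sum(separates(cut, u, v) for u, v in pairs))
        if not any(separates(best, u, v) for u, v in pairs):
            break  # remaining pairs cannot be separated by any candidate cut
        cuts.append(best)
        candidates.remove(best)
        pairs = [(u, v) for u, v in pairs if not separates(best, u, v)]
    return cuts
```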
Table 3. EAV representation of the decision system in Table 1. The default values (omitted in this representation) for consecutive attributes are 'MBA', 'Low', 'Yes' and 'Excellent'
Entity Attribute Value
x1 a2 Medium
x2 a4 Neutral
x3 a1 MCE
x3 a4 Good
x4 a1 MSc
x4 a2 High
x4 a4 Neutral
x5 a1 MSc
x5 a2 Medium
x5 a4 Neutral
x6 a1 MSc
x6 a2 High
x7 a2 High
x7 a3 No
x7 a4 Good
x8 a1 MCE
x8 a3 No
Entity Decision
x1 Accept
x2 Reject
x3 Reject
x4 Accept
x5 Reject
x6 Accept
x7 Accept
x8 Reject
Table 4. EAV representation of the decision system in Table 2. The default value (omitted in this representation) for each attribute is 0
Entity Attribute Value
x1 a2 1.3
x2 a1 3.3
x2 a2 0.9
x3 a2 1.5
x4 a2 1.2
x4 a3 2.5
x5 a2 1.3
x5 a3 3.6
x6 a1 3.7
x6 a2 2.7
x6 a3 2.4
x7 a1 4.1
x7 a2 1.0
x7 a3 2.8
Entity Decision
x1 F
x2 F
x3 F
x4 F
x5 F
x6 T
x7 T
Table 5. A discretized version of the decision system presented in Table 2
a1 a2 a3 Decision
x1 (−∞, +∞) (1.25, +∞) (−∞, 1.2] F
x2 (−∞, +∞) (−∞, 1.1] (−∞, 1.2] F
x3 (−∞, +∞) (1.25, +∞) (−∞, 1.2] F
x4 (−∞, +∞) (1.1, 1.25] (1.2, +∞) F
x5 (−∞, +∞) (1.25, +∞) (1.2, +∞) F
x6 (−∞, +∞) (1.25, +∞) (1.2, +∞) T
x7 (−∞, +∞) (−∞, 1.1] (1.2, +∞) T
References
1. Bazan, J.G., Nguyen, H.S., Nguyen, S.H., Synak, P., Wróblewski, J.: Rough set algorithms in classification problem. pp. 49–88 (2000)
2. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. 2nd edn. Wiley, New York (2001)
3. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press (2001), http://mitpress.mit.edu/026208290X
4. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York (2001)
5. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial (1998)
6. Nguyen, H.S.: Discretization problem for rough sets methods. In: Polkowski and Skowron [10], pp. 545–552, http://dx.doi.org/10.1007/3-540-69115-4_75
7. Nguyen, H.S.: Approximate Boolean reasoning: Foundations and applications in data mining (2006)
8. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11(5), 341–356 (1982)
9. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
10. Polkowski, L., Skowron, A. (eds.): Rough Sets and Current Trends in Computing, First Inter-
national Conference, RSCTC’98, Warsaw, Poland, June 22-26, 1998, Proceedings, Lecture
Notes in Computer Science, vol. 1424. Springer (1998)
11. Stead, W.W., Hammond, W.E., Straube, M.J.: A chartless record – is it adequate? Journal of Medical Systems 7, 103–109 (1983), http://dx.doi.org/10.1007/BF00995117