             An In-Database Rough Set Toolkit

                            Frank Beer and Ulrich Bühler

                          University of Applied Sciences Fulda
                      Leipziger Straße 123, 36037 Fulda, Germany
                  {frank.beer,u.buehler}@informatik.hs-fulda.de



        Abstract. Rough Set Theory is a common methodology for discover-
        ing hidden patterns in data. Most software systems and libraries imple-
        menting methods of that theory originated in the mid 1990s and suffer
        from time-consuming operations or high communication costs. Today,
        on the other hand, there is a perceptible trend towards in-database an-
        alytics allowing on-demand decision support. Since data processing and
        predictive models remain in one system, data movement is eliminated
        and latency is reduced. In this paper we contribute to this trend by
        computing traditional rough sets solely inside relational databases, le-
        veraging the efficient data structures and algorithms provided by those
        systems. We thereby introduce a baseline framework for in-database
        mining supported by Rough Set Theory. It can be utilized immediately
        for common discovery tasks such as feature selection or reasoning under
        uncertainty and, as our experiments indicate, is applicable to most
        conventional databases.

        Keywords: concept approximation, in-database analytics, knowledge
        discovery in databases, relational algebra, relational database systems,
        rough set theory



1     Introduction
Over the past decades, the huge quantities of data accumulating as part of
business operations or scientific research have raised the necessity of managing
and analyzing them effectively. As a result, Rough Set Theory (RST) became
subject to these interdisciplinary areas as a reliable instrument for extracting
hidden knowledge from data. That trend is visible in the variety of rough set-
based software libraries and tools interfacing data from flat files [1–4]. The design
of such libraries and tools, however, suffers when they are applied to real-world
data sets due to resource- and time-consuming file operations. To overcome this
technological drawback, researchers have made efforts to build more scalable
rough set systems by utilizing relational databases, which provide very efficient
structures and algorithms designed to handle huge amounts of information [5–9].
    Copyright © 2015 by the paper's authors. Copying permitted only for private and
    academic purposes. In: R. Bergmann, S. Görg, G. Müller (Eds.): Proceedings of
    the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9.
    October 2015, published at http://ceur-ws.org




However, the exploitation of database technology can be extended further. One
can regard these relational systems as extensible platforms capable of solving
complex mining tasks independently. This design principle has been broadly
established under the term in-database analytics [10]. It provides essential bene-
fits, because hidden knowledge is predominantly stored in relational repositories,
either as transactional data or in warehouses. Thus, pattern extraction can be
applied in a more data-centric fashion: data transports to external mining frame-
works are minimized and processing time can be reduced to a large extent.
Accordingly, one can observe database manufacturers continuously expanding
their engines with analytical models¹ such as association rule mining or data
classification.
     A full integration of rough sets inside relational systems is most favorable
where both processing and data movement are costly. Unfortunately, in-database
processing and related applications are only covered partially in existing RST
literature. Few practical attempts have been made to express the fundamental
concept approximation with existing database operations. In this paper we
concentrate on that gap and present a concrete model to calculate rough sets in-
side relational databases. We redefine the traditional concept approximation and
compute it utilizing extended relational algebra. This model can be translated
into various SQL dialects and thus enriches most conventional database systems.
In line with ordinary RST, our proposed model can be applied to common mining
problems such as dimensionality reduction, pattern extraction or classification.
Instrumenting SQL and its extensions enables us to implicitly cover further steps
of the classic knowledge discovery process, including selection and preprocess-
ing. Combined, we obtain a baseline toolkit for in-database mining which relies
on rough set methodology and database operations. It is natively applicable
without external software logic and at low communication costs. Additionally,
relational database engines have improved significantly over the last decades,
implementing both a high degree of query parallelism and physical operations
based on hash algorithms, which is a major factor in the efficiency of our model.
     The remainder is structured as follows: Section 2 presents important aspects
of RST. In Section 3 we review ideas and prototypes developed by other authors.
Section 4 restructures the concept approximation. The resulting propositions are
used in Section 5 to build a model based on database operations. Section 6 briefly
demonstrates how our model scales. Based on that, we present future work
(Section 7) and conclude in Section 8.

2     Rough Set Preliminaries
Proposed in the early 1980s by Zdzislaw Pawlak [11, 12], RST is a mathemat-
ical framework for analyzing data under vagueness and uncertainty. In this section
¹ See Data Mining Extensions for Microsoft SQL Server: https://msdn.microsoft.com/en-us/library/ms132058.aspx (June 2015)
  or Oracle Advanced Analytics: http://oracle.com/technetwork/database/options/advanced-analytics (June 2015)




we outline the principles of that theory: the basic data structures, including
the indiscernibility relation (Section 2.1), and the concept approximation (Sec-
tion 2.2).

2.1   Information Systems and Object Indiscernibility
Information in RST is structured in an Information System (IS) [13], i.e. a data
table consisting of objects and attributes. Such an IS can be expressed as a
tuple A = ⟨U, A⟩, where the universe of discourse U = {x1, ..., xn}, n ∈ N,
is a set of objects characterized by the feature set A = {a1, ..., am}, m ∈ N,
such that a : U → Va for all a ∈ A, where Va represents the value range of
attribute a. An extension of an IS is the Decision System (DS). A DS addition-
ally holds a set of attributes representing some context-specific decision. It
consists of common condition features A and decision attributes di ∈ D with
di : U → Vdi, 1 ≤ i ≤ |D|, and A ∩ D = ∅. A DS is denoted by AD = ⟨U, A, D⟩.
If a(x) = ⊥ for some a ∈ A ∪ D, i.e. a missing value, the underlying structure
is called incomplete; otherwise we call it complete.
    The indiscernibility relation classifies objects based on their characteristics.
Formally, it is a parametrizable equivalence relation with respect to a specified
attribute set and can be defined as follows: Let A = ⟨U, A⟩ be an IS and B ⊆ A.
Then the indiscernibility relation IND_A(B) = {(x, y) ∈ U² | a(x) = a(y) ∀a ∈ B}
induces a partition U/IND_A(B) = {K1, ..., Kp}, p ∈ N, of disjoint equivalence
classes over U with respect to B. For convenience we write IND_B or U/B to
denote the resulting partition.

2.2   Concept Approximation
To describe or predict an ordinary set of objects in the universe, RST provides an
approximation of that target concept using the indiscernibility relation. Let
A = ⟨U, A⟩ be an IS, B ⊆ A, and let X ⊆ U be a concept. Then the B-lower
approximation of the concept X can be specified through

                     X̲_B = ∪{K ∈ IND_B | K ⊆ X}                              (1)

while the B-upper approximation of X is defined as

                     X̄_B = ∪{K ∈ IND_B | K ∩ X ≠ ∅} .                        (2)

Traditionally, (1) and (2) are combined into a tuple ⟨X̲_B, X̄_B⟩, the rough set
approximation of X with respect to the knowledge in B. In a rough set, objects
in X̄_B can be asserted to be fully or partly contained in X, while objects in X̲_B
are surely in the concept. Hence, there may be equivalence classes which describe
X only in an uncertain fashion. These constitute the B-boundary X̂_B = X̄_B − X̲_B.
The characteristics of X̂_B give an indication of the roughness of ⟨X̲_B, X̄_B⟩: for
X̂_B = ∅ we can classify X decisively, while for X̂_B ≠ ∅ the information in B is
insufficient to describe X properly. The latter indicates an inconsistency in the
data. The remaining objects not involved in ⟨X̲_B, X̄_B⟩ are unimportant and can be



disregarded. This set is called the B-outside region and is the relative comple-
ment of X̄_B with respect to U, i.e. U − X̄_B.
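    As a small illustration, let U = {x1, ..., x5} with the induced partition
U/B = {{x1, x2}, {x3}, {x4, x5}} and the concept X = {x1, x2, x3, x4}. Then
X̲_B = {x1, x2, x3}, since only the classes {x1, x2} and {x3} are fully contained
in X, whereas X̄_B = U, because every class intersects X. The boundary is
X̂_B = {x4, x5}, reflecting that the knowledge in B cannot decide the member-
ship of x4 and x5, and the B-outside region is empty.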
    When we are interested in approximating all available concepts induced by
the decision attributes, RST provides corresponding general notations. Let
AD = ⟨U, A, D⟩, B ⊆ A and E ⊆ D. Then all decision classes induced by IND_E
can be expressed and analyzed by two sets, i.e. the B-positive region denoted as

                     POS_B(E) = ∪_{X ∈ IND_E} X̲_B                            (3)

and the B-boundary region

                     BND_B(E) = ∪_{X ∈ IND_E} X̂_B .                          (4)

POS_B(E) and BND_B(E) give a complete indication of whether the expressive-
ness of the attributes in B is sufficient to classify objects well in terms of the
decisions given in E. Based on that, the concept approximation is suitable for a
variety of data mining problems. Among others, it can be applied to quantify
imprecision, to induce rules, or to analyze feature dependencies, including core
and reduct computation for dimensionality reduction [12].

3   Related Work
The amount of existing RST literature intersecting with databases theory in-
creased continuously since the beginning. In this section we outline the most
relevant concepts and systems introduced by other authors.
     One of the first systems combining RST with database systems was intro-
duced in [5]. The presented approach exploits database potentials only partially,
because the SQL commands used are embedded in external programming logic.
Porting this sort-based implementation to in-database applications implies the
usage of procedural structures such as cursors, which is not favorable for process-
ing enormous amounts of data. In [14], the authors modify relational algebra to
calculate the concept approximation; database internals need to be touched, so
general applicability is not given. The approaches in [6, 7] utilize efficient rela-
tional algebra for feature selection. The algorithms bypass the concept approxi-
mation in favor of other elaborate rough set properties, which limits their appli-
cation to dimension reduction only. Sun et al. calculate rough sets based on
extended equivalence matrices inside databases [9]. Once data is transformed
into that matrix structure, the proposed methods apply, but they rely on proce-
dural logic rather than scalable database operations. The work of Nguyen aims
at reducing huge data loads in the knowledge discovery process [15]. To this end,
appropriate methods are introduced using simpler SQL queries to minimize traf-
fic in client-server architectures; the software design follows the one in [5]. In
[8], Chan transforms RST into a multiset decision table which allows calculating
the concept approximation with database queries. The initial construction of
such a data table relies on the execution of dynamic queries, helper tables and
row-by-row updates as stated in [16] and thus depends on inefficient preprocess-
ing. The work of Naouali et al. implements α-RST in data warehouse environ-
ments [17]. The algorithm relies on iterative processing and insert commands to




determine the final classification; details about its efficiency are not presented.
Another model is known as the rough relational database [18]. These systems
are based on multi-valued relations designed to query data under uncertainty.
Over the years, specific operations and properties of this theoretic model have
been further extended. The authors in [19] attempt to port the rough relational
data model to mature database systems; details of migrating its algebra are not
reported. Infobright is another database system that focuses on fast data pro-
cessing towards ad-hoc querying [20]. This is achieved by a novel data retrieval
strategy based on compression and inspired by RST. Data is organized under-
neath a knowledge grid, which is used to obtain estimated query results rather
than seeking costly information from disk, which is acceptable for some domains
of interest.
    Most of the discussed approaches utilize inefficient procedural structures or
external programs, or leverage relational operations for very specific subjects. In
contrast, we make use of existing, reliable and highly optimized database opera-
tions to compute the concept approximation without employing further proce-
dural mechanisms. With this, we extend the applicability of stand-alone data-
bases to a broader range of rough set mining problems.


4   Redefining the Concept Approximation

This section points out the formal ideas behind transferring Pawlak's concept
approximation to relational database systems by introducing a mapping of (1)
and (2) to rewritten set-oriented expressions. Those propositions can then be
applied to database algebra easily and additionally enable us to carry over the
positive region and the boundary region. We also show that these redefinitions
are not extensions of the traditional model, but equivalent terms.
    As explained in Section 2.2, a rough set ⟨X̲_B, X̄_B⟩ is typically extracted
from a concept X ⊆ U of an IS A = ⟨U, A⟩ for a specific attribute set B ⊆ A,
while the classification of each object is based on the induced partition U/B. At
this point, we make use of X/B := X/IND_A(B) = {H1, ..., Hq}, q ∈ N, to re-
structure the concept approximation of X. We can deduce two possible relation-
ships between classes H ∈ X/B and K ∈ U/B: H ∩ K ≠ ∅ and H ∩ K = ∅. This
basic idea leads to two propositions, which we discuss in the remainder of this
section:

                     X̲_B = ∪{H ∈ U/B | H ∈ X/B}                              (5)

Proof. Considering the classes H ∈ X/B, the following two cases are of interest
for forming the B-lower approximation: (a) ∃K ∈ U/B : K = H ⊆ X and (b)
∃K ∈ U/B : K ≠ H and K ∩ H ≠ ∅. Case (b) implies ∃z ∈ K : z ∉ X
and thus K ⊄ X. As a result, only classes K = H are relevant. Likewise, (1)
only contains objects of classes K ∈ U/B where K ⊆ X. We consider X/B,
which induces classes H ∈ U/B and H′ ∉ U/B, because X ⊆ U. Combined, we
immediately arrive at (5).                                                   □
                     X̄_B = ∪{K ∈ U/B | ∃H ∈ X/B : H ⊆ K}                     (6)




Proof. On the one hand, the partition X/B can only produce equivalence classes
H, H′ ⊆ X which satisfy H ∈ U/B and H′ ∉ U/B. Obviously, those H are mem-
bers of the B-lower approximation, whereas each class H′ has a matching partner
class K with H′ ⊂ K ∈ U/B; these classes K build the B-boundary. With the
classes H and K we directly obtain X̄_B = X̲_B ∪ X̂_B. On the other hand, X̄_B
holds the objects of classes K ∈ U/B with K ∩ X ≠ ∅ (see (2)), i.e. each class
K = H ∈ X/B and each class K with K ⊃ H for some H ∈ X/B. This is exactly
what (6) proposes.                                                           □
    Up to this point, the B-boundary and the B-outside region remain un-
touched by the restructuring, since both sets build on the B-lower and B-upper
approximation. They hold equally under the propositions (5) and (6) and under
the classical rough set model.

5     Combining RST and Database Systems
5.1   Information Systems and Database Tables
The IS is a specific way of organizing data, similar to a data table in relational
database terms, but there are essential differences in their scientific scopes [13].
While an IS is used to discover patterns in a snapshot fashion, the philosophy
of databases concerns long-term data storage and retrieval [21].
    However, we bridge these gaps by simply mapping an IS or DS into the rela-
tional database domain as follows: Let AD = ⟨U, A, D⟩ be given with the uni-
verse U = {x1, ..., xn}, n ∈ N, the features A = {a1, ..., am}, m ∈ N, and the
decision attributes D = {d1, ..., dp}, p ∈ N. Then we use the traditional notation
of an (m + p)-ary database relation R ⊆ Va1 × ... × Vam × Vd1 × ... × Vdp, where
Vai and Vdj are the attribute domains of ai and dj, 1 ≤ i ≤ m, 1 ≤ j ≤ p.
    In database theory, the order of attributes in a relation schema is significant
to both semantics and operations. We therefore restrict ourselves to finite or-
dered attribute sets and write A = {a1, ..., aq}, q ∈ N, for the ordered appear-
ance of the attributes in relation R. We write RA as a shorthand, or RA+D to
identify a decision table. Furthermore, modern databases permit duplicate tu-
ples within their relational structures. We adopt this feature for its practical
relevance and designate such relations as database relations or data tables.
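    To illustrate, such a decision table can be materialized as an ordinary SQL
table. The following minimal sketch uses the geometric example of Figure 1; the
table name geo and the column types are chosen for illustration only:

    -- a complete decision table R_{A+D} with A = {shape, color}, D = {d};
    -- duplicate tuples are permitted, matching the multiset semantics above
    CREATE TABLE geo (
      shape VARCHAR(16) NOT NULL,   -- condition attribute
      color VARCHAR(16) NOT NULL,   -- condition attribute
      d     INTEGER     NOT NULL    -- decision attribute
    );
    INSERT INTO geo VALUES ('triangle', 'green', 0), ('circle', 'red', 1);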

5.2   Indiscernibility and Relational Operations
Inspired by [5–7], we make use of extended relational algebra to calculate the
partition induced by the indiscernibility relation. The projection operation
πB(RA) projects tuples t ∈ RA to a specified feature subset B ⊆ A while elim-
inating duplicates. Thus, we get each class represented by a proxy tuple with
schema B. A column reduction without duplicate elimination is indicated by
π⁺B(RA). Additionally, we introduce the selection operation σφ(RA) with filter
property φ and output schema A. Given a geometric repository RA+D (see Fig-
ure 1), we may query the objects x ∈ RA+D that are colored red by
σx.color=red(RA+D).
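    In SQL terms, these operators correspond to familiar constructs. A sketch,
assuming the repository is stored in the table geo(shape, color, d) introduced
above:

    -- projection π_B(R_{A+D}) with B = {shape, color}: duplicates eliminated
    SELECT DISTINCT shape, color FROM geo;

    -- selection σ_{x.color=red}(R_{A+D}): output schema unchanged
    SELECT * FROM geo WHERE color = 'red';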
    Most relational database systems provide an extension of πB(RA), the
grouping operator γ. It groups the tuples of a relation RA if they share identical
values over a specified attribute set G ⊆ A, the grouping attributes.
Each group is represented only once in the resulting relation through a proxy
tuple (as with the π-operator). In addition, γ can be enriched with aggregation
functions f1, ..., fn that are applied to each group during the grouping phase.
In general, this operation can be defined as γF;G;B(RA), where B are the output
features with B ⊆ G ⊆ A and F = {f1, ..., fn}, n ∈ N0. For our purpose we
simply count the number of members in each single group (class) of RA, i.e. the
cardinality expressed by the aggregate count(*), and include it as a new feature.
Consolidated, we make use of the following notation

            I^G_B(RA) := ρcard←count(*)(γ{count(*)};G;B(RA))                  (7)

where ρb←a(RA) is the renaming operation of an arbitrary attribute a ∈ A to
its new name b in table RA. Then I^B_B(RA) denotes our compressed multiset
representation of a given database table RA with respect to the feature set
B ⊆ A. An illustration of this composed operation and its parallels to RST is
depicted in Figure 1 with A = {shape, color}, D = {d} and B = A.

         RA+D                    I^B_B(RA+D)            I^{B+D}_B(RA+D)
         shape    color  d       card shape    color    card shape    color
     x1  triangle green  0         2  triangle green      2  triangle green
     x2  circle   red    1         4  circle   red        2  circle   red
     x3  circle   red    1         3  circle   white      2  circle   red
     x4  circle   white  1         1  square   yellow     3  circle   white
     x5  square   yellow 1                                1  square   yellow
     x6  circle   white  1
     x7  triangle green  0       U/A partitions RA+D into the equivalence
     x8  circle   white  1       classes K1 = {x1, x7}, K2 = {x2, x3, x9, x10},
     x9  circle   red    0       K3 = {x4, x6, x8} and K4 = {x5}.
     x10 circle   red    0

        Fig. 1. Mapping the object indiscernibility to relational data tables
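    The composed operation (7) maps directly onto SQL grouping. A sketch
that reproduces the two multiset representations of Figure 1 from the table
geo(shape, color, d):

    -- I^B_B(R_{A+D}): one proxy tuple per class over B = {shape, color},
    -- together with the class cardinality
    SELECT COUNT(*) AS card, shape, color
    FROM geo
    GROUP BY shape, color;

    -- I^{B+D}_B(R_{A+D}): group over B ∪ D, but output only card and B;
    -- inconsistent classes thus appear once per decision value
    SELECT COUNT(*) AS card, shape, color
    FROM geo
    GROUP BY shape, color, d;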


5.3   Mapping the Concept Approximation

In practice, the extraction process for a single target concept may vary depend-
ing on the domain and the underlying data model. In most cases an ordinary
target concept can be modelled through decision attributes, i.e. a decision table.
However, there are domains of interest where the concept is located outside the
original data table; this is especially the case in highly normalized environments.
We support both approaches within the boundaries of the relational model. As
a simplification, we assume that the target concept CA and the original data
collection RA are given either through adequate relational operations or natively
as relations, where CA is a subset of RA.
    Taking this and the previous sections into account, we are now able to ex-
press the classical concept approximation in terms of relational algebra and its
extensions: Let RA represent the universe and CA our target concept to be
examined with feature subset B ⊆ A. Then the B-lower approximation of




the concept can be expressed by

            LB(RA, CA) := I^B_B(CA) ∩ I^B_B(RA) .                             (8)
Initially, (8) establishes the partitions for CA and RA independently. The inter-
section then holds only those equivalence classes included with their full cardi-
nality in both induced partitions, i.e. the B-lower approximation in terms of
RST (see (5)). The B-upper approximation contains all equivalence classes asso-
ciated with the target concept. Thus, we can simply extract one representative
of each of these classes from the induced partition of CA using the information
in B. However, this information is not sufficient to obtain the correct cardinality
of the classes involved. Hence we must consult the data space of RA in order to
find the number of all equivalent objects. That methodology can be expressed
through

            UB(RA, CA) := πB(CA) ⋈ I^B_B(RA)                                  (9)
where ⋈ is the natural join operator, assembling two data tables SG, TH into
a new relation R such that s.b = t.b for all tuples s ∈ SG, t ∈ TH and attributes
b ∈ G ∩ H. Note that R consists of all attributes in G and H, where overlapping
attributes appear only once. As a result, we get all equivalence classes involved
in the B-upper approximation together with their cardinalities (see (6)). Classi-
cally, the B-boundary consists of the objects located in the set difference of
B-upper and B-lower approximation. Because of the structural unity of
LB(RA, CA) and UB(RA, CA), it can be expressed by

            BB(RA, CA) := UB(RA, CA) − LB(RA, CA) .                          (10)

Equivalence classes outside the concept approximation can be found by search-
ing for tuples not included in the B-upper approximation. With the support of
both I^B_B(RA) and UB(RA, CA), we therefore get the B-outside region

            OB(RA, CA) := I^B_B(RA) − UB(RA, CA) .                           (11)
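    These four expressions translate into set operations readily available in SQL.
A sketch of (8) and (9) for B = {shape, color}, assuming the universe is stored
in the table geo and the target concept in a table geo_concept of the same
schema; engines without NATURAL JOIN (e.g. MSSQL) can use an explicit inner
join on the B-columns instead:

    -- B-lower approximation (8): classes contained in the concept
    -- with their full cardinality
    SELECT COUNT(*) AS card, shape, color FROM geo_concept GROUP BY shape, color
    INTERSECT
    SELECT COUNT(*) AS card, shape, color FROM geo GROUP BY shape, color;

    -- B-upper approximation (9): proxy tuples of the concept classes,
    -- joined back to the universe partition for the correct cardinalities
    SELECT card, shape, color
    FROM (SELECT COUNT(*) AS card, shape, color
          FROM geo GROUP BY shape, color) AS i
    NATURAL JOIN (SELECT DISTINCT shape, color FROM geo_concept) AS p;

The boundary (10) and the outside region (11) then follow by combining these
queries with EXCEPT.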
    To present an equivalent relational mapping of (3) and (4), we first need
a methodology that allows us to query each target concept separately. Within
a decision table RA+D, let us assume the partition induced by the information
in E ⊆ D consists of n decision classes. For each of these classes we can find an
appropriate condition φi, 1 ≤ i ≤ n, that extracts the associated tuples
t ∈ RA+D belonging to concept CA^φi. One can simply think of a walk through
πE(RA+D): in the i-th iteration we fetch the decision values, say v1, ..., vm, for
the corresponding features in E = {d1, ..., dm}, m ∈ N, and build
φi = ∧_{1≤j≤m} t.dj = vj. Thus, we have access to each decision class
CA^φi = π⁺A(σφi(RA+D)) produced by E. With this idea in mind and supported
by (8) we are now able to introduce the B-positive region: In a decision table
RA+D with B ⊆ A, E ⊆ D, the B-positive region is the union of all B-lower
approximations induced by the attributes in E. Those concepts can be retrieved
as CA^φi, 1 ≤ i ≤ n, where n is the cardinality of πE(RA+D). As a consequence
we get

            LB(RA+D, CA^φ1) ∪ ... ∪ LB(RA+D, CA^φn)                          (12)




which can be rewritten as

            ∪_{i=1,...,n} I^B_B(CA^φi) ∩ I^B_B(RA+D)                         (13)

such that we finally have the B-positive region in relational terms, defined over
a decision table as

            L^E_B(RA+D) := πB′(I^{B+E}_B(RA+D)) ∩ I^B_B(RA+D)                (14)

with B′ = {card, b1, ..., bk}, bj ∈ B, 1 ≤ j ≤ k. Likewise, the B-boundary region
consists of the tuples in UB(RA+D, CA^φ1) ∪ ... ∪ UB(RA+D, CA^φn) but not in
L^E_B(RA+D), where CA^φi, 1 ≤ i ≤ n, are the separated target concepts induced
by E. Hence, we can query these through

            ∪_{i=1,...,n} UB(RA+D, CA^φi) − L^E_B(RA+D)                      (15)

which, in a complete decision table, is equivalent to

            I^B_B(RA+D) − (πB′(I^{B+E}_B(RA+D)) ∩ I^B_B(RA+D))               (16)

and immediately leads to our definition of the B-boundary region

            B^E_B(RA+D) := I^B_B(RA+D) − πB′(I^{B+E}_B(RA+D))                (17)

where B′ = {card, b1, ..., bk}, bj ∈ B, 1 ≤ j ≤ k. Note that we deduced
L^E_B(RA+D) and B^E_B(RA+D) directly from (8) and (9). For practical reasons,
a further simplification can be applied by removing the π-operator. One may
verify that this change still preserves the exact same result set, because both
expressions initially rely on I^B_B(RA+D).
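    In SQL, (14) and (17) again reduce to grouping combined with set opera-
tions, which both PSQL and MSSQL support via INTERSECT and EXCEPT. A
sketch over the decision table geo(shape, color, d) with B = {shape, color} and
E = {d}; since INTERSECT and EXCEPT eliminate duplicates, the π-operator is
implicit:

    -- B-positive region (14): classes that are pure with respect to d
    SELECT COUNT(*) AS card, shape, color FROM geo GROUP BY shape, color, d
    INTERSECT
    SELECT COUNT(*) AS card, shape, color FROM geo GROUP BY shape, color;

    -- B-boundary region (17): the remaining, inconsistent classes
    SELECT COUNT(*) AS card, shape, color FROM geo GROUP BY shape, color
    EXCEPT
    SELECT COUNT(*) AS card, shape, color FROM geo GROUP BY shape, color, d;

On the data of Figure 1, the first query yields the classes K1, K3 and K4, while
the second yields K2, the only class that mixes both decision values.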

6     Experimental Results
In this section we present initial experimental results, applying the expressions
derived in Section 5.3 to some well-known data sets and two database systems.
The objective of this experiment is to demonstrate the performance of our model
in a conservative test environment, without major optimization steps such as
indices, table partitioning or compression strategies. Thus, we get an impression
of how the model behaves natively in different databases. We chose PostgreSQL
(PSQL) and Microsoft SQL Server (MSSQL) as two prominent engines providing
the required relational operations. The hardware profile² represents a stand-
alone server environment commonly used in small and medium-sized organiza-
tions. Most of our benchmark data sets are taken from [22] and vary in data
types and distribution; Table 1 gives further details. Both PSQL and MSSQL
produce similar query plans based on hash
² OS: Microsoft Windows 2012 R2 (Standard edition x64); DBs: Microsoft SQL Server
  2014 (Developer edition 12.0.2, 64-bit), PostgreSQL 9.4 (compiled by Visual C++
  build 1800, 64-bit); Memory: 24 GByte; CPU: 16x2.6 GHz Intel Xeon E312xx (Sandy
  Bridge); HDD: 500 GByte




           Table 1. Summarized characteristics of the assessed data sets
                 Data set      Records     |A| |D|   |IND_A|    |IND_D|    |C_A|
              HIGGS [24]     11.000.000   28    1 10.721.302         2 5.829.123
              RLCP [25]       5.749.132   11    1  5.749.132         2 5.728.201
              SUSY [24]       5.000.000   18    1  5.000.000         2 2.712.173
              KDD99           4.898.431   41    1  1.074.974        23 2.807.886
              KDD99 M         4.898.431   42    1  1.075.016        23 2.807.886
              PAMAP2 [26] 3.850.505       53    1  3.850.505        19 1.125.552
              Poker Hand      1.025.010   10    1  1.022.771        10 511.308
              Covertype [23]    581.012   54    1    581.012         7 297.711
              NSL-KDD [27]      148.517   41    1    147.790         2    71.361
              Spambase            4.601   57    1      4.207         2     1.810




algorithms, which we review briefly to understand the principles: The initial
stage consists of scanning the two input sources from disk, followed by hash
aggregations. Finally, both aggregated inputs are fused using the hash join
operator. Note that a hash aggregation requires only a single scan of the given
input to build the resulting hash table. The hash join relies on a build and probe
phase where essentially each of the two incoming inputs is scanned only once. In
comparison to other alternatives, these query plans perform without sorting but
require memory to build up the hash tables. Once a query runs out of memory,
additional buckets are spilled to disk, which was not the case throughout our
series of experiments. Even though both engines share similar algorithms,
MSSQL is capable of running the queries in parallel, while PSQL supports
single-core processing only. In general, we observed very high CPU usage, which
is characteristic of the performance of our model. However, we further observed
that MSSQL does not scale well when processing KDD99, because it is unable
to distribute the workload evenly across all threads. We attribute this issue to
the lack of appropriate statistics in the given raw environment and to the data
distribution, in which three equivalence classes represent 51% of all records.
Therefore, we introduce a revised version called KDD99 M. It holds an addi-
tional condition attribute splitting huge classes into chunks of 50K records. Note
that this change does not influence the approximation, but results in a speed-up
of 76%. Further details of the runtime comparison are given in Figure 2. In
summary, we achieved reasonable response times without major optimization
steps. In particular, our model scales well when adding cores in 9 out of 10 tests.
Supported by this characteristic, MSSQL computes most queries within a few
seconds.

7   Future Work
The evaluation in the previous section shows how our RST model behaves in
a native relational environment. However, further practical experiments are re-
quired, which we will address in the near future. In our domain of interest, i.e.
network and data security, we will study classical as well as modern cyber attack
scenarios in order to extract the significant features of each single attack in both
IPv4 and IPv6 environments. Our model is well suited for that subject, because
it is designed to process huge amounts of data efficiently and can handle
uncertainty, which is required for proper intrusion detection.




[Figure 2: six bar-chart panels reporting runtimes in seconds for the A-positive
region, A-boundary region, A-lower approximation, A-upper approximation,
A-boundary approximation and A-outside region, each measured on the data
sets HIGGS, RLCP, SUSY, KDD99, KDD99_M, PAMAP2, Poker Hand, Cover-
type, NSL-KDD and Spambase for PSQL (1 core) and MSSQL (1, 8 and 16
cores).]

         Fig. 2. Runtime comparison of the proposed rough set model in seconds


Additionally, we will use the output of our model to generate precise and char-
acteristic attack signatures from incoming traffic and construct a rule-based
classifier. Enabled by in-database capabilities, we can compute the resulting
decision rules in parallel and integrate that approach into our existing data
store. Hence, we can avoid huge data transports, which is crucial for our near-
real-time system.

8                  Conclusion
Traditional Rough Set Theory has become a very popular framework for ana-
lyzing and classifying data based on equivalence relations. In this work we pre-
sented an approach that transfers the concept approximation of that theory to
the domain of relational databases in order to make use of the well-established
and efficient algorithms supported by these systems. Our model is defined on
complete data tables and compatible with data inconsistencies. The evaluation
on various prominent data sets showed promising results: the queries achieved
low latency with only minor optimization and preprocessing effort. We therefore
assume our model is suitable for a wide range of disciplines analyzing data within
its relational sources. With this, we introduced a compact mining toolkit which
is based on rough set methodology and enabled for in-database analytics. It can
be utilized immediately to explore data sets efficiently, to expose decision rules,
and to identify significant features or data inconsistencies, all of which are com-
mon challenges in the process of knowledge discovery in databases.

Acknowledgments. The authors deeply thank Maren and Martin who pro-
vided expertise and excellent support in the initial phase of this work.


References
1. M. Gawrys, J. Sienkiewicz: RSL - The Rough Set Library - Version 2.0. Technical
   report, Warsaw University of Technology (1994).
2. I. Düntsch, G. Gediga: The Rough Set Engine GROBIAN. In: Proc. of the 15th
   IMACS World Congress, pp. 613–618 (1997).
3. A. Ohrn, J. Komorowski: ROSETTA - A Rough Set Toolkit for Analysis of Data.
   In: Proc. of the 3rd Int. Joint Conf. on Information Sciences, pp. 403–407 (1997).




4. J.G. Bazan, M. Szczuka: The Rough Set Exploration System. TRS III, LNCS, vol.
   3400, pp. 37–56 (2005).
5. M.C. Fernandez-Baizán, E. Menasalvas Ruiz, J.M. Peña Sánchez: Integrating RDMS
   and Data Mining Capabilities using Rough Sets. In: Proc. of the 6th Int. Conf. on
   IPMU, pp. 1439–1445 (1996).
6. A. Kumar: New Techniques for Data Reduction in a Database System for Knowledge
   Discovery Applications. JIIS, vol. 10(1), pp. 31–48 (1998).
7. X. Hu, T.Y. Lin, J. Han: A new Rough Set Model based on Database Systems. In:
   Proc. of the 9th Int. Conf. on RSFDGrC, LNCS, vol. 2639, pp. 114–121 (2003).
8. C.-C. Chan: Learning Rules from Very Large Databases using Rough Multisets.
   TRS I, LNCS, vol. 3100, pp. 59–77 (2004).
9. H. Sun, Z. Xiong, Y. Wang: Research on Integrating ORDBMS and Rough Set
   Theory. In: Proc. of the 4th Int. Conf. on RSCTC, LNCS, vol. 3066, pp. 169–175
   (2004).
10. T. Tileston: Have Your Cake & Eat It Too! Accelerate Data Mining Combining
   SAS & Teradata. In: Teradata Partners 2005 ”Experience the Possibilities” (2005).
11. Z. Pawlak: Rough Sets. Int. Journal of Computer and Information Science, vol.
   11(5), pp. 341–356 (1982).
12. Z. Pawlak: Rough Sets - Theoretical Aspects of Reasoning about Data (1991).
13. Z. Pawlak: Information Systems - Theoretical Foundations. Inform. Systems, vol.
   6(3), pp. 205–218 (1981).
14. F. Machuca, M. Millan: Enhancing Query Processing in Extended Relational
   Database Systems via Rough Set Theory to Exploit Data Mining Potentials. Knowl-
   edge Management in Fuzzy Databases, vol. 39, pp. 349–370 (2000).
15. H.S. Nguyen: Approximate Boolean Reasoning: Foundations and Applications in
   Data Mining. TRS V, LNCS, vol. 4100, pp. 334–506 (2006).
16. U. Seelam, C.-C. Chan: A Study of Data Reduction Using Multiset Decision Tables.
   In: Proc. of the Int. Conf. on GRC, IEEE, pp. 362–367 (2007).
17. S. Naouali, R. Missaoui: Flexible Query Answering in Data Cubes. In: Proc. of the
   7th Int. Conf. of DaWaK, LNCS, vol. 3589, pp. 221–232 (2005).
18. T. Beaubouef, F.E. Petry: A Rough Set Model for Relational Databases. In: Proc.
   of the Int. Workshop on RSKD, pp. 100–107 (1993).
19. L.-L. Wei, W. Zhang: A Method for Rough Relational Database Transformed into
   Relational Database. In: Proc. of the Int. Conf. on SSME, IEEE, pp. 50–52 (2009).
20. D. Slezak, J. Wroblewski, V. Eastwood, P. Synak: Brighthouse: An Analytic Data
   Warehouse for Ad-hoc Queries. In: Proc. of the VLDB Endowment, vol. 1, pp.
   1337–1345 (2008).
21. T.Y. Lin: An Overview of Rough Set Theory from the Point of View of Relational
   Databases. Bulletin of IRSS, vol. 1(1), pp. 30–34 (1997).
22. K. Bache and M. Lichman: UCI Machine Learning Repository. University of Cali-
   fornia, Irvine, http://archive.ics.uci.edu/ml (June, 2015).
23. J.A. Blackard, D.J. Dean: Comparative Accuracies of Neural Networks and Dis-
   criminant Analysis in Predicting Forest Cover Types from Cartographic Variables.
   In: Second Southern Forestry GIS Conf., pp. 189–199 (1998).
24. P. Baldi, P. Sadowski, D. Whiteson: Searching for Exotic Particles in High-energy
   Physics with Deep Learning. Nature Communications 5 (2014).
25. I. Schmidtmann, G. Hammer, M. Sariyar, A. Gerhold-Ay: Evaluation des Kreb-
   sregisters NRW Schwerpunkt Record Linkage. Technical report, IMBEI (2009).
26. A. Reiss, D. Stricker: Introducing a New Benchmarked Dataset for Activity Mon-
   itoring. In: Proc. of the 16th ISWC, IEEE, pp. 108–109 (2012).
27. NSL-KDD: Data Set for Network-based Intrusion Detection Systems.
   http://nsl.cs.unb.ca/NSL-KDD (June, 2015).



