<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Patterns via Clustering as a Data Mining Tool</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lars Lumpe</string-name>
          <email>larslumpe@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan E. Schmidt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut für Algebra, Technische Universität Dresden</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Research shows that pattern structures are a useful tool for analyzing complex data. In this paper, we present a new general framework for using pattern structures as a data mining tool, and as an application of the new framework we show how to handle a classification problem for red wines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Pattern structures within the framework of formal concept analysis have been
introduced in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Since then they have been the subject of further
investigations like in [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref2 ref9">2, 9, 10, 11, 12</xref>
        ] and have turned out to be a useful tool for analyzing
various real-world applications (cf. [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8">4–8</xref>
        ]). In this paper, we want to present a
new application. But first we will introduce a general way to construct a pattern
structure. Then we are going to use several clustering algorithms to find
important patterns in it. Further, we will use these patterns to build a model to solve a
classification problem. In particular, we will take the dataset from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and train
an algorithm to predict the quality of red wines.
      </p>
      <sec id="sec-1-1">
        <title>We start with some general definitions:</title>
        <p>Definition 1 (restriction). Let P := (P, ≤P) be a poset, then for every set U
the poset</p>
        <p>P | U := (U, ≤P ∩(U × U ))
is called the restriction of P onto U .</p>
        <p>If we consider a poset of patterns, it often arises as a dual of a given poset.
Definition 2 (opposite or dual poset). Let P := (P, ≤P) be a poset. Then we
call</p>
        <p>Pop := (P, ≥P) with ≥P:= {(q, p) ∈ P × P | p ≤ q}
the opposite or dual of P.</p>
        <p>Definition 3 (Interval Poset). Let P := (P, ≤P) be a poset. Then we call</p>
        <p>IntP := {[p, q]P | p, q ∈ P } with [p, q]P := {t ∈ P | p ≤P t ≤P q}
the set of all intervals of P, and we refer to IntP := (IntP, ⊆) as the interval
poset of P.</p>
        <p>Remark: (i) Let P be a poset. Then IntP is a lower bounded poset, that is, ∅ is
the least element of IntP, provided P has at least 2 elements. Furthermore, if P
is a (complete) lattice then so is IntP.
(ii) If A is a set of attributes then (RA, ≤) := (R, ≤)A is a lattice. With (i) it
follows that Int(RA, ≤) is a lower bounded lattice.</p>
        <p>Definition 4 (kernel operator). A kernel operator on a poset P := (P, ≤) is
a map γ : P → P such that for all x, y ∈ P :
γx ≤ y ⇔ γx ≤ γy
(1)</p>
        <p>A subset ζ of P is called a kernel system in P if for every x ∈ P the
restriction of P onto {t ∈ ζ | t ≤ x} has a greatest element.</p>
        <p>Remark: A closure operator on P := (P, ≤) is defined as a kernel operator
on Pd, and a closure system in P is defined as a kernel system in Pd.</p>
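Condition (1) can be checked mechanically on a small poset. The following toy illustration (ours, not from the paper) takes the power set of {a, b, c} ordered by inclusion and the map γX = X ∩ K for a fixed set K, and verifies condition (1) together with its classical consequences:

```python
from itertools import combinations

# Toy check: on the poset (2^{a,b,c}, ⊆), the map gamma(X) = X ∩ K
# satisfies condition (1):  gamma(x) ≤ y  ⇔  gamma(x) ≤ gamma(y),
# and is therefore a kernel operator.

K = frozenset({"a", "b"})

def gamma(x):
    return x & K

universe = ["a", "b", "c"]
subsets = [frozenset(c) for r in range(4) for c in combinations(universe, r)]

for x in subsets:
    assert gamma(x) <= x                    # contractive
    assert gamma(gamma(x)) == gamma(x)      # idempotent
    for y in subsets:
        assert (gamma(x) <= y) == (gamma(x) <= gamma(y))  # condition (1)
        if x <= y:
            assert gamma(x) <= gamma(y)     # monotone
print("gamma is a kernel operator on (2^{a,b,c}, ⊆)")
```

Dualizing the order turns this into a closure operator, in line with the remark above.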
        <p>The main definitions of FCA are based on a binary relation I between a
set of so called objects G and a set of so called attributes M . However, in many
real-world knowledge discovery problems, researchers have to deal with data sets
that are much more complex than binary data tables. In our case, there was a set
of numerical attributes, such as the amount of acetic acid, density, the amount
of salt, etc., describing the quality of a red wine. To deal with this kind of data,
pattern structures are a useful tool.</p>
        <p>Definition 5 (pattern setup, pattern structure). A triple P = (G, D, δ) is a
pattern setup if G is a set, D = (D, ⊑) is a poset, and δ : G → D is a map. In
case every subset of δG := {δg | g ∈ G} has an infimum in D, we will refer to P
as a pattern structure.</p>
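As a minimal illustration (ours, not from the paper), consider a pattern setup where D is a power set ordered dually by ⊇; infima with respect to ⊇ are unions, so every subset of δG has an infimum and the setup is in fact a pattern structure:

```python
# Hypothetical pattern setup (G, D, delta) with D = (2^L, ⊇).
# With respect to ⊇, the infimum of a family of patterns is its union,
# so this setup is a pattern structure.

G = ["g1", "g2", "g3"]                # objects
delta = {                             # delta : G -> 2^L
    "g1": frozenset({"a"}),
    "g2": frozenset({"b"}),
    "g3": frozenset({"a", "c"}),
}

def infimum(patterns):
    """Infimum in (2^L, ⊇): the union of the given patterns."""
    out = frozenset()
    for p in patterns:
        out |= p
    return out

# The common (most general) pattern of all objects:
print(sorted(infimum(delta.values())))  # ['a', 'b', 'c']
```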
        <p>
          For pattern structures, an important complexity reduction is often provided
by so-called o-projections:
Proposition 1 (o-projection). For a pattern structure P := (G, E, ε) and a
kernel operator κ on E, the triple
opr(P, κ) := (G, D, δ)
with D := E|D where D := κE and
δ : G → D, g ↦ κ(εg)
is a pattern structure, called the o-projection of P via κ (see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]).
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Construction of a pattern structure</title>
      <p>The following definition establishes the connection between pattern setups and
pattern structures.</p>
      <p>Definition 6 (embedded pattern structure). Let E be a complete lattice and
P := (G, D, δ) a pattern setup with D := E|D. Then we call</p>
      <p>Pe := (G, E, D, δ) with δ : G → E, g ↦ δg</p>
      <p>the embedded pattern structure. We say the pattern setup P is embedded
in the pattern structure (G, E, δ).</p>
      <p>This definition shows that it is possible to build a pattern structure from a
given pattern setup. The construction below is another demonstration of how to
build a pattern structure from a pattern setup.</p>
      <p>Construction 1. Let G be a set and let L := (L, ≤) be a poset, then for every
map ρ : G → L the elementary pattern structure is given by:</p>
      <p>Pρ := (G, E, ε) with E := (2L, ⊇) and ε : G → 2L, g ↦ {ρg}.</p>
      <p>Hence, the pattern setup (G, L, ρ) is embedded in the pattern structure
(G, E, ε). In many cases this construction leads to a large set of patterns.
Therefore we need the following: Let ζ be a closure system in 2L := (2L, ⊆), that is,
ζ is a kernel system in E, and let γ : 2L → 2L be the associated closure operator
of ζ w.r.t. 2L. Thus, γ is a kernel operator on E. Then (G, D, δ) is a pattern
structure for D := E|ζ and</p>
      <p>δ : G → D, g ↦ γ{ρg}.</p>
      <sec id="sec-2-1">
        <title>Indeed, the map</title>
        <p>ψ : 2L → ζ, X ↦ γX</p>
        <p>is a residual map from E to D with δ = ψ ◦ ε. By the above proposition, (G, D, δ) is a
pattern structure, since</p>
        <p>opr(Pρ, γ) = (G, D, δ)</p>
        <p>is the o-projection of Pρ via γ.</p>
        <p>
          Connection to Data Mining and to our Dataset.
In this section we describe how we use Construction 1 to handle a typical
data mining classification problem: we train a model to predict classes of red
wines. But first we give an insight into our data.
To apply our previous results to a public data set, we chose the red wine data
set from [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. There are 1599 examples of wines, described by 11 numerical
attributes. The input includes objective tests (e.g. fixed acidity, sulphates, pH
values, residual sugar, chlorides, density, alcohol...) and the output is based on
sensory data (the median of at least 3 evaluations made by wine experts). Each
expert graded the wine quality between 0 (very bad) and 10 (very excellent). For
our purpose, we established a binary distinction where every wine with a quality
score above 5 is classified as "good" and all others as "bad". This led to a set of 855
positive and 744 negative examples. We split the data into a training set (75%
of the examples) and a test set (25% of the examples).
        </p>
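The preprocessing described above can be sketched as follows. The file name winequality-red.csv, the semicolon separator, and the column name quality follow the UCI distribution of this data set and are assumptions on our part:

```python
import csv
import random

# Sketch of the preprocessing: load the wine data, binarize the
# quality score (good iff quality > 5), and split 75% / 25%.
# File layout follows the UCI "winequality-red.csv" distribution.
def load_and_split(path="winequality-red.csv", train_frac=0.75, seed=0):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter=";"))
    examples = []
    for row in rows:
        quality = int(row.pop("quality"))
        features = {k: float(v) for k, v in row.items()}
        label = "good" if quality > 5 else "bad"   # binary distinction
        examples.append((features, label))
    random.Random(seed).shuffle(examples)
    cut = int(train_frac * len(examples))
    return examples[:cut], examples[cut:]          # train, test
```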
        <p>Describing the Proceeding.
Many data mining relevant data sets (like the red wine dataset) can be described
via an evaluation matrix:</p>
        <p>Definition 7 (evaluation map, evaluation setup). Let G be a finite set, M a
set of attributes and Wm := (Wm, ≤m) a complete lattice for every attribute
m ∈ M. Further, let</p>
      </sec>
      <sec id="sec-2-2">
        <title>W := ∏m∈M Wm (the product lattice) and W := ∏m∈M Wm (its carrier set).</title>
        <p>Then, a map</p>
        <p>α : G → W, g ↦ (αm g)m∈M with αm : G → Wm, g ↦ wm,</p>
        <p>
          is called an evaluation map. We call E := (G, M, W, α) an evaluation setup.
Example 1. In the wine data set in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] we can interpret the wines as the set
G, the describing attributes as the set M and Wm as the numerical range of
attribute m with the natural order.
        </p>
        <p>In the above example the evaluation map</p>
        <p>α : G → W</p>
        <p>assigns to every wine bottle the values of all attributes m ∈ M. This map is a
good starting point for an elementary pattern structure with</p>
        <p>L := ∏m∈M Wm (the product lattice) and L := ∏m∈M Wm (its carrier set).</p>
        <p>Thus, E := (2L, ⊇) is the dually ordered power set of vectors with values of
the attributes, which describe the wine. On E we installed the following kernel
operator</p>
        <p>γ : 2L → 2L, X ↦ [infL X, supL X]L.</p>
        <p>As a matter of fact,</p>
        <p>D = (D, ⊇) with D = IntL</p>
        <p>is the dual interval lattice of L, and the map δ is given by</p>
        <p>δ : G → D, g ↦ γ{ρg}.</p>
        <p>This leads to the o-projection of the elementary pattern structure Pρ via γ, that
is,</p>
        <p>(G, D, δ) := opr(Pρ, γ).</p>
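Concretely, the kernel operator γ sends a set of attribute vectors to the smallest interval containing it, which coordinatewise is the box between the componentwise minima and maxima. A minimal sketch (ours; vectors are represented as tuples of floats):

```python
# Sketch of the interval closure gamma: a non-empty set X of
# attribute vectors is sent to the box [inf X, sup X], computed
# coordinatewise.
def gamma(X):
    """X: non-empty collection of equal-length tuples of floats."""
    lower = tuple(min(col) for col in zip(*X))
    upper = tuple(max(col) for col in zip(*X))
    return lower, upper               # the interval [inf X, sup X]

def contains(interval, v):
    """Membership of a vector v in the interval [lower, upper]."""
    lower, upper = interval
    return all(l <= x <= u for l, x, u in zip(lower, v, upper))

box = gamma([(0.1, 0.5), (0.3, 0.2), (0.2, 0.9)])
print(box)                            # ((0.1, 0.2), (0.3, 0.9))
print(contains(box, (0.15, 0.4)))     # True
```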
        <p>Often the dual power set lattice E is too large for applications. Therefore, we
concentrate on relevant patterns in D, that is, in the dual interval lattice of L.</p>
        <p>To identify important patterns in E for the red wine classification, we looked
at the positive examples of the training set and combined the results of different
clustering algorithms implemented in Python. In particular, we used a k-means
algorithm, a k-medoids algorithm (with the Mahalanobis, Euclidean and
correlation metrics), a Gaussian Mixture Model and a Bayesian Gaussian Mixture Model
to cluster the good red wines. Furthermore, we interpreted the leaves of decision
trees (with Gini impurity and entropy as splitting measures) as clusters of wines
to find important patterns in E for our case. The same clustering algorithm can
lead to different output clusters; this is a result of the different metrics which
were used to measure the distance and of the randomly chosen starting points of
the algorithms. Hence we ran every algorithm 5 times. For every attempt we
used different specifications. The number of clusters for the k-medoids, k-means,
Gaussian Mixture Model and Bayesian Gaussian Mixture Model algorithms is set randomly
between 2 and 50. For the decision trees we set the number of examples in a leaf
to at least 100. This leads to more than 700 clusters in E.</p>
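A sketch of this clustering ensemble using scikit-learn (our illustration, not the authors' code; the exact settings are assumptions, and k-medoids is omitted since it is not part of scikit-learn itself):

```python
import random
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture
from sklearn.tree import DecisionTreeClassifier

def collect_clusters(X_good, X, y, attempts=5, seed=0):
    """Cluster the good wines several times with varying settings.

    X_good: array of (scaled) feature vectors of the good training
    wines; X, y: the full training data (used for the decision trees).
    """
    rng = random.Random(seed)
    clusters = []
    for _ in range(attempts):
        k = rng.randint(2, 50)  # cluster count drawn at random in [2, 50]
        for labels in (
            KMeans(n_clusters=k, n_init=10).fit_predict(X_good),
            GaussianMixture(n_components=k).fit_predict(X_good),
            BayesianGaussianMixture(n_components=k).fit_predict(X_good),
        ):
            for c in np.unique(labels):
                clusters.append(X_good[labels == c])
        # leaves of decision trees (at least 100 examples per leaf)
        # interpreted as clusters of wines
        for criterion in ("gini", "entropy"):
            tree = DecisionTreeClassifier(
                criterion=criterion, min_samples_leaf=100).fit(X, y)
            leaves = tree.apply(X_good)
            for leaf in np.unique(leaves):
                clusters.append(X_good[leaves == leaf])
    return clusters
```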
        <p>Via the kernel operator on E:</p>
        <p>
          γ : 2L → 2L, X ↦ [infL X, supL X]L
we get patterns in D of the clusters in E. Since γ is a closure operator on the
power set 2L := (2L, ⊆), we can think of the patterns as closures of clusters.
In the next step we eliminated all clusters with fewer than 100 wines. Then we
looked at the ratio of good examples (wines with a quality score above 5) to
all examples (good and bad) in the patterns and took the five patterns with the
best ratio. These patterns are listed below. The range of the attributes is printed
in red. For better interpretability we scaled every attribute to the range [0, 1].
        </p>
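The selection step just described can be sketched as follows (our illustration; the function and parameter names are hypothetical):

```python
# Sketch of the pattern selection: close each cluster to an interval,
# drop intervals covering fewer than 100 training wines, rank the
# rest by their ratio of good wines, and keep the best five.
def top_patterns(clusters, train, n_keep=5, min_support=100):
    def gamma(X):  # interval closure: coordinatewise min/max box
        return (tuple(min(c) for c in zip(*X)),
                tuple(max(c) for c in zip(*X)))

    def contains(box, v):
        lo, hi = box
        return all(l <= x <= h for l, x, h in zip(lo, v, hi))

    scored = []
    for cluster in clusters:
        box = gamma(cluster)
        covered = [label for vec, label in train if contains(box, vec)]
        if len(covered) < min_support:
            continue                       # pattern covers too few wines
        ratio = covered.count("good") / len(covered)
        scored.append((ratio, box))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [box for ratio, box in scored[:n_keep]]
```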
        <p>Fig. 1. Interval 1: decision tree (entropy), 108 wines, 108 good and 0 bad</p>
        <p>Combining these 5 intervals to predict the classes of the test set leads to the
following confusion matrix.
The following table presents a comparison of our method to other algorithms.</p>
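The prediction rule behind the confusion matrix can be sketched like this (ours; names hypothetical): a test wine is predicted "good" if and only if it lies in at least one of the selected interval patterns.

```python
# Sketch: predict "good" iff a wine lies in at least one selected
# interval pattern; tally the confusion matrix on the test set.
def contains(box, v):
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, v, hi))

def evaluate(patterns, test):
    counts = {("good", "good"): 0, ("good", "bad"): 0,
              ("bad", "good"): 0, ("bad", "bad"): 0}
    for vec, label in test:
        pred = "good" if any(contains(b, vec) for b in patterns) else "bad"
        counts[(label, pred)] += 1
    return counts      # keys: (actual, predicted)
```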
        <p>Our method is easy to interpret and leads to the second best precision of
all listed algorithms, but the recall value is the worst among all methods. More
patterns would probably lead to a better recall, but likely worsen the precision.
Further investigations are needed to find the best collections of patterns for
different use cases (e.g. maximizing accuracy). The procedure presented here is just
one example of building a model from our framework. Hopefully, further
investigations will show that it is possible to create stronger models with our framework.
We introduced a new general framework for the application of pattern structures.
Then we gave an example of how this general framework can be used to predict the
quality of red wines. In the presented way, pattern structures can be a useful
tool in analysing data. As shown here, they are capable of giving good predictions,
and their good interpretability makes them even more powerful.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.S.</given-names>
            <surname>Blyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.F.</given-names>
            <surname>Janowitz</surname>
          </string-name>
          (
          <year>1972</year>
          ), Residuation Theory, Pergamon Press, pp.
          <fpage>1</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzmakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Napoli</surname>
          </string-name>
          (
          <year>2015</year>
          ) ,
          <article-title>Revisiting Pattern Structure Projections</article-title>
          .
          <source>Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer)</source>
          , Vol.
          <volume>9113</volume>
          , pp
          <fpage>200</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cortez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cerdeira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reis</surname>
          </string-name>
          (
          <year>2009</year>
          ),
          <article-title>Modeling wine preferences by data mining from physicochemical properties</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>47</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>547</fpage>
          -
          <lpage>553</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ganter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          (
          <year>2001</year>
          ),
          <article-title>Pattern Structures and Their Projections</article-title>
          .
          <source>Proc. 9th Int. Conf. on Conceptual Structures</source>
          , ICCS'01,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stumme and H. Delugach</surname>
          </string-name>
          (Eds.).
          <source>Lecture Notes in Artificial Intelligence (Springer)</source>
          , Vol.
          <volume>2120</volume>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          (
          <year>2011</year>
          ),
          <article-title>Some remarks on the relation between annotated ordered sets and pattern structures</article-title>
          .
          <source>Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science</source>
          (Springer), Vol.
          <volume>6744</volume>
          , pp
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaytoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Duplessis</surname>
          </string-name>
          (
          <year>2011</year>
          ),
          <article-title>Mining gene expression data with pattern structures in formal concept analysis</article-title>
          .
          <source>Information Sciences (Elsevier)</source>
          , Vol.
          <volume>181</volume>
          , pp.
          <fpage>1989</fpage>
          -
          <lpage>2001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          (
          <year>2009</year>
          ),
          <article-title>Pattern structures for analyzing complex data</article-title>
          . In H. Sakai et al. (Eds.).
          <source>Proceedings of the 12th international conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC'09). Lecture Notes in Artificial Intelligence (Springer)</source>
          , Vol.
          <volume>5908</volume>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          (
          <year>2013</year>
          ),
          <article-title>Scalable Knowledge Discovery in Complex Data with Pattern Structures</article-title>
          . In:
          <string-name>
            <given-names>P.</given-names>
            <surname>Maji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.N.</given-names>
            <surname>Murty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Pal</surname>
          </string-name>
          , (Eds.).
          <source>Proc. 5th International Conference Pattern Recognition and Machine Intelligence (PReMI'2013). Lecture Notes in Computer Science</source>
          (Springer), Vol.
          <volume>8251</volume>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>L.</given-names>
            <surname>Lumpe</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          (
          <year>2015</year>
          ),
          <source>A Note on Pattern Structures and Their Projections. Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer)</source>
          , Vol.
          <volume>9113</volume>
          , pp
          <fpage>145</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>L.</given-names>
            <surname>Lumpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          (
          <year>2016</year>
          ),
          <source>Morphisms Between Pattern Structures and Their Impact on Concept Lattices, FCA4AI@ ECAI</source>
          <year>2016</year>
          , pp
          <fpage>25</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>L.</given-names>
            <surname>Lumpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          (
          <year>2015</year>
          ),
          <article-title>Pattern Structures and Their Morphisms</article-title>
          .
          <source>CLA</source>
          <year>2015</year>
          , pp
          <fpage>171</fpage>
          -
          <lpage>179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>L.</given-names>
            <surname>Lumpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          (
          <year>2016</year>
          ),
          <article-title>Viewing Morphisms Between Pattern Structures via Their Concept Lattices</article-title>
          and via Their Representations,
          <source>International Symposium on Methodologies for Intelligent Systems</source>
          <year>2017</year>
          , pp.
          <fpage>597</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Jun</surname>
          </string-name>
          (
          <year>2009</year>
          ),
          <article-title>A simple and fast algorithm for K-medoids clustering</article-title>
          .
          <source>Expert systems with applications 36</source>
          (
          <issue>2</issue>
          ), pp.
          <fpage>3336</fpage>
          -
          <lpage>3341</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>