Patterns via Clustering as a Data Mining Tool

Lars Lumpe¹ and Stefan E. Schmidt²
Institut für Algebra, Technische Universität Dresden
larslumpe@gmail.com¹, midt1@msn.com²

Abstract. Research shows that pattern structures are a useful tool for analyzing complex data. In this paper, we present a new general framework for using pattern structures as a data mining tool and, as an application of the new framework, we show how to handle a classification problem for red wines.

1 Introduction

Pattern structures within the framework of formal concept analysis were introduced in [4]. Since then they have been the subject of further investigations, as in [2, 9, 10, 11, 12], and have turned out to be a useful tool for analyzing various real-world applications (cf. [4-8]). In this paper we present a new application. First we introduce a general way to construct a pattern structure. Then we use several clustering algorithms to find important patterns in it, and we use these patterns to build a model for a classification problem. In particular, we take the dataset from [3] and train an algorithm to predict the quality of red wines.

2 Preliminaries

We start with some general definitions.

Definition 1 (restriction). Let P := (P, ≤_P) be a poset. Then for every set U the poset P|U := (U, ≤_P ∩ (U × U)) is called the restriction of P onto U.

If we consider a poset of patterns, it often arises as the dual of a given poset.

Definition 2 (opposite or dual poset). Let P := (P, ≤_P) be a poset. Then we call P^op := (P, ≥_P) with ≥_P := {(q, p) ∈ P × P | p ≤_P q} the opposite or dual of P.

Definition 3 (interval poset). Let P := (P, ≤_P) be a poset. Then we call IntP := {[p, q]_P | p, q ∈ P} with [p, q]_P := {t ∈ P | p ≤_P t ≤_P q} the set of all intervals of P, and we refer to IntP := (IntP, ⊆) as the interval poset of P.

Remark:
(i) Let P be a poset. Then IntP is a lower bounded poset, that is, ∅ is the least element of IntP, provided P has at least 2 elements. Furthermore, if P is a (complete) lattice, then so is IntP.
(ii) If A is a set of attributes, then (ℝ^A, ≤) := (ℝ, ≤)^A is a lattice. With (i) it follows that Int(ℝ^A, ≤) is a lower bounded lattice.

Definition 4 (kernel operator). A kernel operator on a poset P := (P, ≤) is a map γ : P → P such that for all x, y ∈ P:

    γx ≤ y  ⇔  γx ≤ γy.     (1)

A subset ζ of P is called a kernel system in P if for every x ∈ P the restriction of P onto {t ∈ ζ | t ≤ x} has a greatest element.

Remark: A closure operator on P := (P, ≤) is defined as a kernel operator on P^op, and a closure system in P is defined as a kernel system in P^op.

The main definitions of FCA are based on a binary relation I between a set of so-called objects G and a set of so-called attributes M. However, in many real-world knowledge discovery problems, researchers have to deal with data sets that are much more complex than binary data tables. In our case, we are given a set of numerical attributes, such as the amount of acetic acid, density, the amount of salt, etc., describing the quality of a red wine. To deal with this kind of data, pattern structures are a useful tool.

Definition 5 (pattern setup, pattern structure). A triple P = (G, D, δ) is a pattern setup if G is a set, D = (D, ⊑) is a poset, and δ : G → D is a map. In case every subset of δG := {δg | g ∈ G} has an infimum in D, we refer to P as a pattern structure.
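To make Definition 5 concrete, the following is a minimal Python sketch of the kind of pattern structure used later in this paper: objects are described by numeric vectors, patterns are boxes (intervals of (ℝ^A, ≤)), and the infimum of a set of patterns in the dually ordered interval lattice is the smallest box containing all of them. The names (Box, delta, meet, contains) and the toy data are our own illustration, not part of the formal development.

```python
from typing import Dict, List, Tuple

# A pattern is a "box": a closed interval [lo, hi] for every attribute.
Box = Dict[str, Tuple[float, float]]

def delta(description: Dict[str, float]) -> Box:
    """Map an object g to its degenerate box pattern: the interval [v, v] per attribute."""
    return {m: (v, v) for m, v in description.items()}

def meet(boxes: List[Box]) -> Box:
    """Infimum in (IntL, ⊇): the smallest box containing all given boxes."""
    attrs = boxes[0].keys()
    return {m: (min(b[m][0] for b in boxes), max(b[m][1] for b in boxes)) for m in attrs}

def contains(box: Box, description: Dict[str, float]) -> bool:
    """Check whether an object's description lies inside a box pattern."""
    return all(lo <= description[m] <= hi for m, (lo, hi) in box.items())

# Toy data: two "wines" described by two numeric attributes (values are illustrative).
g1 = {"alcohol": 9.4, "pH": 3.51}
g2 = {"alcohol": 11.2, "pH": 3.16}
pattern = meet([delta(g1), delta(g2)])   # {'alcohol': (9.4, 11.2), 'pH': (3.16, 3.51)}
print(pattern, contains(pattern, {"alcohol": 10.0, "pH": 3.3}))
```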
For pattern structures, an important complexity reduction is often provided by so-called o-projections:

Proposition 1 (o-projection). For a pattern structure P := (G, E, ε) and a kernel operator κ on E, the triple opr(P, κ) := (G, D, δ) with D := E|D where D := κE and δ : G → D, g ↦ κ(εg) is a pattern structure, called the o-projection of P via κ (see [2]).

3 Construction of a Pattern Structure

The following definition establishes the connection between pattern setups and pattern structures.

Definition 6 (embedded pattern structure). Let E be a complete lattice and P := (G, D, δ) a pattern setup with D := E|D. Then we call P_e := (G, E, D, δ) with δ : G → E, g ↦ δg the embedded pattern structure. We say the pattern setup P is embedded in the pattern structure (G, E, δ).

This definition shows that it is possible to build a pattern structure from a given pattern setup. The construction below is another demonstration of how to build a pattern structure from a pattern setup.

Construction 1. Let G be a set and let L := (L, ≤) be a poset. Then for every map ϱ : G → L the elementary pattern structure is given by

    P_ϱ := (G, E, ε)  with  E := (2^L, ⊇)  and  ε : G → 2^L, g ↦ {ϱg}.

Hence, the pattern setup (G, L, ϱ) is embedded in the pattern structure (G, E, ε). In many cases this construction leads to a large set of patterns. Therefore we need the following: Let ζ be a closure system in 2^L := (2^L, ⊆), that is, ζ is a kernel system in E, and let γ : 2^L → 2^L be the associated closure operator of ζ w.r.t. 2^L. Thus, γ is a kernel operator on E. Then (G, D, δ) is a pattern structure for D := E|ζ and δ : G → D, g ↦ γ{ϱg}. Indeed, the map ψ : 2^L → ζ, X ↦ γX is a residual map from E to D with δ = ψ ∘ ε. By the above proposition, (G, D, δ) is a pattern structure, since opr(P_ϱ, γ) = (G, D, δ) is the o-projection of P_ϱ via γ.

4 Connection to Data Mining and to our Dataset

In this section we describe how we use Construction 1 to handle a typical data mining classification problem: we train a model to predict classes of red wines. But first we give an insight into our data.

4.1 Red Wine Dataset

To apply our previous results to a public data set, we chose the red wine data set from [3]. It contains 1599 examples of wines, described by 11 numerical attributes. The input consists of objective tests (e.g. fixed acidity, sulphates, pH value, residual sugar, chlorides, density, alcohol, ...) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). For our purpose, we established a binary distinction: every wine with a quality score above 5 is classified as "good" and all others as "bad". This led to a set of 855 positive and 744 negative examples. We split the data into a training set (75% of the examples) and a test set (25% of the examples).

4.2 Describing the Procedure

Many data mining relevant data sets (like the red wine dataset) can be described via an evaluation matrix:

Definition 7 (evaluation map, evaluation setup). Let G be a finite set, M a set of attributes and 𝕎_m := (W_m, ≤_m) a complete lattice for every attribute m ∈ M. Further, let

    𝕎 := ∏_{m∈M} 𝕎_m  and  W := ∏_{m∈M} W_m.

Then a map

    α : G → W, g ↦ (α_m g)_{m∈M}

with α_m : G → W_m, g ↦ w_m, is called an evaluation map. We call 𝔼 := (G, M, 𝕎, α) an evaluation setup.
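For tabular data such as the red wine table, an evaluation setup can be read off directly from the rows and columns. The following is a hedged sketch of this reading (the file path, the ";" delimiter and the column name "quality" reflect the publicly available version of the data set from [3] and are our assumptions, not prescribed by the framework):

```python
import csv

def load_evaluation_setup(path: str):
    """Read a wine table and return (G, M, alpha, labels), where alpha[g][m] is the
    value of attribute m for wine g -- a concrete instance of an evaluation map."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter=";"))    # assumed ';'-separated CSV
    M = [m for m in rows[0].keys() if m != "quality"]    # attributes (quality is the target)
    G = list(range(len(rows)))                           # objects = row indices
    alpha = {g: {m: float(rows[g][m]) for m in M} for g in G}
    labels = {g: int(rows[g]["quality"]) > 5 for g in G} # "good" iff quality score above 5
    return G, M, alpha, labels
```

Here each 𝕎_m is simply the numerical range of attribute m with its natural order, so W = ∏_{m∈M} W_m, as in the example below.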
Example 1. In the wine data set from [3] we can interpret the wines as the set G, the describing attributes as the set M, and 𝕎_m as the numerical range of attribute m with the natural order.

In the above example the evaluation map

    α : G → W

assigns to every wine bottle the values of all attributes m ∈ M. This map is a good starting point for an elementary pattern structure with

    𝕃 := ∏_{m∈M} 𝕎_m  and  L := ∏_{m∈M} W_m.

Thus, E := (2^L, ⊇) is the dually ordered power set of the vectors of attribute values which describe the wines. On E we installed the following kernel operator:

    γ : 2^L → 2^L, X ↦ [inf_𝕃 X, sup_𝕃 X]_𝕃.

This leads to the o-projection of the elementary pattern structure P_ϱ via γ, that is, (G, D, δ) := opr(P_ϱ, γ). As a matter of fact, D = (D, ⊇) with D = Int𝕃 is the dual interval lattice of 𝕃, and the map δ is given by δ : G → D, g ↦ {ϱg}.

Often the dual power set lattice E is too large for applications. Therefore, we concentrate on relevant patterns in D, that is, in the dual interval lattice of 𝕃. To identify important patterns in E for the red wine classification, we looked at the positive examples of the training set and combined the results of different clustering algorithms implemented in Python. In particular, we used the k-means algorithm, the k-medoids algorithm (with the Mahalanobis, Euclidean and correlation metrics), a Gaussian Mixture Model and a Bayesian Gaussian Mixture Model to cluster the good red wines. Furthermore, we interpreted the leaves of decision trees (with Gini impurity and entropy as splitting measures) as clusters of wines to find important patterns in E for our case. The same clustering algorithm can lead to different output clusters; this is a result of the different metrics used to measure distance and of the randomly chosen starting points of the algorithms. Hence we ran every algorithm 5 times, with different specifications for every attempt. The number of clusters for the k-medoids, k-means, Gaussian Mixture Model and Bayesian Gaussian Mixture Model algorithms was set randomly between 2 and 50. For the decision trees we required at least 100 examples per leaf. This leads to more than 700 clusters in E.

Via the kernel operator on E,

    γ : 2^L → 2^L, X ↦ [inf_𝕃 X, sup_𝕃 X]_𝕃,

we obtain patterns in D from the clusters in E. Since γ is a closure operator on the power set 2^L := (2^L, ⊆), we can think of these patterns as closures of clusters. In the next step we eliminated all clusters with fewer than 100 wines. Then we looked at the ratio of good examples (wines with a quality score above 5) to all examples (good and bad) contained in each pattern and took the five patterns with the best ratio. These patterns are listed below; the range of each attribute is printed in red, and for better interpretability we scaled every attribute to the range [0, 1]. A simplified code sketch of the whole procedure is given after the figures.

Fig. 1. Interval 1: decision tree (entropy), 108 wines, 108 good and 0 bad
Fig. 2. Interval 2: decision tree (entropy), 158 wines, 158 good and 0 bad
Fig. 3. Interval 3: decision tree (Gini), 128 wines, 127 good and 1 bad
Fig. 4. Interval 4: k-medoids (Mahalanobis), 163 wines, 160 good and 3 bad
Fig. 5. Interval 5: k-medoids (Mahalanobis), 216 wines, 209 good and 7 bad

Combining these 5 intervals to predict the classes of the test set leads to the confusion matrix shown in Fig. 6.

Fig. 6. Combination of the 5 intervals
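The sketch below summarizes the mining and prediction steps described above under simplifying assumptions: scikit-learn's k-means stands in for the whole ensemble of clustering algorithms, the helper names (interval_closure, mine_patterns, predict) are ours, the minimum cluster size of 100 and the selection of the five best patterns follow the text, and predicting "good" whenever a wine falls into at least one selected interval is one natural reading of the combination step.

```python
import numpy as np
from sklearn.cluster import KMeans

def interval_closure(X):
    """Kernel operator gamma restricted to a finite set of vectors: [inf X, sup X]."""
    return X.min(axis=0), X.max(axis=0)

def in_box(box, x):
    lo, hi = box
    return np.all(lo <= x) and np.all(x <= hi)

def mine_patterns(X_train, y_train, n_clusters=30, min_size=100, top=5):
    """Cluster the good wines (y_train is a boolean array, True = good), close every
    cluster to a box, keep only boxes with at least `min_size` training wines, and
    return the `top` boxes with the best ratio of good wines to all wines inside."""
    good = X_train[y_train]
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(good)
    scored = []
    for c in np.unique(labels):
        box = interval_closure(good[labels == c])
        inside = np.array([in_box(box, x) for x in X_train])
        if inside.sum() < min_size:
            continue
        ratio = y_train[inside].mean()          # share of good wines inside the box
        scored.append((ratio, box))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [box for _, box in scored[:top]]

def predict(patterns, X):
    """Classify as good iff the description falls into at least one selected pattern."""
    return np.array([any(in_box(b, x) for b in patterns) for x in X])
```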
The following table presents a comparison of our method with other algorithms. Our method is easy to interpret and yields the second best precision of all listed algorithms, but its recall is the worst among all methods. More patterns would probably lead to a better recall, but would likely worsen the precision. Further investigations are needed to find the best collections of patterns for different use cases (e.g. maximizing accuracy). The procedure presented here is just one example of building a model within our framework; we hope that further investigations will show that it is possible to create stronger models with it.

Model                 Accuracy  Precision  Recall  F1-score
cluster pattern       71.5%     77.1%      65.5%   70.7%
K-nearest neighbors   74.2%     73.0%      81.0%   76.9%
Support Vector        73.5%     75.6%      73.4%   74.5%
Naive Bayes           70.0%     69.2%      77.8%   73.2%
Logistic regression   73.2%     73.6%      76.8%   75.2%
Random Forest         78.0%     80.0%      77.7%   78.8%

Table 1. Comparison of different classifier algorithms

5 Conclusion

We introduced a new general framework for the application of pattern structures and gave an example of how this framework can be used to predict the quality of red wines. Used in the presented way, pattern structures can be a useful tool for analyzing data: as shown here, they are capable of giving good predictions, and their good interpretability makes them even more powerful.

References

1. T. S. Blyth, M. F. Janowitz (1972), Residuation Theory, Pergamon Press, pp. 1-382.
2. A. Buzmakov, S. O. Kuznetsov, A. Napoli (2015), Revisiting Pattern Structure Projections. Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer), Vol. 9113, pp. 200-215.
3. P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis (2009), Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47(4), pp. 547-553.
4. B. Ganter, S. O. Kuznetsov (2001), Pattern Structures and Their Projections. Proc. 9th Int. Conf. on Conceptual Structures, ICCS'01, G. Stumme and H. Delugach (Eds.). Lecture Notes in Artificial Intelligence (Springer), Vol. 2120, pp. 129-142.
5. T. B. Kaiser, S. E. Schmidt (2011), Some remarks on the relation between annotated ordered sets and pattern structures. Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science (Springer), Vol. 6744, pp. 43-48.
6. M. Kaytoue, S. O. Kuznetsov, A. Napoli, S. Duplessis (2011), Mining gene expression data with pattern structures in formal concept analysis. Information Sciences (Elsevier), Vol. 181, pp. 1989-2001.
7. S. O. Kuznetsov (2009), Pattern structures for analyzing complex data. In H. Sakai et al. (Eds.). Proceedings of the 12th international conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC'09). Lecture Notes in Artificial Intelligence (Springer), Vol. 5908, pp. 33-44.
8. S. O. Kuznetsov (2013), Scalable Knowledge Discovery in Complex Data with Pattern Structures. In: P. Maji, A. Ghosh, M. N. Murty, K. Ghosh, S. K. Pal (Eds.). Proc. 5th International Conference Pattern Recognition and Machine Intelligence (PReMI'2013). Lecture Notes in Computer Science (Springer), Vol. 8251, pp. 30-41.
9. L. Lumpe, S. E. Schmidt (2015), A Note on Pattern Structures and Their Projections. Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer), Vol. 9113, pp. 145-150.
10. L. Lumpe, S. E. Schmidt (2016), Morphisms Between Pattern Structures and Their Impact on Concept Lattices. FCA4AI@ECAI 2016, pp. 25-34.
11. L. Lumpe, S. E. Schmidt (2015), Pattern Structures and Their Morphisms. CLA 2015, pp. 171-179.
12. L. Lumpe, S. E. Schmidt (2016), Viewing Morphisms Between Pattern Structures via Their Concept Lattices and via Their Representations. International Symposium on Methodologies for Intelligent Systems 2017, pp. 597-608.
13. H. S. Park, C. H. Jun (2009), A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications 36(2), pp. 3336-3341.