Exploring the Dataset Structure by Means of Delta-Classes of Equivalence. The Case of the Titanic Dataset ? Aleksey Buzmakov1 , Sergei O. Kuznetsov1 , Tatyana Makhalova2 , and Amedeo Napoli1,2 1 National Research University Higher School of Economics, Russia 2 LORIA (CNRS – Inria NGE – University of Lorraine), Vandœuvre-lès-Nancy, France {avbuzmakov,skuznetsov}@hse.ru, tatiana.makhalova@inria.fr, amedeo.napoli@loria.fr Abstract. Being able to have a quick look at a dataset is essential in many applications. One way is to summarize the dataset by means of a small set of patterns. In this paper we suggest defining such set of pat- terns as the closed elements of delta-classes of equivalence. This approach allow us to propose an overview of a dataset and then, if necessary, any delta-class of equivalence can be expanded to provide more detailed in- formation about a certain part of the dataset. To demonstrate our proposal we deeply studied the Titanic dataset about survival of passengers and showed the connections between the passenger attributes and a possible dataset summary in terms of patterns. Keywords: FCA · Δ-closure · Δ-concepts · use case · Δ-implications. 1 Introduction In this paper, we are interested in pattern or itemset mining in tabular data. There is a considerable amount of work on many aspects of this subject, espe- cially regarding algorithms and search for interesting patterns [1]. One recurrent problem in itemset mining is the exponential number of resulting itemsets. Focus- ing on closed itemsets allows a significant reduction of this number by replacing a whole class of itemsets with the largest one, i.e., the closed itemset, which has the same support [7]. Nowadays, there are very efficient algorithms for comput- ing frequent closed itemsets [6], even for low frequency thresholds. However, the efficient generation of closed itemsets only partially solves the problem of the exponential explosion of itemsets, since the main difficulties appear afterwards, when the generated itemsets are processed. An alternative to the exhaustive enumeration of itemsets is based on “sam- pling” [4] and on a gradual search for itemsets according to an interestingness measure or a set of constraints [8]. Such algorithms usually result in a rather ? The reported study was funded by RFBR, project number 20-31-70047 Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). small set of itemsets while they may provide only an approximate solution. Al- though both approaches use quite different techniques, they rely on the same assumption, namely that the “internal or intrinsic structure” of the dataset un- der study can be understood by means of subset of selected itemsets. Then, in each approach a particular set of itemsets is returned, which provides a “multifaceted view” of the intrinsic structure underlying the data. Our approach is based on Δ-classes of equivalence, the generalization of stan- dard classes of equivalence based on closure operator. A user-set parameter ∆ measures how much a closed set can differ from its upper neighbors in the partial order of closed sets. A ∆-class of equivalence allows one to characterize the distribution underly- ing the data, i.e., when ∆ is large, there are only a few ∆-classes of equivalence whose elements are very stable, while when ∆ is small, the number of ∆-classes increases and the related information becomes less stable. Moreover, the ∆- classes of equivalence are very stable for large ∆ and do not significantly depend on the data sampling used for the analysis. In this paper we study ∆-classes for Titanic dataset. In particular, we show what kind of conclusions w.r.t. the passengers of Titanic can be made. Further we study what kind of information can be stored in “implications” of the Δ-classes of equivalence. In particular, such implications can show what information is usually associated with a set of attributes. The paper has the following structure. First we introduce basic definitions related to generalized closure operator. Then in Section 3 we evaluate this closure operator on Titanic dataset. 2 Δ-classes of equivalence The proposed approach is introduced in terms of Formal Concept Analysis (FCA) [5] and the following notation. A formal context is a triple (G, M, I), where G and M are sets of objects and attributes correspondingly and I ⊆ G×M is a relation between them. Derivation operator is denoted with arrows in order to clearly show the range and domain of the corresponding mappings: A↑ = {m ∈ M | (∀g ∈ A)(g, m) ∈ I}, A ⊆ G (1) ↓ B = {g ∈ G | (∀m ∈ B)(g, m) ∈ I}, B ⊆ M (2) We should note that operators (·)↑↓ and (·)↓↑ form closure operators on 2G and 2M , respectively. Hereafter, we focus on the operator (·)↓↑ and its general- izations. Let us reformulate the definition in terms of “instance counting”. Let B be any set of attributes. Definition 1. An attribute set B is closed iff (∀m ∈ M \B)|(B ∪{m})↓ | = 6 |B ↓ |. It can be seen that (·)↓↑ maps any set of attributes to a closed set of attributes. Let us reformulate Definition 1 in the following equivalent way. B is closed iff (∀m ∈ M \ B)[|B ↓ | − |(B ∪ {m})↓ | ≥ 1]. This form allows changing the threshold 1 at the end of the formula to any other positive value. The larger this value, the less attribute sets would pass a test similar to the one from Definition 1. This leads to the definition of Δ-closedness of an attribute set [2]. Definition 2. A set of attributes B is called Δ-closed if for any m ∈ M : |B ↓ | − |(B ∪ {m})↓ | ≥ ∆ ≥ 1. (3) In [2] it was shown that Δ-closedness is a closure operator. In particular it means that from a computational point of view, given a non Δ-closed set of attributes B, i.e., ∃m ∈ M (|B ↓ | − |(B ∪ {m})↓ < ∆), it can be closed by iteratively changing B to B ∪ {m}, for any m violating (3) until such attribute is not found. The corresponding closure operator is denoted by (·)∆ . Since it is a closure operator, it divides all sets of attributes 2M into classes of equivalences having the same closure. Definition 3. Given an attribute set B, its equivalence class Equiv∆ (B) is the set of all attribute sets with the closure equal to the closure of B, i.e., Equiv∆ (B) = {X ⊆ M | (X)∆ = (B)∆ }. (4) Moreover, since according to Definitions 1 and 2 if a set of attributes is Δ-closed than it is necessary closed. Thus, these Δ-classes of equivalence are joins of several closure-based classes of equivalence. It allows introducing a new derivation operator related to Δ-closure. A↑∆ = (A↑ )∆ , A ⊆ G (5) ↓∆ ∆↓ B = (B) ,B ⊆ M (6) The new Δ-derivation operator allows defining Δ-concepts ordered within a lattice in the similar way to the classical formal concepts. Moreover, since any Δ-closed set of attributes is a closed set of attributes, than the set of Δ-concepts are subset of formal concepts and the order of Δ-concepts is a suborder of the corresponding formal lattice. Finally, we should discuss Δ-implications since they provide a useful tool for finding associations between attribute sets. Definition 4. A rule A → B is called Δ-implication if B = A∆ \ A, i.e., A and ∆ A ∪ B are from the same Δ-class of equivalence. Since Δ-closure can change the support of the attribute set a Δ-implication is not necessary an implication. However, the set of Δ-implications is subset of all association rules, thus they are more easy analysable. In particular, given an implication A → B, the set B is the set of attributes that are associated with A in most samples from the underlying distribution. In the next section we experimentally show how such lattices of Δ-concepts and the corresponding implications can be used for data analysis. Moreover, since Δ-concepts can be found in polynomial time3 [3], such analysis is suitable for processing really big data. 3 Evaluation 3.1 Dataset In this paper we use Titanic dataset downloaded (train dataset) from Kaggle4 . This is one of the most known datasets with easily interpretable patterns that do not require deep diving into the domain knowledge. The dataset describes 891 passengers of the last Titanic ship travel. Every passenger is described with name, age, sex, the number of parents and/or children and the number of spouse and/or siblings travelled together with the passenger. The ticket price and the ticket class is also known as well as the survival state of the passenger after the Titanic shipwreck. All numerical data is divided into 5 percentiles and then inter-ordinal scaling is used on top of these percentiles. For example, for the quantity Age it is known from the data that 15 of the passengers were below 19 years old, the next 51 between 19 and 25, then between 25 and 32 and then between 32 and 41 and finally the last 15 of the passengers were above 41 years old. Then new binary attributes are formed based on these limits (19, 25, 32,41), these eight attributes are “Age ≥ 19,” “Age ≤ 19”. “Age ≥ 25,” “Age ≤ 25”. “Age ≥ 32,” “Age ≤ 32”. “Age ≥ 41,” “Age ≤ 41”. Additionally, from the ”Name” field the social status is extracted, including ”Mr”, ”Mrs”, ”Master”, etc. It makes in total 49 attributes. 3.2 Concept lattice navigation Even for such relatively simple data the total number of concepts is 9002. It is not hard to build such a lattice. However, analysis of the lattice is quite hard. It is hardly possible to draw the whole lattice and the only way is to navigate it from the top or from the bottom concepts. However, ∆-classes of equivalence give another means for such analysis and navigation. Let us first increase the ∆ threshold for the lattice. If ∆ = 90, then the lattice size is only 11 and it can be drawn. It is shown in Figure 1. For every concept the corresponding extent size and ∆-measure are shown. Every attribute is shown outside of the concept with an arrow attached to the concept of the first attribute entry. It can be seen that the lattice involves only 9 attributes out of 49. All other attributes are attached to the BOTTOM concept and are not shown. It 3 Being more precise the enumeration procedure can be set in such a way, that it fin- ishes in input-polynomial time. It is achieved by iteratively increasing the threshold θ if the number of the already found patterns is too large. The result is the set of all patterns with ∆ ≥ θ. However, if θ is automatically set to be too high, the procedure still finishes in input-polynomial time but the result set is empty. 4 https://www.kaggle.com/c/titanic Fig. 1. The lattice of concepts with ∆ ≥ 90 for Titanic dataset. means that starting for any attribute a from this set, one has ∆({a}↓ ) < 90, i.e., smaller than the threshold. For example, the attribute fare ≥ 8 is introduced to the lattice at the very top concept. It means that |∅↓ | − |{fare ≥ 8}↓ | < 90, i.e., for most of objects fare ≥ 8. Moreover, since the top concept in this lattice has the maximal value of ∆, its extent is the most typical extent for all passengers. Since we cannot make it more precise without excluding less objects than 175. Then we can see that there are 5 concepts below the top concept. They are the only concepts that are significantly different from their children. Accordingly, this concept lattice for ∆ = 90 shows the structure of different groups of passengers (formal concepts with Δ-closure for ∆ = 90. A concept is only shown if there is no more precise description that contains similar number of objects. This setting allows finding and prioritizing the association rules related to a certain set of attributes. 3.3 Δ-Association rules Let us study female subpopulation of the passengers. In particular, we can try to Δ-close the following description sex = female. What Δ should be used? One of the reasonable choice is the maximal Δ such that Δ-closure of sex = female is different from M . For example in Figure 1 no concept with sex = female is found. It means that for ∆ ≥ 90, {sex = female}↓↑∆ = M . If the whole lattice is available it corresponds to the concept with the maximal Δ-measure such that the intent of the concept is a superset of {sex = female}. In the form of Δ-implication it can be written as5 : {sex = female} → {fare ≥ 10}. ∆ It suggests that women rarely buy cheapest tickets. Moreover, since we do not find class ≤ 2 attribute, it means that they can afford the 3rd class, but nevertheless not the cheapest tickets. What about men? {sex = male} → {title = Mr, fare ≥ 8}. ∆ Now we see that the preference for more expensive tickets is missing. However, we see that most of the men are titled ”Mr”. It is not the classical closure assuming that all men are Mr. Indeed, the correspondence between the title and sex is shown in Table 1. Table 1. Correspondence between passenger sex and title Miss Mrs Mr Master Other female 182 128 0 0 4 male 0 0 525 40 12 Let us dive deeper into the title. What do we know about Master-title? {title = Master} → {sex = male, age ≤ 19, fare ≥ 10, class ≥ 2}. ∆ We can see, that Master corresponds to young men from 2nd and 3rd class but not with the cheapest tickets. It is a quite strange combination and a deeper investigation is needed. However, it is not an artefact of the procedure. If we check the original dataset all findings are supported, i.e., they are mostly from the 2nd and 3rd class, but not so cheap. The proposed procedure was useful here only for highlighting such finding. In contrast, for Miss-title no new information is found. Let us now formulate a question about women that had cheap tickets: {sex = female, fare ≤ 10} → {title = Miss, class = 3, fare ≤ 8}. ∆ In fact if decision is to travel cheap, then it is the cheapest option. Similar, answer will be given for men traveled cheap. However, the group of men that traveled cheep is about 7 times larger than the group of women. Thus, if we would have just requested ”who traveled cheap”, than the result would be the group of men. Finally, let us ask how age affects the travel behavior. {age ≤ 25} → {class ≥ 2, 8 ≤ fare ≤ 40}. ∆ 5 For simplicity, the attribute sets are shown with reduction, i.e., if by knowing that fare ≥ 10 we can conclude that fare ≥ 8, the last attribute is not shown. Thus, young people usually travel in the 2nd or 3d class. Similarly, for people aged more than 41 years. {age ≥ 41} → {class ≤ 2, fare ≥ 22}, ∆ i.e., such people usually prefer the 1st or 2nd class and the fare is more than 22. Let us finally show that we can be interested also in combinations of at- tributes. For example, what about women of age more than 41 years? {age ≥ 41, sex = female} → {class = 1, fare ≥ 40, title = Mrs} ∆ So in contrast to generally aged people, women usually travel in the 1st class and they are titled Mrs. 4 Conclusion Based on Titanic dataset Δ-classes of equivalence are shown to be useful for ex- ploratory data analysis. In particular, it can be used to systematize the dataset and to prioritize association rules related to certain requests, e.g., every at- tribute can be associated with the most frequent attributes taken into account co-occurrences of the attributes. Since, Δ-classes of equivalence can be found in polynomial time [3], such approach is suitable for very large datasets. More- over, such approach focuses on the distribution, where the dataset is taken from, rather than the dataset itself. References 1. Aggarwal, C.C., Han, J.: Frequent pattern mining. Springer (2014) 2. Boley, M., Horváth, T., Wrobel, S.: Efficient discovery of interesting patterns based on strong closedness. Statistical Analysis and Data Mining: The ASA Data Science Journal 2(5-6), 346–360 (2009) 3. Buzmakov, A., Kuznetsov, S.O., Napoli, A.: Fast generation of best interval patterns for nonmonotonic constraints. In: Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 157–172. Springer (2015) 4. Dzyuba, V., van Leeuwen, M., De Raedt, L.: Flexible constrained sampling with guarantees for pattern mining. Data Mining and Knowledge Discovery 31(5), 1266– 1293 (2017) 5. Ganter, B., Wille, R.: Formal concept analysis–mathematical foundations (1999) 6. Hu, Q., Imielinski, T.: Alpine: Progressive itemset mining with definite guarantees. In: Proceedings of the SIAM International Conference on Data Mining. pp. 63–71. SIAM (2017) 7. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient mining of association rules using closed itemset lattices. Information systems 24(1), 25–46 (1999) 8. Smets, K., Vreeken, J.: Slim: Directly mining descriptive patterns. In: Proceedings of the 12 SIAM International Conference on Data Mining, Anaheim, California. pp. 236–247 (2012)