Introduction

Homogeneity and Stability in Conceptual Analysis

Paula Brito

mpbrito@fep.up.pt 0

Geraldine Polaillon

geraldine.polaillon@supelec.fr 1 0 Faculdade de Economia & LIAAD/INESC-Porto L.A., Universidade do Porto Rua Dr. Roberto Frias , 4200-464 Porto , Portugal 1 SUPELEC Science des Systemes (E3S) - Departement Informatique Plateau de Moulon , 3 rue Joliot Curie, 91192 Gif-sur-Yvette cedex , France

This work comes within the eld of data analysis using Galois lattices. We consider ordinal, numerical single or interval data as well as data that consist on frequency/probability distributions on a nite set of categories. Data are represented and dealt with on a common framework, by de ning a generalization operator that determines intents by intervals. In the case of distribution data, the obtained concepts are more homogeneous and more easily interpretable than those obtained by using the maximum and minimum operators previously proposed. The number of obtained concepts being often rather large, and to limit the in uence of atypical elements, we propose to identify stable concepts using interval distances in a cross validation-like approach.

Introduction

This work concerns multivariate data analysis using Galois concept lattices. Let E = f!1; : : : ; !ng be the set of elements to be analyzed, described by p variables Y1; : : : ; Yp. In this paper we consider the speci c case where the variables Yj are numerical (real or interval-valued), ordinal and modal. Modal variables allow associating with each element of E a probability/frequency distribution on an underlying nite set of categories (see [ 9 ]).

The use of Galois lattices in Data Analysis was rst introduced by Barbut and Monjardet, in the seventies of last century [ 2 ] and then further developed and largely spread out by the work of R. Wille and B. Ganter (see, e.g., [ 6 ]). Let (A; 1) and (B; 2) be two ordered sets. A Galois connection is a pair (f; g), where f is a mapping f : A ! B, g is a mapping g : B ! A, such that f and g are antitone, and both h = gof and h0 = f og are extensive; h and h0 are then closure operators. The mapping f de nes the intent of a set S E, and the mapping g that allows obtaining the extent in E associated with a set of attributes T O, where O is the set of the considered (binary) attributes. The couple (f; g) then constitutes a Galois connection between (P (E); ) and (P (O); ). A concept is de ned as a couple (S; T ) where S E; T O; S = g(T ) and T = f (S), i.e., we have h(S) = S; S is the extent of the concept and T its intent. This approach has been applied to non-binary variables, but in this case data are generally submitted to a previous \binarization", by performing a binary coding of the data array; for numerical or ordinal variables Y , attributes of the form \Y for each observed value x, are considered. x,"

In [ 3 ] this approach has been extended by de ning directly the intent of a set of elements; which has allowed obtaining, for each variable type (classical or otherwise) appropriate couples of mappings (f; g) forming a Galois connection. This has the advantage of allowing analyzing the data directly as it is presented, without imposing any sort of binary pre-coding, which may, and generally does, drastically increase the size of the data array to be analyzed. Galois lattices where intents are obtained by union and by intersection are obtained. This approach has been further extended to modal variables (see [ 4 ]). The case of ordinal variables has been dealt with in [ 11 ], using an approach similar to that of [ 4 ] for modal variables.

Ganter and Kuznetsov [ 5 ] proposed a general construction, called pattern structures, which allows for arbitrary descriptions with a semilattice operation on them; since union and intersection of intervals de ne semilattices, they make respective pattern structures. An application on gene expression data is presented in [ 7 ].

Here, we consider a common framework for numerical (real or interval-valued), ordinal and modal variables, by de ning a generalization operator that determines the intent in the form of vectors of intervals. For ordinal and modal (i.e., distribution-valued) variables the obtained concepts are more homogeneous and therefore easier to interpret than those obtained by applying the minimum and maximum operators, as previously proposed. In the next sections, we detail how generalization of a set of elements is performed for each variable type.

The number of obtained concepts being often rather large, we propose to identify stable concepts (see also [ 8 ] and [ 12 ]), using distances designed for interval data. The criteria is that the intent of a concept should not be too di erent from those obtained by sequentially removing one element of the extent at a time - which would reveal that this particular element is provoking a drastic change in the concepts' intent. Should it occur, the concept would be considered to be non-stable.

In the case of multi-valued data, other approaches of lattice reduction, directly applied to the concept lattice, have been proposed in [ 1 ] and [ 10 ]. These two approaches rely on the same idea of merging together similar attribute values (in respect to a given threshold), and thereby reducing the number of concepts.

The remainder of the paper is organized as follows. Section 2 describes the generalization procedure for real and interval-valued variables, which is extended in Section 3 to modal variables. In Section 4 a common generalization approach by vectors of intervals is presented. In Section 5 the problem of concept stability is considered, and a method using interval distances is proposed, which allows addressing the question of lattice reduction. Section 6 concludes the paper, opening paths for future research.

Real and interval-valued variables

Let E = f!1; :::; !ng be the set of n elements or objects to be analyzed, and Y1; : : : ; Yp real or interval-valued variables such that Yj (!i) = [lij ; uij ]. We shall consider real-valued variables as a special case of interval-valued ones; it is therefore equivalent to write Yj (!i) = x or Yj (!i) = [x; x].

Let A = f!1; : : : ; !hg E. Generalization by union is de ned (see [ 3 ]) by the mapping f : P (E) ! Ip where I is the set of intervals of IR endowed with the inclusion order, such that f (A) = (I1; : : : ; Ip), with Ij = [M in flij g ; M ax fuij g], !i 2 A, j = 1; : : : ; p, i.e., for each j = 1; : : : ; p; Ij is the minimum interval (for the inclusion order) that covers all values taken by the elements of A for variable Yj . Let g : Ip ! P (E) be the mapping de ned as g((I1; : : : ; Ip)) = = f!i 2 E : Yj (!i) Ij ; j = 1; : : : ; pg, i.e., the set of elements of E taking values within Ij ; for j = 1; : : : ; p. The couple (f; g) is a Galois connection.

Likewise, we may generalise by intersection de ning f and g as follows: f : P (E) ! Ip, f (A) = (I1; : : : ; Ip), with Ij = [M ax flij g ; M in fuij g] if M ax flij g M in fuij g ; !i 2 A, Ij = otherwise (i.e., the largest interval contained in all intervals taken by the elements of A for variable Yj , which may be empty), for j = 1; : : : ; p, and g : Ip ! P (E) with g ((I1; : : : ; Ip)) = f!i 2 E : Yj (!i) Ij ; j = 1; : : : ; pg (the set of elements of E taking intervalvalues that contain Ij ;) for j = 1; : : : ; p. The couple (f ; g ) forms also a Galois connection.

Example 1:

Consider three persons, Ann, Bob and Charles characterized by two variables, age and amount of time (in minutes) necessary to go to work (which varies from day to day, and is therefore represented by an interval-valued variable), as presented in Table 1.

Age Time Ann 25 [15; 20] Bob 32 [25; 30]

Charles 40 [10; 20]

Let A = fBob,Charlesg. Generalization by the union leads to f (A) = ([32; 40]; [10; 30]), describing people who are between 32 and 40 years old and take 10 to 30 minutes to go to work; in this dataset people meeting this description are given by g(f (A)) = g(([32; 40]; [10; 30])), i.e., fBob, Charlesg = A. Here, (fBob, Charlesg; ([32; 40]; [10; 30])) is a concept.

Modal variables

Yj (!i) = Two Galois connections may also be de ned for the case of modal variables (see [ 4 ]). Let Y1; : : : ; Yp be p modal variables, Oj = mj1; : : : ; mjkj the set of kj possible categories of variable Yj , Mj the set of distributions de ned on Oj , for j = 1; : : : ; p, and M = M1 : : : Mp. For variable Yj and element !i 2 E, n

mj1(pj!1i ); : : : ; mjkj (pj!kij )o, where pj!ki` is the probability/frequency associated with category mj` (` = 1; : : : ; kj ) of variable Yj , and element !i. Let A = f!1; : : : ; !hg E.

To generalise by the maximum we take, for each category mj`, the maximum of its probabilities/frequencies in A. Let f : P (E) ! M , such that f (A) = (d1; : : : ; dp), with dj = fmj1(tj1); : : : ; mjkj (tjkj )g; where tj` = M axfpj!`i ; !i 2 Ag; ` = 1; : : : ; kj . The intent of a set A E is then to be interpreted as \objects with at most tj` cases presenting category mj`; ` = 1; : : : ; kj ; j = 1; : : : ; p". The couple (f; g) with g : M ! P (E) de ned as, for dj = fmj1(pj1); : : : ; mjkj (pjkj )g; g ((d1; : : : ; dp)) = n!i 2 E : pj!`i a Galois connection. g((d1; : : : ; dp)) = n!i 2 E : pj!`i pj`; ` = 1; : : : ; kj ; j = 1; : : : ; po, forms a Galois connection.

Similarly, we may generalise by the minimum taking for each category the minimum of its probabilities/frequencies. Let f : P (E) ! M , f (A) = (d1; : : : ; dp), with dj = fmj1(vj1); : : : ; mjkj (vjkj )g; where vj` = M infpj!`i ; !i 2 Ag; ` = 1; : : : ; kj . The intent of a set A E is now interpreted as \objects with at least vj` cases presenting category mj`; ` = 1; : : : ; kj ; j = 1; : : : ; p".

The couple (f ; g ) with g : M ! P (E) such that, for dj = fmj1(pj1); : : : ; mjkj (pjkj )g; pj`; ` = 1; : : : ; kj ; j = 1; : : : ; po forms likewise

Example 2:

Consider four groups of students for each of which a categorical mark is given, according to the following scale: a: mark < 10, b: mark between 10 and 15, c: mark > 15 as summarized in Table 2.

The intent, obtained by the maximum operator, of the set formed by groups 1 and 2, is fa(0:3); b(0:6); c(0:4)g and is interpreted as \students' groups with at most 30% of marks a, at most 60% of marks b and at most 40% of marks c". The corresponding extent comprehends groups 1, 2, 3 and 4. If, alternatively, we determine the intent of the same set by the minimum operator, we obtain fa(0:2); b(0:3); c(0:2)g, to be read as \students' groups with at least 20% of marks a, at least 30% of marks b and at least 20% of marks c", whose extent is formed by groups 1, 2 and 5. 4

A common approach: generalization by intervals

We now present a unique framework allowing to perform generalization for numerical (real or interval-valued) variables, ordinal variables and modal variables, based on generalization by intervals.

For numerical (real or interval-valued) data, we are in the above mentioned case of generalization by taking the union.

For modal variables, it amounts to consider, for each category, an interval corresponding to the range of its probability/frequency. In fact, it has often been observed that generalization either by the maximum or by the minimum, as de ned in Section 3, may quickly lead to over-generalization. As a consequence, f (A), A E, is not very informative.

Let MjI = fmj`(Ij`); ` = 1; : : : ; kj g; mj` 2 Oj ; Ij` [0; 1] and M I = M1I : : : MpI . Generalization is now de ned as

f I : P (E) ! M I f I (A) = (d1; : : : ; dp) with dj = fmj1(Ij1); : : : ; mjkj (Ijkj )g; where Ij` = h M infpj!`i g; M axfpj!`i gi ; !i 2 A; ` = 1; : : : ; kj , j = 1; : : : ; p and

gI : M I ! E g((d1; : : : ; dp)) = n!i 2 E : pj!`i 2 Ij`; ` = 1; : : : ; kj ; j = 1; : : : ; p o The so-de ned couple of mappings (f I ; gI ) forms a new Galois connection.

On the data of Example 2, generalization by intervals of groups 1 and 2 provides the intent fa [0:2; 0:3] ; b [0:3; 0:6] ; c [0:2; 0:4]g, to be read as \students' groups having between 20% and 30% cases of mark a, between 30% and 60% cases of mark b and between 20% and 40% cases of mark c" and whose extent now only contains groups 1 and 2.

The case of ordinal variables has been addressed in [ 11 ], performing generalization either using the maximum or the minimum. To allow for more exibility, the author proposes to choose the operator individually for each variable. Nevertheless, one of these generalization operators must be chosen in each case, and over-generalization is not prevented. Our proposal for this type of variables, is to generalise a set A E considering, no longer a minimum or a maximum, but rather an interval of ordinal values.

Example 3: Consider the classi cations given by four cinema critics while evaluating three movies, Movie 1, Movie 2 and Movie 3 as given in Table 3.

The intent obtained by using the maximum operator of the group formed by critics 1 and 2 is (5; 5; 4), to be interpreted as \critics giving at most mark 5 to Movie 1, at most mark 5 to Movie 2 and at most mark 4 to Movie 3" - which is obviously too general and would cover almost everyone; in this dataset the corresponding extent contains critics 1, 2, 3 and 4. Therefore, the class formed by critics 1 and 2, who present a similar behavior, does not correspond to a concept. The intent obtained by using the minimum operator of the group formed by critics 3 and 4 is (1; 1; 1), to be read \critics giving at least mark 1 to Movie 1, at least mark 1 to Movie 2 and at least mark 1 to Movie 3" - which would cover every critic; its extent in this dataset consists again of critics 1, 2, 3 and 4. Here again, the class formed by critics 3 and 4, who give quite similar marks, does not correspond to a concept. If we now perform generalization by interval-vectors of the group formed by critics 1 and 2, we obtain the intent ([5; 5] ; [4; 5] ; [4; 4]); likewise for the group formed by critics 3 and 4, we have ([1; 2] ; [1; 2] ; [1; 2]); in the rst case we are clearly referring to critics giving high marks while in the second case we describe critics giving low marks to all movies. The corresponding extents no longer contain other critics, presenting a rather di erent pro le from those considered each time. Furthermore, both (fCritic 1, Critic 2g; ([5; 5] ; [4; 5] ; [4; 4]) and (fCritic 3, Critic 4g; ([1; 2] ; [1; 2] ; [1; 2]) are concepts. When determining concepts, according to the minimum or the maximum operators, e.g. in a clustering context, there is therefore a risk of forming heterogeneous clusters, since over-generalization may lead to a too large extent. By taking interval-vectors of observed values, the over-generalization problem is avoided. To conclude this section, we now present a more general example, with variables of the di erent considered types.

Example 4:

Consider the data in Table 4, where 4 persons are described by their age, a real-valued variable, time (in minutes) they take to go to work, an intervalvalued variable, the means of transportation used, a modal variable, and their classi cations given to three newspapers, A, B and C (ordinal variable).

Age 25 40 32 58 Concepts are theoretically very interesting, and do provide rich information on the values shared by subsets of elements of the set under study. However, the number of concepts of a data array is often rather large, even for relatively low cardinals of the sets of elements and variables. This fact makes the analysis and interpretation of results a bit delicate. It is often to be noticed that when analyzing the concepts generated by numerical or modal variables, groups of concepts appear which are quite similar. This may be due to noise or minor di erences, generally not pertinent. The idea is therefore to extract only those concepts which are representative of these groups of similar concepts, so as to obtain a more concise representation with signi cantly homogeneous concepts.

Several solutions may be pointed out for this objective. We will focus on the notion of stability, as introduced in [ 8 ] and [ 12 ], which evaluates the amount of information of the intent that depends on speci c objects of the concept's extent. Formally, the stability of a concept is de ned as the probability of keeping its intent unchanged while deleting arbitrarily chosen objects of its extent.

When analyzing data described by numerical (real or interval-valued), ordinal or modal variables, and generalizing using interval-vectors (as described in the previous sections), we shall apply a similar approach to each formed concept, but introducing a distance measure. The objective being to retain the homogeneous concepts, it is wished to avoid that a single element of the concepts' extent produces an important increase in the intent's intervals' ranges.

To identify the stable concepts, a threshold depending on the maximum distance is de ned (so as no to be dependent from the variables' scales). A concept is said to be \stable" if the distance between the intent obtained by removing one element of the extent at a time, and its original intent, is not above the given threshold. This is in fact a cross-validation-like approach, in that one element of the extent is removed at a time, and the resulting intent is compared with the original one.

When data have an interval form, interval distances should be used. Different measures are available in the literature; we will focus on three interval distance measures: the Hausdor distance, the interval Euclidean distance and the interval City-Block distance.

Let Ii = [li; ui] and Ih = [lh; uh] be two intervals we wish to compare. The Hausdor distance dH , the interval Euclidean distance d2 and the interval CityBlock distance d1 between Ii and Ih are respectively dH (Ii; Ih) = M ax ffjli lhj ; jui

uhjg d2(Ii; Ih) = p(li d1(Ii; Ih) = jli lh)2 + (ui

uh)2 lhj + jui uhj : The Hausdor distance between two sets is the maximum distance of a set to the nearest point in the other set, i.e., two sets are close in terms of the Hausdor distance if every point of either set is close to some point of the other set. Interval Euclidean and City-Block distances are just the counterparts of the corresponding distances for real values; if we embed the interval set in IR2, where one dimension is used for the lower and the other for the upper bound of the intervals, then these distances are just the Euclidean and City-Block distances between the corresponding points in the two-dimensional space.

Let C = (A; D) be a concept, where A = f!1; : : : ; !hg E is its extent and D = (I1; : : : ; Ip) is its intent, D = f (A). The considered criterion is then the distance between D et D i where D i is the intent of A without !i, D i = f (A n f!ig); i = 1; : : : ; h, de ned by: = M axf (D; D i); !i 2 Ag, measuring the dissimilarity between interval-vectors.

Let d be the distance (according to the chosen measure) between the intervals corresponding to variable Yj in a concept's intent. Two options may then be foreseen, whether it is wished to consider the maximal or the average distance on the intervals de ning the intents: 1. 2.

Max(D; D i) = M axfd(Ij ; Ij i)g, j indexing the variable set Yj ; j = 1; : : : ; p in the case of numerical and ordinal variables, and the global category set O = O1 [ : : : [ Op in the case of p modal variables;

Mean(D; D i) = M eanfd(Ij ; Ij i)g, j as in 1.

A concept C = (A; D) is then considered to be stable if . This approach allows keeping only the stable, and therefore more representative, concepts, avoiding the e ect of outlier observations. 6

Illustrative application

Consider again classi cations given by cinema critics evaluating three movies, Movie 1, Movie 2 and Movie 3 where Yj (Critici) is the mark given by Critic i to Movie j, i = 1; : : : ; 5; j = 1; 2; 3, as given in Table 5.

Tables 6 and 7 list the concepts obtained when the Minimum and the Maximum generalization operators are used, respectively.

The concepts (except for the empty extent one) obtained from this data table, using generalization by intervals, i.e., for A E; f (A) = (I1; I2; I3); with Ij = [M in fYj (Critici)g ; M ax fYj (Critici)g], Critici 2 A; j = 1; 2; 3, are listed in Table 8.

We notice that all the concepts obtained using the Minimum or the Maximum operator are concepts for the interval generalization, although with a di erent meaning, given the di erent intent mapping. As discussed before, even in this small example it may be observed that concepts obtained using the Minimum or the Maximum operator often present a rather general intent, thus leading to over-generalization in the concept formation. Consider, for instance, the concept (f1g , (Movie 1 3 , Movie 2 2 , Movie 3 3)) in Table 6, it indicates that Critic 1 gives high marks to each movie, which is not really the case, whereas the concept (f1g , (Movie 1 2 [3; 3] , Movie 2 2 [2; 2] , Movie 3 2 [3; 3])) in Table 8 gives a much more accurate description of the concepts's extent. Also, concept (f3g , (Movie 1 5 , Movie 2 5 , Movie 3 1)) in Table 7 describes Critic 3 as giving any marks to Movies 1 and 2, and low marks to Movie 3; using interval generalization we learn that the marks given by Critic 3 to Movies 1 and 2 are the highest and non other. Consider now concept (f3;4g , (Movie 1 4 , Movie 2 3 , Movie 3 1)) in Table 6: the intent reports any mark for Movie 3 (in particular, high marks are possible); if we use interval generalization instead we obtain the concept (f3;4g , (Movie 1 2 [4;5] , Movie 2 2 [3;5] , Movie 3 2 [1;2] which more accurately describes the observed situation.

We now compare the concepts retained as stable with each of the three distances, using both Max and Mean, and a threshold value of 1 and 2. The identi ed stable concepts in each case, represented by the corresponding extent, are listed in Table 9.

Distance Criterion Threshold dH Max 1

As it may be seen from Table 9, for all distances and both criteria, a demanding threshold identi es a small number of stable concepts, therefore leading to an important reduction in the number of retained concepts; if we use a more liberal threshold, a larger number of concepts are retained as stable, as was to be expected. The maximum criterion is naturally more strict than the mean, which retains more concepts as stable, for all distances and both threshold values. Finally, in this example, no important di erence appears between the results obtained for the di erent distance measures. 7

Conclusion

A common generalization procedure, for numerical, ordinal and modal variables, which uses a representation based on interval-vectors is presented. This allows de ning more homogeneous concepts, than generalization operators that use the maximum and/or the minimum. The proposed approach for ordinal variables allows addressing recommendation systems, analyzing preference data tables. It would also be interesting to explore how the proposed generalization operator behaves in a supervised learning context.

The number of obtained concepts being often rather large, a method for identifying stable concepts is proposed, using a cross-validation-like approach. This allows avoiding the e ect of atypical elements in the concepts' formation. Naturally, the value of the used threshold has an important in uence in the rate of concept reduction. The next step will be to explore this methodology for larger data tables, so as to have a more accurate evaluation of its e ciency in concept reduction. Another issue interesting to investigate is the comparaison of the list of concepts with those obtained with a subset of the given variables. This then leads to the problem of variable selection in the context of Galois lattices construction and analysis. As concerns applications, we are particularly interested in analyzing real preference data, for application in recommendation systems.

[1]

Assaghir ,

Kaytoue ,

Messai and A . Napoli ( 2009 ). On the mining of numerical data with Formal Concept Analysis and similarity . In Proc. Societe Francophone de Classi cation, pp. 121 - 124 .

[2] Barbut , M. and

Monjardet ( 1970 ). Ordre et Classi cation , Algebre et Combinatoire , Tomes

I et II

. Paris: Hachette.

[3] Brito , P. ( 1994 ). Order structure of symbolic assertion objects . IEEE Transactions on Knowledge and Data Engineering 6 ( 5 ), 830 { 835 .

[4] Brito , P. and G. Polaillon ( 2005 ). Structuring probabilistic data by Galois lattices . Math. & Sci. Hum. / Mathematics and Social Sciences 169 ( 1 ), 77 { 104 .

[5] Ganter , B. and S.O. Kuznetsov ( 2001 ). Pattern structures and their projections . In: G. Stumme and H. Delugach (Eds.), Proc. 9th Int. Conf. on Conceptual Structures, ICCS'01, Lecture Notes in Arti cial Intelligence , vol. 2120 , pp. 129 - 142 .

[6] Ganter , B. and

Wille ( 1999 ). Formal Concept Analysis , Mathematical Foundations . Berlin: Springer.

[7] Kaytoue , M. ,

S.O.

Kuznetsov ,

Napoli and

Duplessis ( 2011 ). Mining gene expression data with pattern structures in formal concept analysis . Information Sciences , Volume 181 , Issue

, 1989 { 2001 .

[8] Kuznetsov , S. ( 2007 ). On stability of a formal concept . Annals of Mathematics and Arti cial Intelligence 49 ( 1-4 ), 101 { 115 .

[9] Noirhomme-Fraiture , M. and

Brito ( 2011 ). Far beyond the classical data models: Symbolic Data Analysis . Statistical Analysis and Data Mining 4 ( 2 ), 157 { 170 .

[10] Pernelle , N. , M.-C. Rousset , and V. Ventos ( 2001 ). Automatic construction and re nement of a class hierarchy over multi-valued data . In L. De Raedt and A . Siebes (Eds.), Principles of Data Mining and Knowledge Discovery, Lecture Notes in Computer Science , pp. 386 { 398 .

[11] Pfaltz , J. ( 2007 ). Representing numeric values in concept lattices . In J. Diatta,

Eklund and M. Liquiere (Eds.), Proc. Fifth International Conference on Concept Lattices and Their Applications , pp. 260 { 269 .

[12] Roth , C. ,

Obiedkov and

Kourie ( 2008 ). On succint representation of knowledge community taxonomies with Formal Concept Analysis . International Journal of Foundations of Computer Science 19 ( 2 ), 383 { 404 .