
Pattern Structures for Identifying Biclusters with Coherent Sign Changes

Nyoman Juniarta

nyoman.juniarta@loria.fr 2

Victor Codocedo

victor.codocedo@inf.utfsm.cl 1

Miguel Couceiro

miguel.couceiro@loria.fr 2

Mehdi Kaytoue

mehdi.kaytoue@insa-lyon.fr 0

Amedeo Napoli

amedeo.napoli@loria.fr 2

0 Univ Lyon, INSA Lyon, CNRS, LIRIS UMR 5205, F-69621 Lyon, France
1 Universidad Técnica Federico Santa María, Chile
2 Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France

In this paper we study the task of finding coherent-sign-changes biclusters in a binary matrix. This task can be applied to the interpretation of gene expression data, where such a bicluster represents a set of experiments that affect a set of genes in a consistent way. We start with a binary table and study biclustering methods based on FCA and partition pattern structures. Pattern concepts provide biclusters and their hierarchical relation, which can be used to analyze the profile of genes in the given expression data. Our approach is purely symbolic, so we can detect larger biclusters and work with rather complex data.

Keywords: biclustering, FCA, gene expression, pattern structures

Gene expression data can be represented as a matrix, where rows and columns represent genes and experiments respectively. Each cell contains the numeric expression level of a given gene under a given experiment. In such data, we can say that an experiment affects a gene by either lowering or raising its expression relative to the gene's normal level. One may be interested in finding a subset of genes and a subset of experiments, such that the experiments affect the genes in a consistent way. In other words, any two experiments in the subset always have either the same effect or the opposite effect on every gene in the subset. This task corresponds to the mining of coherent-sign-changes (CSC) biclusters.

Biclustering is an important technique aimed at discovering patterns in a matrix representing a dataset. It is related to standard clustering, whose main objective is to group the rows based on their similarity. Biclustering, in contrast, refers to the problem of discovering submatrices whose cells exhibit similar behavior. This problem is also called co-clustering [9], where rows and columns are clustered simultaneously.

In this paper, we present a method based on FCA and pattern structures for discovering a specific type of bicluster: the coherent-sign-changes bicluster. An existing approach in [20] can mine this bicluster type, but it is statistical, since its discovery of CSC biclusters is based on the magnitude of the expression changes. Our approach is more symbolic: it takes into account only the direction of the changes, with the expectation of detecting larger biclusters. Our FCA-based method also gives us the hierarchical structure of all biclusters, allowing an easier interpretation of the results by experts. Furthermore, pattern structures and the AddIntent algorithm allow us to define a threshold on bicluster size, so that we can limit the number of retrieved biclusters.

This paper is organized as follows. First, we discuss some related work about the discovery of biclusters with coherent sign changes in Section 2. Then, some basic definitions of biclustering are presented in Section 3. We explain our approach in Section 4, and discuss the experiments in Section 5. Finally, we conclude our article and give some research perspectives in Section 6.

Related Work

Row–column clustering was introduced in [10], and Cheng and Church [3] were the first to use the term biclustering while working on gene expression data. A bicluster in [3] is a subset of genes and a subset of conditions with a high similarity score, statistically measured by calculating variances over all values in the submatrix.

Still in the domain of gene expression data, the algorithm called SAMBA was proposed in [20] to discover a submatrix where the expressions of a subset of genes change significantly across a subset of conditions. The first model of SAMBA searches for a submatrix where there is a joint change across all genes, regardless of whether it is an increase or a decrease. The second model takes into account the direction of the change, such that any two conditions in the submatrix have either always the same effect or always the opposite effect. We call this type of submatrix a coherent-sign-changes bicluster, as denoted in [18].

Regarding bicluster discovery based on FCA, several methods have been proposed. In a binary matrix, dense approximate bicluster discovery was studied in [8, 11] based on standard FCA. This is similar to mining formal concepts, but instead of "exact" concepts, the authors relax the problem such that "approximate" concepts (having a certain number of empty cells) can also be detected. For biclustering with similar values in a numerical matrix, Kaytoue et al. [17] proposed standard FCA with scaling and interval pattern structures. Triadic Concept Analysis was also studied in [15] to extract this bicluster type. Furthermore, a partition pattern structure was presented in [5, 13] for mining biclusters with constant columns.

Biclustering

In this section, we recall the basic background and discuss illustrative examples of the different types of biclusters as described in [18]. We consider that a dataset is composed of a set of objects $G$, each of which has values over a set of attributes $M$. This dataset can be represented as a numerical matrix where each cell indicates the value of an object w.r.t. an attribute, as formalized in Def. 1.

Definition 1 (Dataset). A dataset is a numerical context $(G, M, I)$ where $G$ is a set of objects, $M$ is a set of attributes, and $I$ corresponds to $m(g)$, which is the value of $m \in M$ for object $g \in G$.

One may be interested in finding which subset of objects possesses the same values w.r.t. a subset of attributes. Regarding the matrix representation, this is equivalent to the problem of finding a submatrix that has a constant value over all of its elements (example in Table 1a). This task is called biclustering with constant values, which is a simultaneous clustering of the rows and columns of a matrix.

In a coherent-sign-changes (CSC) bicluster, the matrix is binary. In this bicluster, each row is correlated (either entirely identical or entirely opposite) to all other rows. In the example in Table 1c, the first row is identical to the second and opposite to the third and fourth. We can also see this bicluster by comparing the columns. In the example, the first column is identical to the second and opposite to the third. Before formalizing the definition of a CSC bicluster, we first introduce the notion of a column submatrix and its similarity.

Definition 2 (Column submatrix). In a binary dataset $(G, M, I)$, given a set of objects $A \subseteq G$ and an attribute $m \in M$, $m(A)$ is the column submatrix formed by the attribute $m$ over $A$.

The submatrix $m_j(A)$ is similar to $m_k(A)$, denoted as $m_j(A) \simeq m_k(A)$, if all rows in $m_j(A)$ are either entirely identical or entirely opposite to the corresponding rows in $m_k(A)$. This can be formally written as:

$$m_j(A) \simeq m_k(A) \iff \big(\forall g \in A : m_j(g) = m_k(g)\big) \text{ or } \big(\forall g \in A : m_j(g) = \neg m_k(g)\big).$$

With the previous notation, the definition of a CSC bicluster is given as follows.

Definition 3 (Coherent-sign-changes bicluster). Given a binary dataset $(G, M, I)$, a pair $(A, B)$ (where $A \subseteq G$, $B \subseteq M$) is a coherent-sign-changes bicluster if $\forall m_j, m_k \in B : m_j(A) \simeq m_k(A)$.
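As an illustration of Defs. 2 and 3, the following minimal Python sketch tests whether a pair of object and attribute sets forms a CSC bicluster. The encoding (+1 for "+", -1 for "-"), the function names, and the dictionary layout are assumptions made for this example only; the table values are the ones implied by the s-partitions given for Table 2 in Section 4.

```python
from itertools import combinations

# Small binary table; +1 stands for "+" and -1 for "-".
# The values follow the s-partitions that the paper lists for Table 2.
TABLE = {
    "g1": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g2": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g3": {"m1": -1, "m2": -1, "m3": +1, "m4": -1},
    "g4": {"m1": +1, "m2": +1, "m3": +1, "m4": +1},
    "g5": {"m1": -1, "m2": -1, "m3": -1, "m4": -1},
}


def columns_similar(table, objects, mj, mk):
    """Def. 2 similarity: mj and mk agree on all objects, or disagree on all."""
    identical = all(table[g][mj] == table[g][mk] for g in objects)
    opposite = all(table[g][mj] == -table[g][mk] for g in objects)
    return identical or opposite


def is_csc_bicluster(table, objects, attributes):
    """Def. 3: every pair of columns of the submatrix must be similar."""
    return all(
        columns_similar(table, objects, mj, mk)
        for mj, mk in combinations(sorted(attributes), 2)
    )


print(is_csc_bicluster(TABLE, ["g1", "g2", "g3"], ["m1", "m2", "m3"]))        # True
print(is_csc_bicluster(TABLE, ["g1", "g2", "g3"], ["m1", "m2", "m3", "m4"]))  # False
```

The second call fails because m1 and m4 are opposite on g1 but identical on g3.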

In bicluster discovery, a bicluster can be found entirely within another larger bicluster. We then say that this small bicluster is not maximal. The notion of maximal bicluster is the same for any type of bicluster, and is given in the following definition.

Definition 4 (Maximal bicluster). Given a dataset $(G, M, I)$, a bicluster $(A, B)$ is a maximal bicluster if adding an object $g \in G \setminus A$ to $A$ or an attribute $m \in M \setminus B$ to $B$ does not result in a bicluster.
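Def. 4 can be checked directly by brute force on a small table. The sketch below is only an illustration under assumed names and an assumed +1/-1 encoding; it is not the lattice-based detection used later in the paper. The table values are those implied by the s-partitions given for Table 2.

```python
from itertools import combinations

# +1 stands for "+", -1 for "-"; values as implied by the paper's Table 2.
TABLE = {
    "g1": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g2": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g3": {"m1": -1, "m2": -1, "m3": +1, "m4": -1},
    "g4": {"m1": +1, "m2": +1, "m3": +1, "m4": +1},
    "g5": {"m1": -1, "m2": -1, "m3": -1, "m4": -1},
}


def is_csc_bicluster(table, objects, attributes):
    """Every pair of columns is identical on all objects or opposite on all."""
    def similar(mj, mk):
        identical = all(table[g][mj] == table[g][mk] for g in objects)
        opposite = all(table[g][mj] == -table[g][mk] for g in objects)
        return identical or opposite
    return all(similar(mj, mk) for mj, mk in combinations(sorted(attributes), 2))


def is_maximal(table, objects, attributes):
    """Def. 4 by brute force: no single object or attribute can be added."""
    objects, attributes = set(objects), set(attributes)
    if not is_csc_bicluster(table, objects, attributes):
        return False
    all_attributes = set(next(iter(table.values())))
    for g in set(table) - objects:
        if is_csc_bicluster(table, objects | {g}, attributes):
            return False
    for m in all_attributes - attributes:
        if is_csc_bicluster(table, objects, attributes | {m}):
            return False
    return True


print(is_maximal(TABLE, {"g1", "g2", "g4", "g5"}, {"m1", "m2"}))        # False
print(is_maximal(TABLE, {"g1", "g2", "g3", "g4", "g5"}, {"m1", "m2"}))  # True
```

The first bicluster is not maximal because g3 can be added without breaking the similarity of m1 and m2.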

The Pattern Structures of Signed Partition

In the task of CSC bicluster discovery in a formal context $(G, M, I)$, we propose an approach based on partition pattern structures. Instead of partitions of the objects in $G$ as described in [2, 5], here we use partitions of the attributes in $M$. This is still similar to an object partition, since an attribute partition covers every attribute in $M$ and no two partition components overlap.

To formally define our signed partition, first we define the notion of signed attribute and signed partition component as follows.

Definition 5 (Signed attribute). Let $M$ be a set of attributes, $m \in M$ be an attribute, and $\sigma \in \{-, +\}$ be a sign. A signed attribute $m^{\sigma}$ is an attribute $m$ having a sign $\sigma$.

Definition 6 (Signed partition component). A signed partition component (or sp-component) $c$ is a subset of $M$, where each attribute in $c$ is associated with its corresponding sign $\sigma$. Therefore, $c = \{m_1^{\sigma_1}, \dots, m_n^{\sigma_n}\}$.

For example, $m_1^+$ is a signed attribute where the sign $+$ is given to $m_1$, and $\{m_1^+, m_2^-, m_4^+\}$ is a signed partition component. Since an sp-component contains not only attributes but also their associated signs, we define the equality of two sp-components according to these two aspects as follows.

Definition 7 (SP-component equality). Any two sp-components are equal iff both contain the same set of attributes, and they have either entirely the same signs or entirely opposite signs.

Therefore, if we have $c_1 = \{m_1^+, m_2^-, m_4^+\}$, $c_2 = \{m_1^+, m_2^-, m_4^+\}$, and $c_3 = \{m_1^-, m_2^+, m_4^-\}$, then $c_1 = c_2 = c_3$.
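One possible way to implement the equality of Def. 7 is to bring every sp-component into a canonical form, so that ordinary equality can be used afterwards. The sketch below assumes an sp-component is a dictionary from attribute name to +1 or -1; the names `normalize` and `sp_equal` are hypothetical.

```python
def normalize(component):
    """Canonical form of an sp-component given as {attribute: sign}.

    Signs are +1 or -1. All signs are flipped, if needed, so that the
    smallest attribute name carries +1. Two components that are equal in
    the sense of Def. 7 then share the same canonical form.
    """
    if not component:
        return frozenset()
    flip = component[min(component)]  # +1 keeps the signs, -1 inverts them
    return frozenset((m, s * flip) for m, s in component.items())


def sp_equal(c1, c2):
    """Def. 7: same attributes, with entirely same or entirely opposite signs."""
    return normalize(c1) == normalize(c2)


c1 = {"m1": +1, "m2": -1, "m4": +1}
c2 = {"m1": +1, "m2": -1, "m4": +1}
c3 = {"m1": -1, "m2": +1, "m4": -1}
c4 = {"m1": +1, "m2": +1, "m4": +1}

print(sp_equal(c1, c2), sp_equal(c1, c3))  # True True
print(sp_equal(c1, c4))                    # False
```

Choosing the smallest attribute name as the reference is arbitrary; any fixed rule that picks one attribute per component would work.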

Definition 8 (Signed partition). A signed partition (or s-partition) $d$ is a collection of sp-components, written as $d = \{c_1, \dots, c_n\}$, such that every attribute in $M$ is present in exactly one sp-component.

For example, given $M = \{m_1, \dots, m_4\}$, then $\{\{m_1^+, m_2^-, m_4^+\}, \{m_3^+\}\}$ is a valid signed partition of $M$. The set of all possible s-partitions is denoted as $D$. This allows us to create an s-partition mapping $\delta : G \to D$ which assigns an object to an s-partition over $M$. For an object $g$, $\delta(g)$ is an s-partition containing only one sp-component. This sp-component contains all attributes in $M$ with the corresponding sign according to the object $g$. This mapping is formulated as follows:

$$\delta(g) = \{\{m_j^{\sigma_j} \mid m_j \in M\}\} \text{ where } \sigma_j = m_j(g). \tag{1}$$

Example from Table 2:

$$\begin{aligned}
\delta(g_1) = \delta(g_2) &= \{\{m_1^+, m_2^+, m_3^-, m_4^-\}\} \\
\delta(g_3) &= \{\{m_1^-, m_2^-, m_3^+, m_4^-\}\} \\
\delta(g_4) &= \{\{m_1^+, m_2^+, m_3^+, m_4^+\}\} \\
\delta(g_5) &= \{\{m_1^-, m_2^-, m_3^-, m_4^-\}\}.
\end{aligned} \tag{2}$$

Notice that since the sp-components in $\delta(g_4)$ and $\delta(g_5)$ contain the same attributes with entirely opposite signs, according to Def. 7 we have $\delta(g_4) = \delta(g_5)$.
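The mapping from an object to its s-partition can be sketched as follows. This is an illustrative encoding only: rows are dictionaries with +1 and -1, sp-components are stored in a canonical form (an assumption of this sketch) so that Def. 7 equality becomes ordinary equality, and the names `normalize` and `delta` are hypothetical.

```python
# +1 stands for "+", -1 for "-"; values as given for Table 2 in the text.
TABLE = {
    "g1": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g2": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g3": {"m1": -1, "m2": -1, "m3": +1, "m4": -1},
    "g4": {"m1": +1, "m2": +1, "m3": +1, "m4": +1},
    "g5": {"m1": -1, "m2": -1, "m3": -1, "m4": -1},
}


def normalize(component):
    """Flip all signs, if needed, so the smallest attribute name carries +1."""
    if not component:
        return frozenset()
    flip = component[min(component)]
    return frozenset((m, s * flip) for m, s in component.items())


def delta(table, g):
    """s-partition of object g: one sp-component holding every attribute."""
    return frozenset({normalize(table[g])})


print(delta(TABLE, "g1") == delta(TABLE, "g2"))  # True
print(delta(TABLE, "g4") == delta(TABLE, "g5"))  # True, by Def. 7
print(delta(TABLE, "g1") == delta(TABLE, "g3"))  # False
```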

Signed Partition Space

For the task of CSC bicluster discovery, here we define relations between any two s-partitions. The set of all possible s-partitions D is a meet-semilattice where we can define the meet of any two s-partitions.

First, we define the notation $m(c)$ as the sign of an attribute $m$ in an sp-component $c$. For example, if $c = \{m_1^+, m_2^-, m_3^-\}$, then $m_1(c) = +$. With this notation, we define the similarity ($\cap$) between any two sp-components as:

$$c_1 \cap c_2 = \{\{m_j^{\sigma} \in c_1 \mid m_j(c_1) = m_j(c_2)\},\ \{m_j^{\sigma} \in c_1 \mid m_j(c_1) = \neg m_j(c_2)\}\},$$

where $\sigma$ corresponds to the sign of $m_j$ in $c_1$, i.e. $m_j(c_1)$.

In other words, the operator $\cap$ between $c_1$ and $c_2$ gives $\{c_{12}, c_{1|2}\}$. Here $c_{12}$ represents all attributes that are present in $c_1$ and $c_2$ with the same sign, while $c_{1|2}$ represents all attributes that are present in $c_1$ and $c_2$, but with opposite signs. The signs in the resulting sp-components are the same as those in the first sp-component. Example: if

$$c_x = \{m_1^+, m_2^-, m_3^-, m_4^-\} \text{ and } c_y = \{m_1^+, m_2^-, m_3^+, m_4^+, m_5^-\},$$

then

$$c_x \cap c_y = \{\{m_1^+, m_2^-\}, \{m_3^-, m_4^-\}\}.$$

Since the signs in $c_{1|2}$ follow the first sp-component, the result of $c_1 \cap c_2$ could be different from $c_2 \cap c_1$. This is resolved by Def. 7, which ensures the commutativity of $\cap$. For example:

$$\begin{aligned}
c_x \cap c_y &= \{\{m_1^+, m_2^-\}, \{m_3^-, m_4^-\}\}, \\
c_y \cap c_x &= \{\{m_1^+, m_2^-\}, \{m_3^+, m_4^+\}\}, \\
c_x \cap c_y &= c_y \cap c_x.
\end{aligned}$$
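The similarity of two sp-components can be sketched in a few lines of Python. The dictionary encoding (+1 and -1 as signs) and the function names are assumptions of this example; the values of `cx` and `cy` are those of the example above.

```python
def similarity(c1, c2):
    """Similarity of two sp-components given as {attribute: sign}.

    Returns [same, opposite]: the shared attributes whose signs agree and
    those whose signs differ, both carrying the signs found in c1.
    """
    shared = set(c1).intersection(c2)
    same = {m: c1[m] for m in shared if c1[m] == c2[m]}
    opposite = {m: c1[m] for m in shared if c1[m] == -c2[m]}
    return [same, opposite]


def same_up_to_flip(c1, c2):
    """Def. 7 equality: same attributes, all signs equal or all opposite."""
    if c1.keys() != c2.keys():
        return False
    return all(c1[m] == c2[m] for m in c1) or all(c1[m] == -c2[m] for m in c1)


cx = {"m1": +1, "m2": -1, "m3": -1, "m4": -1}
cy = {"m1": +1, "m2": -1, "m3": +1, "m4": +1, "m5": -1}

xy = similarity(cx, cy)
yx = similarity(cy, cx)
print(xy)  # same part: m1+, m2-; opposite part: m3-, m4-
print(yx)  # same part: m1+, m2-; opposite part: m3+, m4+
print(all(same_up_to_flip(a, b) for a, b in zip(xy, yx)))  # True
```

The attribute m5 occurs only in `cy`, so it appears in neither part of the result.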

Having defined the similarity of any two sp-components, we can now define the similarity of any two s-partitions. The similarity (or the meet) of two s-partitions $d_1 = \{c_1, \dots, c_k\}$ and $d_2 = \{c_1, \dots, c_n\}$, with $k = |d_1|$ and $n = |d_2|$, is defined as:

$$d_1 \sqcap d_2 = \{c_i \cap c_j \mid \forall c_i \in d_1, c_j \in d_2\}, \tag{3}$$

and the order between two s-partitions is given by:

$$d_1 \sqsubseteq d_2 \iff d_1 \sqcap d_2 = d_1. \tag{4}$$

Let $C$ be the set of all sp-components in $M$, and $D$ the set of all s-partitions in $M$. We have $\cap : C^2 \to D$ and $\sqcap : D^2 \to D$. Example from Table 2:

$$\begin{aligned}
\delta(g_1) \sqcap \delta(g_3) &= \{\{m_1^+, m_2^+, m_3^-, m_4^-\}\} \sqcap \{\{m_1^-, m_2^-, m_3^+, m_4^-\}\} \\
&= \{\{m_4^-\}, \{m_1^+, m_2^+, m_3^-\}\}.
\end{aligned}$$

Suppose that $d_1 = \{\{m_4^-\}, \{m_1^+, m_2^+, m_3^-\}\}$. Then $d_1 \sqsubseteq \delta(g_1)$, $d_1 \sqsubseteq \delta(g_2)$, and $d_1 \sqsubseteq \delta(g_3)$.
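The meet and the order of s-partitions can be sketched as below. The sketch makes three assumptions that are not stated in the text: sp-components are stored in a canonical form so that Def. 7 equality is ordinary equality, the pairwise similarities are flattened into one collection, and empty components are dropped. All function names are hypothetical.

```python
def normalize(component):
    """Flip all signs, if needed, so the smallest attribute name carries +1."""
    if not component:
        return frozenset()
    flip = component[min(component)]
    return frozenset((m, s * flip) for m, s in component.items())


def similarity(c1, c2):
    """Split the attributes shared by c1 and c2 by sign agreement."""
    a, b = dict(c1), dict(c2)
    shared = set(a).intersection(b)
    same = {m: a[m] for m in shared if a[m] == b[m]}
    opposite = {m: a[m] for m in shared if a[m] != b[m]}
    return [normalize(same), normalize(opposite)]


def meet(d1, d2):
    """Meet of two s-partitions: all non-empty pairwise similarities."""
    parts = (c for ci in d1 for cj in d2 for c in similarity(ci, cj))
    return frozenset(c for c in parts if c)


def leq(d1, d2):
    """Order of s-partitions: d1 is below d2 iff their meet is d1."""
    return meet(d1, d2) == d1


def s_partition(*components):
    """Build an s-partition from {attribute: sign} dictionaries."""
    return frozenset(normalize(c) for c in components)


dg1 = s_partition({"m1": +1, "m2": +1, "m3": -1, "m4": -1})
dg3 = s_partition({"m1": -1, "m2": -1, "m3": +1, "m4": -1})
d = meet(dg1, dg3)

print(d == s_partition({"m4": -1}, {"m1": +1, "m2": +1, "m3": -1}))  # True
print(leq(d, dg1), leq(d, dg3), leq(dg1, dg3))                      # True True False
```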

In order to define a partial order among $d \in D$, the $\sqcap$ operator has to be commutative, idempotent, and associative. These properties are shown in the following propositions.

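Proposition 1. The operator $\sqcap$ is commutative, i.e. $d_1 \sqcap d_2 = d_2 \sqcap d_1$.

Proof. Consider $d_1 = \{c_1, \dots, c_n\}$ with $n = |d_1|$ and $d_2 = \{c_1, \dots, c_k\}$ with $k = |d_2|$.

$$\begin{aligned}
d_1 \sqcap d_2 &= d_2 \sqcap d_1 \\
\{c_i \cap c_j \mid \forall c_i \in d_1, c_j \in d_2\} &= \{c_j \cap c_i \mid \forall c_i \in d_1, c_j \in d_2\}
\end{aligned}$$

It is previously stated that $\cap$ is also commutative. Therefore, both sides of the equation above are equal.

Proposition 2. The operator $\sqcap$ is idempotent, i.e. $d_1 \sqcap d_1 = d_1$.

Proof. Since there is no overlap among $c_i \in d_1$, $c_i \cap c_j$ is an empty set for $i \neq j$. Therefore:

$$\begin{aligned}
d_1 \sqcap d_1 &= \{c_i \cap c_j \mid \forall c_i, c_j \in d_1 \text{ and } i = j\} \\
&= \{c_i \cap c_i \mid \forall c_i \in d_1\} \\
&= \{c_i \mid c_i \in d_1\}
\end{aligned}$$

Proposition 3. The operator $\sqcap$ is associative, i.e. $(d_1 \sqcap d_2) \sqcap d_3 = d_1 \sqcap (d_2 \sqcap d_3)$.

Proof. Consider $d_1 = \{c_1, \dots, c_n\}$ with $n = |d_1|$, $d_2 = \{c_1, \dots, c_p\}$ with $p = |d_2|$, and $d_3 = \{c_1, \dots, c_q\}$ with $q = |d_3|$.

$$\begin{aligned}
(d_1 \sqcap d_2) \sqcap d_3 &= \{c_i \cap c_j \mid \forall c_i \in d_1, c_j \in d_2\} \sqcap d_3 \\
&= \{c_i \cap c_j \cap c_k \mid \forall c_i \in d_1, c_j \in d_2, c_k \in d_3\} \\
&= d_1 \sqcap \{c_j \cap c_k \mid \forall c_j \in d_2, c_k \in d_3\} \\
&= d_1 \sqcap (d_2 \sqcap d_3)
\end{aligned}$$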
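The three propositions can also be checked empirically on a small example. The sketch below is not a proof: it only tests the properties on the object descriptions given for Table 2 and on all s-partitions reachable from them by repeated meets. The canonical encoding of sp-components, the flattening of pairwise similarities, and the removal of empty components are assumptions of this sketch.

```python
from itertools import product


def normalize(component):
    """Flip all signs, if needed, so the smallest attribute name carries +1."""
    if not component:
        return frozenset()
    flip = component[min(component)]
    return frozenset((m, s * flip) for m, s in component.items())


def similarity(c1, c2):
    """Split the attributes shared by c1 and c2 by sign agreement."""
    a, b = dict(c1), dict(c2)
    shared = set(a).intersection(b)
    same = {m: a[m] for m in shared if a[m] == b[m]}
    opposite = {m: a[m] for m in shared if a[m] != b[m]}
    return [normalize(same), normalize(opposite)]


def meet(d1, d2):
    """Meet of two s-partitions: all non-empty pairwise similarities."""
    parts = (c for ci in d1 for cj in d2 for c in similarity(ci, cj))
    return frozenset(c for c in parts if c)


ROWS = [
    {"m1": +1, "m2": +1, "m3": -1, "m4": -1},  # g1 and g2
    {"m1": -1, "m2": -1, "m3": +1, "m4": -1},  # g3
    {"m1": +1, "m2": +1, "m3": +1, "m4": +1},  # g4 (g5 is its opposite)
]

# Close the object descriptions under the meet operator.
descriptions = {frozenset({normalize(row)}) for row in ROWS}
while True:
    new = {meet(a, b) for a, b in product(descriptions, repeat=2)} - descriptions
    if not new:
        break
    descriptions |= new

commutative = all(meet(a, b) == meet(b, a)
                  for a, b in product(descriptions, repeat=2))
idempotent = all(meet(a, a) == a for a in descriptions)
associative = all(meet(meet(a, b), c) == meet(a, meet(b, c))
                  for a, b, c in product(descriptions, repeat=3))

print(len(descriptions), commutative, idempotent, associative)  # 7 True True True
```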

With the definition of similarity ($\sqcap$) and the associated partial ordering ($\sqsubseteq$) between two s-partitions, we then define the notion of signed partition pattern concept in the following subsection.

Signed Partition Pattern Structures

Let $G$ be a set of objects and $M$ a set of attributes. The meet-semilattice of s-partitions of $M$ is $(D, \sqcap)$, and $\delta : G \to D$ maps an object to an s-partition as defined in Eq. 1.

A signed partition pattern structure is determined by the triple $(G, (D, \sqcap), \delta)$, where the derivation operators for $A \subseteq G$ and $d \in D$ are defined as:

$$A^{\square} = \bigsqcap_{g \in A} \delta(g), \tag{5}$$

$$d^{\square} = \{g \in G \mid d \sqsubseteq \delta(g)\}. \tag{6}$$

$(A, d)$ is a signed partition pattern concept (or spp-concept) when $A^{\square} = d$ and $d^{\square} = A$. From an spp-concept $(A, d)$, a CSC bicluster is any pair $(A, c)$ where $c \in d$ (we can ignore the attribute signs in $c$). All spp-concepts from Table 2 are listed in Table 3. In the concept $(\{g_1, g_2, g_3\}, \{\{m_1^+, m_2^+, m_3^-\}, \{m_4^-\}\})$ for example, we can find the CSC bicluster $(\{g_1, g_2, g_3\}, \{m_1, m_2, m_3\})$. Looking back at the original table, this CSC bicluster means that for $A = \{g_1, g_2, g_3\}$, we have $m_1(A) \simeq m_2(A) \simeq m_3(A)$ (recall the definition of $\simeq$ in Section 3).
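For a table as small as Table 2, the two derivation operators can be applied to every non-empty subset of objects to enumerate the spp-concepts by brute force. The sketch below is only an illustration and is not the AddIntent-based procedure used in the paper; the encoding of sp-components in canonical form, the flattening inside `meet`, and all function names are assumptions. The concept with an empty extent, if any, is ignored.

```python
from functools import reduce
from itertools import combinations

TABLE = {
    "g1": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g2": {"m1": +1, "m2": +1, "m3": -1, "m4": -1},
    "g3": {"m1": -1, "m2": -1, "m3": +1, "m4": -1},
    "g4": {"m1": +1, "m2": +1, "m3": +1, "m4": +1},
    "g5": {"m1": -1, "m2": -1, "m3": -1, "m4": -1},
}


def normalize(component):
    """Flip all signs, if needed, so the smallest attribute name carries +1."""
    if not component:
        return frozenset()
    flip = component[min(component)]
    return frozenset((m, s * flip) for m, s in component.items())


def similarity(c1, c2):
    a, b = dict(c1), dict(c2)
    shared = set(a).intersection(b)
    same = {m: a[m] for m in shared if a[m] == b[m]}
    opposite = {m: a[m] for m in shared if a[m] != b[m]}
    return [normalize(same), normalize(opposite)]


def meet(d1, d2):
    parts = (c for ci in d1 for cj in d2 for c in similarity(ci, cj))
    return frozenset(c for c in parts if c)


def delta(table, g):
    return frozenset({normalize(table[g])})


def extent_to_intent(table, objects):
    """Eq. 5: meet of the descriptions of all objects in the extent."""
    return reduce(meet, (delta(table, g) for g in objects))


def intent_to_extent(table, d):
    """Eq. 6: all objects whose description is above d."""
    return frozenset(g for g in table if meet(d, delta(table, g)) == d)


def concepts(table):
    """Map each closed extent to its intent, by brute force over subsets."""
    found = {}
    for size in range(1, len(table) + 1):
        for objects in combinations(sorted(table), size):
            d = extent_to_intent(table, objects)
            found[intent_to_extent(table, d)] = d
    return found


for extent, intent in sorted(concepts(TABLE).items(), key=lambda kv: len(kv[0])):
    biclusters = sorted(sorted(m for m, _ in c) for c in intent)
    print(sorted(extent), biclusters)
```

Each printed line shows an extent followed by the attribute sets of its sp-components; every pair of an extent and one of these attribute sets is a CSC bicluster.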

The order between any two spp-concepts is given by $(A_1, d_1) \leq (A_2, d_2) \iff A_1 \subseteq A_2$ (or, equivalently, $d_2 \sqsubseteq d_1$). Using this order, the lattice of all spp-concepts from Table 2 can be constructed and is shown in Figure 1. It should be noted that the lattice is readable and interpretable only if its size is small. This lattice is useful not only for understanding the hierarchical structure among all biclusters, but also for detecting maximal biclusters.

By Def. 4, the bicluster $(\{g_1, g_2, g_4, g_5\}, \{m_1, m_2\})$ from Table 2 is not maximal, since we can add $g_3$ to obtain another bicluster $(\{g_1, g_2, g_3, g_4, g_5\}, \{m_1, m_2\})$. The non-maximal biclusters can be detected from the concept lattice [16]. Consider two concepts $(A_1, d_1)$ and $(A_2, d_2)$ such that $(A_1, d_1) \leq (A_2, d_2)$, and an sp-component $c$. The bicluster $(A_1, c)$ is maximal iff $c \in d_1$ and $c \notin d_2$.

For example, consider the concepts $p_1 = (\{g_1, g_2, g_4, g_5\}, \{\{m_1^+, m_2^+\}, \{m_3^+, m_4^+\}\})$ and $p_2 = (\{g_1, g_2, g_3, g_4, g_5\}, \{\{m_1^+, m_2^+\}, \{m_3^+\}, \{m_4^+\}\})$, where $p_1 \leq p_2$. We see that the sp-component $\{m_1^+, m_2^+\}$ is in the intent of both $p_1$ and $p_2$. Therefore, the bicluster $(\{g_1, g_2, g_4, g_5\}, \{m_1, m_2\})$ from $p_1$ is not maximal.

As described in Section 3, a CSC bicluster is a submatrix of a binary matrix. Therefore, a numerical matrix must first be transformed into a binary matrix. This can be done by scaling, for example by introducing a threshold, so that each numerical value is transformed to $+$ or $-$ depending on whether it is above or below the threshold. In gene expression data for example, the threshold can be the normal expression level of each gene. An expression that is above (or below) this normal level should be transformed to $+$ (or $-$ respectively).
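The threshold-based scaling can be sketched as follows, assuming one threshold per row (for example, one normal expression level per gene). The function name, the +1/-1 encoding, and the sample numbers are hypothetical. A value equal to its threshold is mapped to +1 here, which is only one possible convention.

```python
def binarize(matrix, thresholds):
    """Scale a numerical matrix to signs, using one threshold per row.

    Values at or above the row's threshold become +1, values below it
    become -1. Treating equality as +1 is an assumption of this sketch.
    """
    return [
        [+1 if value >= threshold else -1 for value in row]
        for row, threshold in zip(matrix, thresholds)
    ]


# Hypothetical expression levels: two genes (rows), three experiments (columns).
expression = [
    [0.5, -1.2, 0.0],
    [3.0, 1.0, 2.5],
]
normal_levels = [0.0, 2.0]

print(binarize(expression, normal_levels))  # [[1, -1, 1], [1, -1, 1]]
```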

In the task of mining formal concepts, the algorithm called AddIntent was proposed in [19]. This algorithm can be used for any pattern structure by defining the meet ($\sqcap$) and the order ($\sqsubseteq$) between any two descriptions. Having defined the meet in Eq. 3 and the order in Eq. 4, we then use AddIntent to mine spp-concepts in a binary matrix. Furthermore, this algorithm is also effective for building a concept lattice, which is needed in our case to detect the maximality of any CSC bicluster.

Experiments

As previously explained in Section 4, CSC biclusters can be found in any spp-concept. Therefore, from a binary matrix, we should retrieve all spp-concepts. To do that, we reuse the AddIntent source code in [4] by modifying the definitions of the $\sqcap$ and $\sqsubseteq$ operators. This algorithm also allows us to build the lattice of all concepts. We can reduce the lattice by choosing a threshold $\theta$ that applies to the intent of a concept. This threshold defines the minimal size of an sp-component that an intent should have. Since the lattice construction is performed by a bottom-up approach, this threshold allows us to "prune" the lattice.

For example, with $\theta = 3$, the lattice for Table 2 does not contain the concept $(\{g_1, g_2, g_4, g_5\}, \{\{m_1^+, m_2^+\}, \{m_3^+, m_4^+\}\})$, since none of its sp-components has 3 attributes, as shown in Figure 2.
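The effect of the threshold on a single intent can be sketched as a simple filter. This is an illustration only: in the paper the threshold is applied while the lattice is built, whereas the hypothetical function below checks an intent after the fact. Intents are written here as lists of attribute sets, with signs omitted.

```python
def passes_threshold(intent, theta):
    """Keep an intent if at least one sp-component has theta or more attributes."""
    return any(len(component) >= theta for component in intent)


# Two intents from the Table 2 example, with attribute signs omitted.
intent_p1 = [{"m1", "m2"}, {"m3", "m4"}]
intent_other = [{"m1", "m2", "m3"}, {"m4"}]

print(passes_threshold(intent_p1, 3))     # False
print(passes_threshold(intent_other, 3))  # True
```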

We tested our method on the lymphoma dataset provided in [1]. It contains the numerical expression levels of 4026 genes over 96 tissues. The objective of CSC bicluster discovery in this dataset is to find a subset of genes that behave in a consistent way over a subset of tissues. For this task, we convert this numerical dataset to binary by assigning $-$ to values below 0 and $+$ to values greater than or equal to 0.

The number of concepts and the runtime for different thresholds are listed in Table 4. We tested three thresholds: 70, 80, and 90. As shown there, a higher $\theta$ reduces the number of concepts, and consequently the runtime.

For $\theta = 70$, around 157K concepts are obtained. Among them, only 153K have an extent size larger than 1. This means that there are 153K CSC biclusters having at least 70 columns and at least 2 rows. Furthermore, still with $\theta = 70$, the largest extent size is 8, meaning that among the biclusters with at least 70 columns, there is no bicluster with more than 8 rows.

A higher $\theta$ corresponds to a higher number of columns in the biclusters and thus a lower number of rows. With $\theta = 90$, we see that among the biclusters with at least 90 columns, there is no bicluster with more than 3 rows.

Conclusion

In this paper we have presented an approach to mine biclusters with coherent sign changes in a binary matrix. We formulated our method based on partition pattern structures. The generated lattice allows us to examine the maximality of discovered biclusters. Another advantage is that we can choose a threshold that defines a minimum number of columns of the biclusters. This is also useful to reduce the number of concepts and the computational time.

The computational time can be further reduced by taking into account the maximality of a bicluster. If the intent of a concept $p$ contains an sp-component $c$ that is present in the intent of $p$'s superconcept, then this $c$ indicates a non-maximal bicluster. In this case, $c$ should be removed from the intent of $p$. However, this may change the definition of $\sqcap$.

Another aspect that should be studied is the possibility of a matrix that has another sign in addition to $+$ and $-$. This new sign can represent a missing value or, in the case of threshold-based transformation, a value that is equal to the threshold. The latter can be resolved using the tolerance relation introduced in [14], such that a value equal to the threshold is regarded as similar to both $+$ and $-$. The case of missing values can be resolved by modifying the definition of attribute partition in Def. 8 so that an attribute is permitted to be absent from every sp-component. This modification may consequently require modifications to the definition of meet and order between s-partitions.

Finally, CSC bicluster discovery can be applied in domains other than gene expression data. Frequent gradual itemset mining was studied in [7] to extract gradual rules from a numerical table, e.g. a hotel price table with 3 attributes: $m_p$ for city population, $m_d$ for distance from city center, and $m_r$ for room price. We may find an sp-component $\{m_p^+, m_d^-, m_r^+\}$. It is related to the rule saying "the more/less $m_p$, the less/more $m_d$, then the more/less $m_r$".

Moreover, some studies [6, 12] show the benefits of biclustering in recommender systems. In a user–movie rating matrix for example, a constant-column bicluster represents a set of users having the same interest across a set of movies. On the other hand, a CSC bicluster in this matrix represents a set of users having either the same or the opposite interest. This is useful for a new user $u$: we can recommend movies liked by users similar to $u$ and movies disliked by users opposite to $u$.

References

1. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769), 503 (2000)

2. Baixeries, J., Kaytoue, M., Napoli, A.: Characterizing functional dependencies in formal concept analysis with pattern structures. Annals of Mathematics and Artificial Intelligence 72, 129-149 (2014)

3. Cheng, Y., Church, G.M.: Biclustering of expression data. In: ISMB. vol. 8, pp. 93-103 (2000)

4. Codocedo, V., Bosc, G., Kaytoue, M., Boulicaut, J.F., Napoli, A.: A proposition for sequence mining using pattern structures. In: International Conference on Formal Concept Analysis. pp. 106-121. Springer (2017)

5. Codocedo, V., Napoli, A.: Lattice-based biclustering using partition pattern structures. In: Proceedings of the Twenty-first European Conference on Artificial Intelligence. pp. 213-218. IOS Press (2014)

6. Codocedo-Henríquez, V.: Contributions à l'indexation et à la récupération d'information utilisant l'analyse formelle de concepts. Ph.D. thesis, Université de Lorraine (2015)

7. Di-Jorio, L., Laurent, A., Teisseire, M.: Mining frequent gradual itemsets from large databases. In: International Symposium on Intelligent Data Analysis. pp. 297-308. Springer (2009)

8. Gnatyshak, D., Ignatov, D.I., Semenov, A., Poelmans, J.: Analysing online social network data with biclustering and triclustering. In: Proceedings of the "Concept Discovery in Unstructured Data" conference. vol. 871, pp. 30-39. Citeseer (2012)

9. Govaert, G., Nadif, M.: Co-clustering. Wiley-IEEE Press (2013)

10. Hartigan, J.A.: Direct clustering of a data matrix. Journal of the American Statistical Association 67(337), 123-129 (1972)

11. Ignatov, D.I., Kuznetsov, S.O., Poelmans, J.: Concept-based biclustering for internet advertisement. In: Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. pp. 123-130. IEEE (2012)

12. Ignatov, D.I., Poelmans, J., Zaharchuk, V.: Recommender system based on algorithm of bicluster analysis RecBi. arXiv preprint arXiv:1202.2892 (2012)

13. Kaytoue, M.: Traitement de données numériques par analyse formelle de concepts et structures de patrons. Ph.D. thesis, Université Henri Poincaré - Nancy 1 (2011)

14. Kaytoue, M., Assaghir, Z., Napoli, A., Kuznetsov, S.O.: Embedding tolerance relations in formal concept analysis: an application in information fusion. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1689-1692. ACM (2010)

15. Kaytoue, M., Kuznetsov, S.O., Macko, J., Napoli, A.: Biclustering meets triadic concept analysis. Annals of Mathematics and Artificial Intelligence 70(1-2), 55-79 (2014)

16. Kaytoue, M., Kuznetsov, S.O., Napoli, A.: Biclustering numerical data in formal concept analysis. In: International Conference on Formal Concept Analysis. pp. 135-150. Springer (2011)

17. Kaytoue, M., Kuznetsov, S.O., Napoli, A., Duplessis, S.: Mining gene expression data with pattern structures in formal concept analysis. Information Sciences 181(10), 1989-2001 (2011)

18. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 1(1), 24-45 (2004)

19. van der Merwe, D., Obiedkov, S., Kourie, D.: AddIntent: A new incremental algorithm for constructing concept lattices. In: International Conference on Formal Concept Analysis. pp. 372-385. Springer (2004)

20. Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(suppl_1), S136-S144 (2002)