-

BicOPT:Biochips Data Clustering algorithm

1Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE)

0 0 National Superior School of Engineers of Tunis (ENSIT), University of Tunis, Tunisia, 2Hihger Institute of Applied Language and Computer Science of Beja, University of Jendouba , Tunisia

Biochips present a new technology that allows to analyze the level of expression of genes, among the techniques that are applicable on this technology is the biclustering. The main objective of the latter is to extract groups of genes taking into account the coherence between all the conditions that characterize them. There are a variety of biclustering algorithms that have already been proposed in the field of biochips. Each of these algorithms differs from the others by a set of characteristics. In this paper, we focus on the BicFinder algorithm, where we propose to make improvements in order to make it faster. In the first place, we will present a fast variant of this algorithm. Then we will present our version of algorithm named BicOPT followed by a set of experiments applied to real data.

biochips biclustering BicFinder BicOPT Evaluation functions experimental study

In a data matrix, we can find links between the set of rows or between the set of columns, or between the set of rows and columns simultaneously. A technique called clustering only allows us to detect the first and second cases. So, this technique remains too simplistic to determine the third case. Another more interesting technique, called simultaneous classification, cross-classification or block classification. It is also referred to as a biclustering [8,9] hence the objective of this approach is to extract the groups of rows while taking into account the consistency with all the columns. This technique can be used in several fields, among which we mention that of Bio-chips.

The input file for a biclustering algorithm of biochip data is a data matrix, where the rows are filled by the names of the genes and the columns are the conditions. So, let a data • • • matrix M where n is the number of rows and m is the number of columns. A bigroupe B is a set of pairs (I, J), with I is a subset of rows of M and J is a subset of M columns, all of these subassemblies have a sub matrix called bigroupe. The aim of the clustering algorithms is to produce a coherent, stable and homogeneous bigroup. The homogeneity criteria vary from one algorithm to another. Generally, the biclustering problem is NP-difficult. We then used heuristic algorithms to construct biclusters close to the optimal. The problem of biclustering can be formulated as following [2]: ( ) = max ( ) (1) with

BBC(M) f is an objective function measuring the quality i.e., the degree of coherence, of a group of bigroupes.

BC(M) : is the set of all groups of possible bigroupes associated with M Madeira et Oliveira [9] propose to classify the algorithms of biclustering according to the approaches used for their construction. These approaches are classified according to five categories [7]: IRCCC (Iterative Row and Column Clustering Combination), DC (Divide and Conquer), GIS (Greedy Iterative Search), EBE (Exhaustive Bicluster Enumeration) et DPI (Distribution Parameter Identification). BicOPT is based on the BicFinder algorithm following the Greedy Iterative Search approach of a polynomial complexity O(n⁴m). So, in this paper we will present in the first place the BicFinder algorithm. In the second place, we will detail our BicOPT contributions and we will pass to the illustrations of the experimental study of our approach, we will end with a conclusion.

BicFinder

BicFinder is a systematic greedy algorithm, its polynomial complexity is equal to O(n⁵m), based on the construction of an acyclic directed graph (DAG). BicFinder allows to extract and produce a set of bigroupes close to what a biologist can do by looking for the maximum homogeneous zones. The stage of generation of bigroupes passes through 4 essential steps first of all the discretization of matrix M in M' (see equation 1), then the construction of DAG from M', then the extraction by applying the function ACSI (see equation 2) and validation using the ASR function (see equation 3). 2: Discretize M using Equation 7 to obtain M'// nⱼ).

Algorithm 1. BicFinder [1]

1: Input: M, α, β ; Output: B

Step of discretization

3: 7: Sort arcs of nᵢ in decreasing order according to 9: Ic=I′ᵢ U {gᵢ,gk}; Jc=J′ᵢ ᴜ {cl,cl+1 with T(M′[i, l] = M′[k, l]) = true}; 10: If ACSIᵢ(Ic, Jc) >= α then Bᵢ = (Ic, Jc) equation 7). 1 [ , ] < [ , + 1] (2) ′[ , ] {−1 [ , ] > [ , + 1]

0 [ , ] = [ , + 1] With i[1, ] and l[1. . − 1] The discretization allows us to know the shape of the gene expression profile (which can be either monotonically increasing or monotonically decreasing ...). 2.2

Construction of DAG

Our graph is associated with the matrix M ', where each node nᵢ has a gene gᵢ. Two nodes nᵢ and nⱼ are connected by an arc if and only if (i> j). CSl ᵢ, ⱼ is assigned for each arc (nᵢ,

ᵢ( ′, ′) = 2 ∗ ∑ ∈ ; ≥ +1 ∑ ∈ ; ≥ +1 |I′′|(|I′′| − 1) ( , , ) (3) (4) Our bigroup starts with an initial arc (MaxCSL ᵢ, ⱼ) and at each iteration we add an arc if and only if ACSIᵢ (I ', J')> = α otherwise we pass to the next arc. 2.4

Evaluation: ASR

The last step used is the evaluation of bigroupes generated by applying ASR function. g0 g1 g2 g3 g4 g5 g6 The DAG is constructed from the matrix M '. The arcs are sorted in decreasing order relative to the weight associated with each edge (with the weight equal to the sum of true).

, ∑ ∈| ′′∑|( |∈ ′′|;−≥ 1+)1 } pᵢⱼ = 1 − 6 ∑ =1( ( ) − ( ))2

( 2 − 1)

A bigroup is valid if its ASR> = β. 2.5

Clustering: K-medoids

After the presentation of the algorithm and the explanation of its operating principle, we describe, in this section, the BicFinder process using an illustrative example. So, we fix the parameter α which controls the extraction and addition of the arc and the parameter β which controls the validation of bigroupes. Let the parameters α = 0.75, β = 0.5. 0 ( , , ) = 0 ( , , ) = 0 ( , , ) = 0 ( , , ) =

CSI(0,1,2) + CSI(0,1,5) + CSI(0,2,5)

3(3 − 1)/2 3 2 2 = 4 + 4 + 4 = 0.58 < α

3 CSI(0,1,2) + CSI(0,1,4) + CSI(0,2,4)

3(3 − 1)/2 3 1 1 = 4 + 4 + 4 = 0.41 < α

3 CSI(0,1,2) + CSI(0,1,6) + CSI(0,2,6)

3(3 − 1)/2 3 1 1 = 4 + 4 + 4 = 0.41 < α

3 CSI(0,1,2) + CSI(0,1,3) + CSI(0,2,3)

3(3 − 1)/2 3 = 4 = 0.25 < α

3 We apply the same processes on the rest of the nodes and we obtain as a result: B= {( {g0, g1, g2}; {c′0, c′1, c′2, c′3, c′4} ) ; ({g3, g4, g6}; {c′0, c′1, c′2, c′3, c′4, c′5})}. Only the bigroups who have a score ASR >= β Will be selected. ({g0, g1, g2}; {c′0, c′1, c′2, c′3, c′4}) > β and ({g3, g4, g6}; {c′0, c′1, c′2, c′3, c′4, c′5})< β. Finally, we obtain: B= {({g0, g1, g2}; {c′0, c′1, c′2, c′3, c′4})}. 3

BicOPT

The BicFinder algorithm has shown better performance compared to other bicluster algorithms [1]. The results obtained prompted us to study and improve this algorithm. The BicFinder algorithm resulted in better performance compared to other bicluster algorithms [1]. These results present a motivation for us to study and improve this algorithm. The temporal complexity of the extraction step is O (n⁵,m) [1], which is rather complex. The second and third equations show that for a single node the minimum complexity time for the extraction step is O (n²m) but we need to browse the whole data file so we have as a time of minimal complexity O (n3m). Our main algorithm is divided into five steps (see algorithm 2): • Discretization • Construction of DAG • Extraction of bigroupes • Evaluation of bigroupes • Results Visualization Algorithm 2. Main program 1: F : File // Initial file 2: M : Integer // Number of columns 3: N : Integer // Number of rows 4: α, β,Ɣ : Integer // parameter 5: Mat : matrix // Matrix after discretization 6: T : Table // table of DAG 7: Begin 8: Mat =Discretization(F) ; 9:T=DagTree(Mat) ; 10:B=Extraction(T,Mat) ; 11:B=Evaluation(B); 12: Visualization(B); 13: End 3.1.2

Function “Extraction” The extraction function uses equation 3. We have already mentioned that this equation has a minimum complexity time which is equal to O (nᶟm), so we tried to implement this equation with a time of complexity less than O (n⁵m ) (See algorithm 3.).

Algorithm 3. Extraction(T ,Mat) 1: A,B,C,Arc : table 2: tab : table //Contains the detected arcs for a bicluster

3: nbrLine: integer //Number of rows for a

bicluster

4: Begin

5: For i :=0 to n-2 do // browse all the lines of the input file 6: Arc[]:= extractionEntier(T[i],’,’) ; // Arc[] : Table contains all the arcs of a selected node

The complexity time is calculated from the for loop nested of the extraction part so we have O (n⁴m + nᶟ + n²m) therefore: O (n⁴m + nᶟ + n²m) = O (n² (n²m + n + m)) = O (n² (n (1 + nm) + m)) 1 is negligible with respect to nm, (N² (n²m + nm)) = O (n² (nm (1 + n)))  we also have 1 is negligible with respect to n, so that O (n⁴m) is obtained as a time of final complexity. The columns of our bigroupe is calculated from the lines obtained from the ACSI function, so in the case of decreasing values of the threshold α, the number of rows increases and The design of the interface of our system offers the user in return the number of columns can decrease until no several advantages, of which we quote that it allows to column is obtained.

Hence the risk of losing this bigroup altogether. We used follow and to visualize all the steps of extraction of bigroupes and also allows him to determine the position of another parameter Ɣ that allows us to limit the number of bigroupe in the matrix of initial data. rows of a bigroup. For a first test, the value of Ɣ is equal to n (where n is the number of rows of the data file), we obtained six biclusters with an execution time equal to 1145660 ms. We changed the value of parameter Ɣ (number of genes generated) to 20 where we obtained ten biclusters with a run time of 6354 ms. So, the addition of this parameter allows the user to win in the execution time and in the number of biclusters obtained. 3.2.2

Display of biclusters

Unlike Command-row interface (CLI) systems that require commands to be stored, GUI systems offer a relatively intuitive approach. Even users without significant training can easily learn the system and use it to achieve their goals. 1 Available on http://arep.med.harvard.edu/biclustering/ If the curves are similar, we can deduce that our bigroup has a strong coherence. 4

Results

following format: The data file used by our program must be structured in the -15 … 125 The first column of the file must contain the name of the genes. The columns are separated with spaces. Each row contains the gene name followed by a set of gene expressions.

However, to test BicOPT's ability to extract different types of biclusters, we used a set of real files : Human B-cell Lymphoma dataset 1 with the size of (4026 rows, 96 columns) and Saccharomyces Cerevisiae dataset2 with the size of (2993 rows, 173 columns).

In order to compare the results obtained by BicOPT with the other algorithms we used the BicAT toolbox [3], this tool contains a set of bicluster algorithms (BiMax [ 11 ], CC [6], ISA [5], OPSM [4], xMotives [ 10 ]). The BicOPT algorithm and all other algorithms are run on an Intel Core I3 2.2 GHz machine and 4 Gb of RAM. The BicOPT settings are set to 0.8 for Alpha and 0.4 for Beta. The test execution time of our algorithm lasted 137.15 minutes. The first bigroup of size (8.57), which is presented by the curve (a) (see Figure 7) and the second bigroup of size (12, 49), is presented by the curve (b) Figure 7). We have associated a curve for each gene, the latter is created as a function of the level of gene expression.

The curves are grouped together in the same frame in order to form an expression profile of a bicluster. According to curve (a) we have a strong gene expression profile with the presence of some noises. In the second curve (b) we also observe a strong gene expression profile but without the presence of noises. The presence of noise in the data file is invaluable in the groups detected by BicOPT. 4.2

Saccharomyces Cerevisiae dataset

After testing a set of simulations, the BicOPT parameters were set to 0.85 for Alpha and 0.7 for Beta. The test execution time of our algorithm is 54.5 minutes. To evaluate the biological relevance of bigroupes detected by our algorithm, we used two web tools, GOTermFinder 3 and FuncAssociate4, to calculate the p-value [13]. More than the p-value is lower more than the bigroup genes are consistent. 4.2.1

GOTermFinder :

GOTermFinder is a well-known web-tool which allows to check the quality of each detected group and to search for the significant terms of gene ontology shared by the selected gene groups. To identify the characteristics that the genes can have in common, we selected three random groups in a random way (see Table 7). We used the GOTermFinder tool to describe the most significant shared terms, respectively, for the biological process, molecular function, and cellular component (see Table 4). For the first bigroup of size (20,101) we have for example Cytoplasmic Translation (95.0%, 3.1E-28), so this bigroup is involved in Cyptoplasmic translation with a frequency of 95.0% (among 20 genes in the first bigroup 19 2 Available on http://people.ee.ethz.ch/~sop/bimax/ SupplementaryMaterial/Datasets/BiologicalValidation/data/s accharomyces/yeast_GOEnrichment,Gasch2000,2944x173.txt 3 Available on http://www.yeastgenome.org/cgibin/GO/goTermFinder.pl 4 Available on http://llama.mshri.on.ca/funcassociate/ belong To this process) and with a p-value equal to 3.1E-28 (very significant value). 4.2.2

FuncAssociate

FuncAssociate is a web application that helps to discover properties enriched in lists of genes or proteins that emerge from the experiment on a large scale [13]. The basic idea is to select 20 biclusters and then to determine whether the set of genes discovered by biclustering algorithms shows a significant enrichment compared to an annotation of genetic ontology (GO) or not (see Figure 8). For values associated with parameter p, BicOPT surpassed the other algorithms with a percentage of 100% followed by Opsm with a percentage 90% for p = 0.0001. The other algorithms also perform reasonably well. The experiments applied to the real data set used prove that our proposed algorithm can identify bigroups with high biological relevance. 5

Conclusion

In this paper, we propose the BicOPT algorithm which presents a new optimized version of the BicFinder algorithm. The complexity time of our algorithm is equal to O (n⁴m), which is less than that of BicFinder. BicOPT allows the extraction and the production of a set of biclusters based on the construction of an acyclic oriented graph (DAG). We added a new GAMMA parameter to limit the gene numbers of each generated bigroup. BicOPT has a graphical interface allowing to manage well the obtained bigroupes We realized different tests on real databases to evaluate the performance of BicOPT. In the realization of this study, we used two web tools GOTermFinder, FuncAssociate and a BicAT application. The experimental study of our approach to biclustering have good results. [1]

S. C. Madeira, A. L. Oliveira, Biclustering Algorithms for Biological Data Analysis: A Survey, IEEE Transactions on Computational

Biology and Bioinformatics , Vol. 1 , N° 1 : ( 2004 ), p24 - 45 .

[10] S. K. T. M. Murali , Extracting conserved gene expression motifs from gene expression data , Pac. Symp. Biocomput , Vol. 8 : ( 2003 ), p77 - 88 .

[11] [12] [13]

Prelic ,

Bleuler ,

Zimmermann , P.

Buhlmann , W.

Gruissem , L.

Hennig , L.

Thiele , and E.

Zitzler . A systematic comparison and evaluation of biclustering methods for gene expression data . Bioinformatics , Vol. 22 , N° 9 : ( 2006 ), p1122 -1129

Y.S.

Son and

Baek . A modified correlation coefficient based similarity measure for clustering time-course gene expression data .

Pattern

Recognition

Letters , Vol. 29 , N° 3 : ( 2008 ), p232 - 242 .

70, N° 2 : ( 2016 ), p129 - 133 .