=Paper=
{{Paper
|id=Vol-2144/paper1
|storemode=property
|title=BicOPT:Biochips Data Clustering Algorithm
|pdfUrl=https://ceur-ws.org/Vol-2144/paper1.pdf
|volume=Vol-2144
|authors=Faouzi Mhamdi,Ahmed Zammali
}}
==BicOPT:Biochips Data Clustering Algorithm==
BicOPT:Biochips Data Clustering algorithm Faouzi Mhamdi1,2 Ahmed Zammali1 faouzi.mhamdi@ensi.rnu.tn zammaliahmad@gmail.com, 1Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) National Superior School of Engineers of Tunis (ENSIT), University of Tunis, Tunisia 2Hihger Institute of Applied Language and Computer Science of Beja, University of Jendouba, Tunisia Abstract matrix M where n is the number of rows and m is the number of columns. A bigroupe B is a set of pairs (I, J), with Biochips present a new technology that allows to I is a subset of rows of M and J is a subset of M columns, all analyze the level of expression of genes, among the of these subassemblies have a sub matrix called bigroupe. techniques that are applicable on this technology is The aim of the clustering algorithms is to produce a the biclustering. The main objective of the latter is coherent, stable and homogeneous bigroup. The to extract groups of genes taking into account the homogeneity criteria vary from one algorithm to another. coherence between all the conditions that Generally, the biclustering problem is NP-difficult. We then characterize them. There are a variety of used heuristic algorithms to construct biclusters close to the biclustering algorithms that have already been optimal. The problem of biclustering can be formulated as proposed in the field of biochips. Each of these following [2]: algorithms differs from the others by a set of characteristics. In this paper, we focus on the π(π΅πππ‘ ) = max π(π΅) (1) BicFinder algorithm, where we propose to make improvements in order to make it faster. In the first place, we will present a fast variant of this with β’ BοBC(M) algorithm. Then we will present our version of β’ f is an objective function measuring the algorithm named BicOPT followed by a set of quality i.e., the degree of coherence, of a experiments applied to real data. group of bigroupes. Keywords biochips, biclustering, BicFinder, β’ BC(M) : is the set of all groups of possible BicOPT, Evaluation functions, experimental study. bigroupes associated with M 1 Introduction Madeira et Oliveira [9] propose to classify the algorithms of In a data matrix, we can find links between the set of rows biclustering according to the approaches used for their or between the set of columns, or between the set of rows construction. These approaches are classified according to and columns simultaneously. A technique called clustering five categories [7]: IRCCC (Iterative Row and Column only allows us to detect the first and second cases. So, this Clustering Combination), DC (Divide and Conquer), GIS technique remains too simplistic to determine the third (Greedy Iterative Search), EBE (Exhaustive Bicluster case. Another more interesting technique, called Enumeration) et DPI (Distribution Parameter simultaneous classification, cross-classification or block Identification). BicOPT is based on the BicFinder algorithm classification. It is also referred to as a biclustering [8,9] following the Greedy Iterative Search approach of a hence the objective of this approach is to extract the groups polynomial complexity O(nβ΄m). So, in this paper we will of rows while taking into account the consistency with all present in the first place the BicFinder algorithm. In the the columns. This technique can be used in several fields, second place, we will detail our BicOPT contributions and among which we mention that of Bio-chips. we will pass to the illustrations of the experimental study of our approach, we will end with a conclusion. The input file for a biclustering algorithm of biochip data is a data matrix, where the rows are filled by the names of the genes and the columns are the conditions. So, let a data 2 BicFinder 1 ππ π[π, π] < π[π, π + 1] (2) BicFinder is a systematic greedy algorithm, its polynomial πβ²[π, π] {β1 ππ π[π, π] > π[π, π + 1] complexity is equal to O(nβ΅m), based on the construction of 0 ππ π[π, π] = π[π, π + 1] an acyclic directed graph (DAG). BicFinder allows to extract and produce a set of bigroupes close to what a biologist can With iο[1, π] and lο[1. . π β 1] do by looking for the maximum homogeneous zones. The The discretization allows us to know the shape of the gene stage of generation of bigroupes passes through 4 essential expression profile (which can be either monotonically steps first of all the discretization of matrix M in M' (see increasing or monotonically decreasing ...). equation 1), then the construction of DAG from M', then the extraction by applying the function ACSI (see equation 2) 2.2 Construction of DAG and validation using the ASR function (see equation 3). Our graph is associated with the matrix M ', where each Algorithm 1. BicFinder [1] node nα΅’ has a gene gα΅’. Two nodes nα΅’ and nβ±Ό are connected by 1: Input: M, Ξ±, Ξ² ; Output: B an arc if and only if (i> j). CSl α΅’, β±Ό is assigned for each arc (nα΅’, 2: Discretize M using Equation 7 to obtain M'// nβ±Ό). Step of discretization 3: Build the DAG associated with M'// Construction Step 4: B = Γ // Extraction step 5: For any nα΅’ in the DAG do 6: Iβ²α΅’=Γ; Jβ²α΅’=Γ; // Bi = (Iβ²α΅’ , Jβ²α΅’) 7: Sort arcs of nα΅’ in decreasing order according to the number of true 8: For any edge (nα΅’,nα΅) do 9: Ic=Iβ²α΅’ U {gα΅’,gk}; Jc=Jβ²α΅’ α΄ {cl,cl+1 with T(Mβ²[i, l] = Mβ²[k, l]) = true}; Figure 2. Example of DAG 10: If ACSIα΅’(Ic, Jc) >= Ξ± then Bα΅’ = (Ic, Jc) 11: End 2.3 Extraction: ACSI 12: B = B U Bi Is a extraction function based on Concordance Index (CI) 13: End [12]. To calculate ACSI, the CSI function must be calculated 14: For any bigroupe Bi = (Iβ²i , Jβ²i) in B do // for each arc of the graph (Dag) (see equation 3). Selection step 15: If ASR(Iβ²i , Jβ²i) < Ξ² then B = B\Bi πΆππΌ(π, π, π) (3) 16: End βπβ1 β² β² β² π=1 π(π [π, π] = π [π, π] = π [π, π]) 17: Return B = MaxCSLα΅’ Group extraction processes are subdivided into four main steps (see Figure 1). with iο [1. . n β 2], jο[2. . n β 1], kο[3. . n], 1ο[1. . m β 1]and i < j < k π΄πΆππΌα΅’(πΌβ², π½β²) (4) βπβπΌ;πβ₯π+1 βπβπΌ;πβ₯π+1 πΆππΌ(π, π, π) =2β |Iβ²β²|(|Iβ²β²| β 1) Our bigroup starts with an initial arc (MaxCSL α΅’, β±Ό) and at each iteration we add an arc if and only if ACSIα΅’ (I ', J')> = Ξ± Figure 1.BicFinder algorithm process otherwise we pass to the next arc. 2.1 Discretization 2.4 Evaluation: ASR To compute ACSI, we must first discretize the initial matrix The last step used is the evaluation of bigroupes generated M (I, J), I = {1, 2, ..., n} and J = {1, 2, ..., m} Matrix M '(see by applying ASR function. equation 7). π΄ππ (πΌβ² , π½β² ) (5) βπβπΌβ² βπβπΌβ²;πβ₯π+1 πππ βπβπ½β² βπβπ½β²;πβ₯π+1 πππ = 2 max { , } |πΌβ² |(|πΌβ² | β 1) |π½β²|(|π½β²| β 1) with 6 βπ π π π π=1(ππ (π₯π ) β ππ (π₯π )) π 2 (6) pα΅’β±Ό = 1 β π(π2 β 1) A bigroup is valid if its ASR> = Ξ². 2.5 Clustering: K-medoids Figure 3. DAG associated with the matrix M ' After the presentation of the algorithm and the explanation For the first node g0 we have CSL (g0) = {(b), (a), (e), (d), (f), of its operating principle, we describe, in this section, the (c)}. So we take the first two arcs "b" and "a" BicFinder process using an illustrative example. CSI(0,1,2) 3/4 So, we fix the parameter Ξ± which controls the extraction and π΄πΆππΌπ0 (π, π) = = = 0.75 We have ACSIg0 (b, 2(2β1)/2 1 addition of the arc and the parameter Ξ² which controls the a) = Ξ± so we add the arc "e" validation of bigroupes. Let the parameters Ξ± = 0.75, Ξ² = 0.5. CSI(0,1,2) + CSI(0,1,5) + CSI(0,2,5) π΄πΆππΌπ0 (π, π, π) = Table I. Data Matrix M 3(3 β 1)/2 3 2 2 + + C0 C1 C2 C3 C4 C5 = 4 4 4 = 0.58 < Ξ± 3 g0 13 7 5 20 10 -5 g1 15 10 20 30 -2 15 CSI(0,1,2) + CSI(0,1,4) + CSI(0,2,4) π΄πΆππΌπ0 (π, π, π) = g2 15 9 8 20 10 10 3(3 β 1)/2 g3 3 8 10 9 15 4 3 1 1 + + g4 13 15 17 8 3 1 = 4 4 4 = 0.41 < Ξ± 3 g5 20 8 12 25 27 1 CSI(0,1,2) + CSI(0,1,6) + CSI(0,2,6) g6 13 15 17 8 3 1 π΄πΆππΌπ0 (π, π, π) = 3(3 β 1)/2 3 1 1 + + Table II. Matrix M' after discretization = 4 4 4 = 0.41 < Ξ± 3 C0 C1 C2 C3 C4 CSI(0,1,2) + CSI(0,1,3) + CSI(0,2,3) g0 -1 -1 1 -1 -1 π΄πΆππΌπ0 (π, π, π) = 3(3 β 1)/2 g1 -1 1 1 -1 1 3 g2 -1 -1 1 -1 0 = 4 = 0.25 < Ξ± g3 1 1 -1 1 -1 3 g4 1 1 -1 -1 -1 We apply the same processes on the rest of the nodes and g5 -1 1 1 1 -1 we obtain as a result: B= g6 1 1 -1 -1 -1 {( {g0, g1, g2}; {c β² 0, c β² 1, c β² 2, c β² 3, cβ²4} ) ; ({g3, g4, g6}; {cβ²0, c 1, c 2, c 3, c 4, c 5})}. Only the bigroups β² β² β² β² β² The DAG is constructed from the matrix M '. The arcs are who have a score ASR >= Ξ² Will be selected. sorted in decreasing order relative to the weight associated π΄ππ ({g0, g1, g2}; {c β² 0, c β² 1, c β² 2, c β² 3, c β² 4}) > Ξ² and with each edge (with the weight equal to the sum of true). π΄ππ ({g3, g4, g6}; {c β² 0, c β² 1, c β² 2, c β² 3, c β² 4, c β² 5})< Ξ². Finally, we obtain: B= {({g0, g1, g2}; {c β² 0, c β² 1, c β² 2, c β² 3, cβ²4})}. 3 BicOPT The BicFinder algorithm has shown better performance compared to other bicluster algorithms [1]. The results obtained prompted us to study and improve this algorithm. 3.1 Optimization 7: nbrLine:=0 ; The BicFinder algorithm resulted in better performance 8: Tab[0]:=arc[0] ; 9: int j:=1 ; compared to other bicluster algorithms [1]. These results 10: While (j< arc.length et x<=Ζ) //Ζ Is the present a motivation for us to study and improve this maximum number of rows in a bigroup algorithm. 11: nbrline++ ; 12: Tab[nbrLine]:=arc[j] ; 3.1.1 Main Program 13: Eval:=0 The temporal complexity of the extraction step is O (nβ΅,m) 14: For k := 0 to nbrLine do // the [1], which is rather complex. The second and third maximum number of rows in a bigroup = n equations show that for a single node the minimum 15: For z := k+1 to nbrLine do complexity time for the extraction step is O (nΒ²m) but we 16: A[]:=extractionEntier(Tab[k]) ; need to browse the whole data file so we have as a time of 17: B[]:=extractionEntier(Tab[z]) ; minimal complexity O (n3m). Our main algorithm is divided 18: C[]:=extractionEntier(Tab[0]) ; into five steps (see algorithm 2): 19: Eval:=eval+csi(A[1],A[2],B[2])/c[0] ; // β’ Discretization Calculate the CSI function and CSI complexity = β’ Construction of DAG O(m) β’ Extraction of bigroupes 20: End β’ Evaluation of bigroupes 21: End 22: If (eval/((nbrLine+1)*nbrline)/2