BicOPT:Biochips Data Clustering algorithm Faouzi Mhamdi1,2 Ahmed Zammali1 faouzi.mhamdi@ensi.rnu.tn zammaliahmad@gmail.com, 1Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) National Superior School of Engineers of Tunis (ENSIT), University of Tunis, Tunisia 2Hihger Institute of Applied Language and Computer Science of Beja, University of Jendouba, Tunisia Abstract matrix M where n is the number of rows and m is the number of columns. A bigroupe B is a set of pairs (I, J), with Biochips present a new technology that allows to I is a subset of rows of M and J is a subset of M columns, all analyze the level of expression of genes, among the of these subassemblies have a sub matrix called bigroupe. techniques that are applicable on this technology is The aim of the clustering algorithms is to produce a the biclustering. The main objective of the latter is coherent, stable and homogeneous bigroup. The to extract groups of genes taking into account the homogeneity criteria vary from one algorithm to another. coherence between all the conditions that Generally, the biclustering problem is NP-difficult. We then characterize them. There are a variety of used heuristic algorithms to construct biclusters close to the biclustering algorithms that have already been optimal. The problem of biclustering can be formulated as proposed in the field of biochips. Each of these following [2]: algorithms differs from the others by a set of characteristics. In this paper, we focus on the 𝑓(π΅π‘œπ‘π‘‘ ) = max 𝑓(𝐡) (1) BicFinder algorithm, where we propose to make improvements in order to make it faster. In the first place, we will present a fast variant of this with β€’ BοƒŽBC(M) algorithm. Then we will present our version of β€’ f is an objective function measuring the algorithm named BicOPT followed by a set of quality i.e., the degree of coherence, of a experiments applied to real data. group of bigroupes. Keywords biochips, biclustering, BicFinder, β€’ BC(M) : is the set of all groups of possible BicOPT, Evaluation functions, experimental study. bigroupes associated with M 1 Introduction Madeira et Oliveira [9] propose to classify the algorithms of In a data matrix, we can find links between the set of rows biclustering according to the approaches used for their or between the set of columns, or between the set of rows construction. These approaches are classified according to and columns simultaneously. A technique called clustering five categories [7]: IRCCC (Iterative Row and Column only allows us to detect the first and second cases. So, this Clustering Combination), DC (Divide and Conquer), GIS technique remains too simplistic to determine the third (Greedy Iterative Search), EBE (Exhaustive Bicluster case. Another more interesting technique, called Enumeration) et DPI (Distribution Parameter simultaneous classification, cross-classification or block Identification). BicOPT is based on the BicFinder algorithm classification. It is also referred to as a biclustering [8,9] following the Greedy Iterative Search approach of a hence the objective of this approach is to extract the groups polynomial complexity O(n⁴m). So, in this paper we will of rows while taking into account the consistency with all present in the first place the BicFinder algorithm. In the the columns. This technique can be used in several fields, second place, we will detail our BicOPT contributions and among which we mention that of Bio-chips. we will pass to the illustrations of the experimental study of our approach, we will end with a conclusion. The input file for a biclustering algorithm of biochip data is a data matrix, where the rows are filled by the names of the genes and the columns are the conditions. So, let a data 2 BicFinder 1 𝑖𝑓 𝑀[𝑖, 𝑙] < 𝑀[𝑖, 𝑙 + 1] (2) BicFinder is a systematic greedy algorithm, its polynomial 𝑀′[𝑖, 𝑙] {βˆ’1 𝑖𝑓 𝑀[𝑖, 𝑙] > 𝑀[𝑖, 𝑙 + 1] complexity is equal to O(n⁡m), based on the construction of 0 𝑖𝑓 𝑀[𝑖, 𝑙] = 𝑀[𝑖, 𝑙 + 1] an acyclic directed graph (DAG). BicFinder allows to extract and produce a set of bigroupes close to what a biologist can With iοƒŽ[1, 𝑛] and lοƒŽ[1. . π‘š βˆ’ 1] do by looking for the maximum homogeneous zones. The The discretization allows us to know the shape of the gene stage of generation of bigroupes passes through 4 essential expression profile (which can be either monotonically steps first of all the discretization of matrix M in M' (see increasing or monotonically decreasing ...). equation 1), then the construction of DAG from M', then the extraction by applying the function ACSI (see equation 2) 2.2 Construction of DAG and validation using the ASR function (see equation 3). Our graph is associated with the matrix M ', where each Algorithm 1. BicFinder [1] node nα΅’ has a gene gα΅’. Two nodes nα΅’ and nβ±Ό are connected by 1: Input: M, Ξ±, Ξ² ; Output: B an arc if and only if (i> j). CSl α΅’, β±Ό is assigned for each arc (nα΅’, 2: Discretize M using Equation 7 to obtain M'// nβ±Ό). Step of discretization 3: Build the DAG associated with M'// Construction Step 4: B = Ø // Extraction step 5: For any nα΅’ in the DAG do 6: Iβ€²α΅’=Ø; Jβ€²α΅’=Ø; // Bi = (Iβ€²α΅’ , Jβ€²α΅’) 7: Sort arcs of nα΅’ in decreasing order according to the number of true 8: For any edge (nα΅’,nᡏ) do 9: Ic=Iβ€²α΅’ U {gα΅’,gk}; Jc=Jβ€²α΅’ ᴜ {cl,cl+1 with T(Mβ€²[i, l] = Mβ€²[k, l]) = true}; Figure 2. Example of DAG 10: If ACSIα΅’(Ic, Jc) >= Ξ± then Bα΅’ = (Ic, Jc) 11: End 2.3 Extraction: ACSI 12: B = B U Bi Is a extraction function based on Concordance Index (CI) 13: End [12]. To calculate ACSI, the CSI function must be calculated 14: For any bigroupe Bi = (Iβ€²i , Jβ€²i) in B do // for each arc of the graph (Dag) (see equation 3). Selection step 15: If ASR(Iβ€²i , Jβ€²i) < Ξ² then B = B\Bi 𝐢𝑆𝐼(𝑖, 𝑗, π‘˜) (3) 16: End βˆ‘π‘šβˆ’1 β€² β€² β€² 𝑖=1 𝑇(𝑀 [𝑖, 𝑙] = 𝑀 [𝑗, 𝑙] = 𝑀 [π‘˜, 𝑙]) 17: Return B = MaxCSLα΅’ Group extraction processes are subdivided into four main steps (see Figure 1). with iοƒŽ [1. . n βˆ’ 2], jοƒŽ[2. . n βˆ’ 1], kοƒŽ[3. . n], 1οƒŽ[1. . m βˆ’ 1]and i < j < k 𝐴𝐢𝑆𝐼ᡒ(𝐼′, 𝐽′) (4) βˆ‘π‘—βˆˆπΌ;𝑗β‰₯𝑖+1 βˆ‘π‘˜βˆˆπΌ;π‘˜β‰₯𝑖+1 𝐢𝑆𝐼(𝑖, 𝑗, π‘˜) =2βˆ— |Iβ€²β€²|(|Iβ€²β€²| βˆ’ 1) Our bigroup starts with an initial arc (MaxCSL α΅’, β±Ό) and at each iteration we add an arc if and only if ACSIα΅’ (I ', J')> = Ξ± Figure 1.BicFinder algorithm process otherwise we pass to the next arc. 2.1 Discretization 2.4 Evaluation: ASR To compute ACSI, we must first discretize the initial matrix The last step used is the evaluation of bigroupes generated M (I, J), I = {1, 2, ..., n} and J = {1, 2, ..., m} Matrix M '(see by applying ASR function. equation 7). 𝐴𝑆𝑅(𝐼′ , 𝐽′ ) (5) βˆ‘π‘–βˆˆπΌβ€² βˆ‘π‘—βˆˆπΌβ€²;𝑗β‰₯𝑖+1 𝑝𝑖𝑗 βˆ‘π‘˜βˆˆπ½β€² βˆ‘π‘™βˆˆπ½β€²;𝑙β‰₯π‘˜+1 𝑝𝑖𝑗 = 2 max { , } |𝐼′ |(|𝐼′ | βˆ’ 1) |𝐽′|(|𝐽′| βˆ’ 1) with 6 βˆ‘π‘š 𝑖 𝑖 𝑗 π‘˜=1(π‘Ÿπ‘˜ (π‘₯π‘˜ ) βˆ’ π‘Ÿπ‘˜ (π‘₯π‘˜ )) 𝑗 2 (6) pα΅’β±Ό = 1 βˆ’ π‘š(π‘š2 βˆ’ 1) A bigroup is valid if its ASR> = Ξ². 2.5 Clustering: K-medoids Figure 3. DAG associated with the matrix M ' After the presentation of the algorithm and the explanation For the first node g0 we have CSL (g0) = {(b), (a), (e), (d), (f), of its operating principle, we describe, in this section, the (c)}. So we take the first two arcs "b" and "a" BicFinder process using an illustrative example. CSI(0,1,2) 3/4 So, we fix the parameter Ξ± which controls the extraction and 𝐴𝐢𝑆𝐼𝑔0 (𝑏, π‘Ž) = = = 0.75 We have ACSIg0 (b, 2(2βˆ’1)/2 1 addition of the arc and the parameter Ξ² which controls the a) = Ξ± so we add the arc "e" validation of bigroupes. Let the parameters Ξ± = 0.75, Ξ² = 0.5. CSI(0,1,2) + CSI(0,1,5) + CSI(0,2,5) 𝐴𝐢𝑆𝐼𝑔0 (𝑏, π‘Ž, 𝑒) = Table I. Data Matrix M 3(3 βˆ’ 1)/2 3 2 2 + + C0 C1 C2 C3 C4 C5 = 4 4 4 = 0.58 < Ξ± 3 g0 13 7 5 20 10 -5 g1 15 10 20 30 -2 15 CSI(0,1,2) + CSI(0,1,4) + CSI(0,2,4) 𝐴𝐢𝑆𝐼𝑔0 (𝑏, π‘Ž, 𝑑) = g2 15 9 8 20 10 10 3(3 βˆ’ 1)/2 g3 3 8 10 9 15 4 3 1 1 + + g4 13 15 17 8 3 1 = 4 4 4 = 0.41 < Ξ± 3 g5 20 8 12 25 27 1 CSI(0,1,2) + CSI(0,1,6) + CSI(0,2,6) g6 13 15 17 8 3 1 𝐴𝐢𝑆𝐼𝑔0 (𝑏, π‘Ž, 𝑓) = 3(3 βˆ’ 1)/2 3 1 1 + + Table II. Matrix M' after discretization = 4 4 4 = 0.41 < Ξ± 3 C0 C1 C2 C3 C4 CSI(0,1,2) + CSI(0,1,3) + CSI(0,2,3) g0 -1 -1 1 -1 -1 𝐴𝐢𝑆𝐼𝑔0 (𝑏, π‘Ž, 𝑐) = 3(3 βˆ’ 1)/2 g1 -1 1 1 -1 1 3 g2 -1 -1 1 -1 0 = 4 = 0.25 < Ξ± g3 1 1 -1 1 -1 3 g4 1 1 -1 -1 -1 We apply the same processes on the rest of the nodes and g5 -1 1 1 1 -1 we obtain as a result: B= g6 1 1 -1 -1 -1 {( {g0, g1, g2}; {c β€² 0, c β€² 1, c β€² 2, c β€² 3, cβ€²4} ) ; ({g3, g4, g6}; {cβ€²0, c 1, c 2, c 3, c 4, c 5})}. Only the bigroups β€² β€² β€² β€² β€² The DAG is constructed from the matrix M '. The arcs are who have a score ASR >= Ξ² Will be selected. sorted in decreasing order relative to the weight associated 𝐴𝑆𝑅({g0, g1, g2}; {c β€² 0, c β€² 1, c β€² 2, c β€² 3, c β€² 4}) > Ξ² and with each edge (with the weight equal to the sum of true). 𝐴𝑆𝑅({g3, g4, g6}; {c β€² 0, c β€² 1, c β€² 2, c β€² 3, c β€² 4, c β€² 5})< Ξ². Finally, we obtain: B= {({g0, g1, g2}; {c β€² 0, c β€² 1, c β€² 2, c β€² 3, cβ€²4})}. 3 BicOPT The BicFinder algorithm has shown better performance compared to other bicluster algorithms [1]. The results obtained prompted us to study and improve this algorithm. 3.1 Optimization 7: nbrLine:=0 ; The BicFinder algorithm resulted in better performance 8: Tab[0]:=arc[0] ; 9: int j:=1 ; compared to other bicluster algorithms [1]. These results 10: While (j< arc.length et x<=Ζ”) //Ζ” Is the present a motivation for us to study and improve this maximum number of rows in a bigroup algorithm. 11: nbrline++ ; 12: Tab[nbrLine]:=arc[j] ; 3.1.1 Main Program 13: Eval:=0 The temporal complexity of the extraction step is O (n⁡,m) 14: For k := 0 to nbrLine do // the [1], which is rather complex. The second and third maximum number of rows in a bigroup = n equations show that for a single node the minimum 15: For z := k+1 to nbrLine do complexity time for the extraction step is O (nΒ²m) but we 16: A[]:=extractionEntier(Tab[k]) ; need to browse the whole data file so we have as a time of 17: B[]:=extractionEntier(Tab[z]) ; minimal complexity O (n3m). Our main algorithm is divided 18: C[]:=extractionEntier(Tab[0]) ; into five steps (see algorithm 2): 19: Eval:=eval+csi(A[1],A[2],B[2])/c[0] ; // β€’ Discretization Calculate the CSI function and CSI complexity = β€’ Construction of DAG O(m) β€’ Extraction of bigroupes 20: End β€’ Evaluation of bigroupes 21: End 22: If (eval/((nbrLine+1)*nbrline)/2