=Paper= {{Paper |id=Vol-2144/paper1 |storemode=property |title=BicOPT:Biochips Data Clustering Algorithm |pdfUrl=https://ceur-ws.org/Vol-2144/paper1.pdf |volume=Vol-2144 |authors=Faouzi Mhamdi,Ahmed Zammali }} ==BicOPT:Biochips Data Clustering Algorithm== https://ceur-ws.org/Vol-2144/paper1.pdf
                      BicOPT:Biochips Data Clustering algorithm


                             Faouzi Mhamdi1,2                        Ahmed Zammali1
                        faouzi.mhamdi@ensi.rnu.tn                zammaliahmad@gmail.com,

 1Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE)

         National Superior School of Engineers of Tunis (ENSIT), University of Tunis, Tunisia
 2Hihger Institute of Applied Language and Computer Science of Beja, University of Jendouba, Tunisia



                         Abstract                                 matrix M where n is the number of rows and m is the
                                                                  number of columns. A bigroupe B is a set of pairs (I, J), with
    Biochips present a new technology that allows to              I is a subset of rows of M and J is a subset of M columns, all
    analyze the level of expression of genes, among the           of these subassemblies have a sub matrix called bigroupe.
    techniques that are applicable on this technology is          The aim of the clustering algorithms is to produce a
    the biclustering. The main objective of the latter is         coherent, stable and homogeneous bigroup. The
    to extract groups of genes taking into account the            homogeneity criteria vary from one algorithm to another.
    coherence between all the conditions that                     Generally, the biclustering problem is NP-difficult. We then
    characterize them. There are a variety of                     used heuristic algorithms to construct biclusters close to the
    biclustering algorithms that have already been                optimal. The problem of biclustering can be formulated as
    proposed in the field of biochips. Each of these              following [2]:
    algorithms differs from the others by a set of
    characteristics. In this paper, we focus on the                              𝑓(𝐵𝑜𝑝𝑡 ) = max 𝑓(𝐵)                     (1)
    BicFinder algorithm, where we propose to make
    improvements in order to make it faster. In the first
    place, we will present a fast variant of this
                                                                   with   •     BBC(M)
    algorithm. Then we will present our version of                        •     f is an objective function measuring the
    algorithm named BicOPT followed by a set of                                 quality i.e., the degree of coherence, of a
    experiments applied to real data.                                           group of bigroupes.

    Keywords biochips, biclustering, BicFinder,                           •     BC(M) : is the set of all groups of possible
    BicOPT, Evaluation functions, experimental study.                           bigroupes associated with M


1    Introduction                                                 Madeira et Oliveira [9] propose to classify the algorithms of
In a data matrix, we can find links between the set of rows       biclustering according to the approaches used for their
or between the set of columns, or between the set of rows         construction. These approaches are classified according to
and columns simultaneously. A technique called clustering         five categories [7]: IRCCC (Iterative Row and Column
only allows us to detect the first and second cases. So, this     Clustering Combination), DC (Divide and Conquer), GIS
technique remains too simplistic to determine the third           (Greedy Iterative Search), EBE (Exhaustive Bicluster
case. Another more interesting technique, called                  Enumeration)        et    DPI     (Distribution    Parameter
simultaneous classification, cross-classification or block        Identification). BicOPT is based on the BicFinder algorithm
classification. It is also referred to as a biclustering [8,9]    following the Greedy Iterative Search approach of a
hence the objective of this approach is to extract the groups     polynomial complexity O(n⁴m). So, in this paper we will
of rows while taking into account the consistency with all        present in the first place the BicFinder algorithm. In the
the columns. This technique can be used in several fields,        second place, we will detail our BicOPT contributions and
among which we mention that of Bio-chips.                         we will pass to the illustrations of the experimental study of
                                                                  our approach, we will end with a conclusion.
The input file for a biclustering algorithm of biochip data is
a data matrix, where the rows are filled by the names of the
genes and the columns are the conditions. So, let a data
2     BicFinder                                                                           1 𝑖𝑓 𝑀[𝑖, 𝑙] < 𝑀[𝑖, 𝑙 + 1]               (2)
BicFinder is a systematic greedy algorithm, its polynomial                     𝑀′[𝑖, 𝑙] {−1 𝑖𝑓 𝑀[𝑖, 𝑙] > 𝑀[𝑖, 𝑙 + 1]
complexity is equal to O(n⁵m), based on the construction of                               0 𝑖𝑓 𝑀[𝑖, 𝑙] = 𝑀[𝑖, 𝑙 + 1]
an acyclic directed graph (DAG). BicFinder allows to extract
and produce a set of bigroupes close to what a biologist can         With i[1, 𝑛] and l[1. . 𝑚 − 1]
do by looking for the maximum homogeneous zones. The                 The discretization allows us to know the shape of the gene
stage of generation of bigroupes passes through 4 essential          expression profile (which can be either monotonically
steps first of all the discretization of matrix M in M' (see         increasing or monotonically decreasing ...).
equation 1), then the construction of DAG from M', then the
extraction by applying the function ACSI (see equation 2)            2.2    Construction of DAG
and validation using the ASR function (see equation 3).
                                                                     Our graph is associated with the matrix M ', where each
      Algorithm 1. BicFinder [1]                                     node nᵢ has a gene gᵢ. Two nodes nᵢ and nⱼ are connected by
      1: Input: M, α, β ; Output: B                                  an arc if and only if (i> j). CSl ᵢ, ⱼ is assigned for each arc (nᵢ,
      2: Discretize M using Equation 7 to obtain M'//                nⱼ).
      Step of discretization
      3: Build the DAG associated with M'//
      Construction Step
      4: B = Ø // Extraction step
      5: For any nᵢ in the DAG do
      6: I′ᵢ=Ø; J′ᵢ=Ø; // Bi = (I′ᵢ , J′ᵢ)
      7: Sort arcs of nᵢ in decreasing order according to
      the number of true
      8: For any edge (nᵢ,nᵏ) do
      9: Ic=I′ᵢ U {gᵢ,gk}; Jc=J′ᵢ ᴜ {cl,cl+1 with T(M′[i, l] =
      M′[k, l]) = true};                                                                Figure 2. Example of DAG
      10: If ACSIᵢ(Ic, Jc) >= α then Bᵢ = (Ic, Jc)
      11: End                                                        2.3    Extraction: ACSI
      12: B = B U Bi                                                 Is a extraction function based on Concordance Index (CI)
      13: End                                                        [12]. To calculate ACSI, the CSI function must be calculated
      14: For any bigroupe Bi = (I′i , J′i) in B do //
                                                                     for each arc of the graph (Dag) (see equation 3).
      Selection step
      15: If ASR(I′i , J′i) < β then B = B\Bi                              𝐶𝑆𝐼(𝑖, 𝑗, 𝑘)                                            (3)
      16: End                                                                ∑𝑚−1       ′       ′          ′
                                                                              𝑖=1 𝑇(𝑀 [𝑖, 𝑙] = 𝑀 [𝑗, 𝑙] = 𝑀 [𝑘, 𝑙])
      17: Return B                                                         =
                                                                                           MaxCSLᵢ

   Group extraction processes are subdivided into four main
steps (see Figure 1).
                                                                     with i [1. . n − 2], j[2. . n − 1], k[3. . n], 1[1. . m
                                                                                        − 1]and i < j < k


                                                                             𝐴𝐶𝑆𝐼ᵢ(𝐼′, 𝐽′)                                         (4)
                                                                                  ∑𝑗∈𝐼;𝑗≥𝑖+1 ∑𝑘∈𝐼;𝑘≥𝑖+1 𝐶𝑆𝐼(𝑖, 𝑗, 𝑘)
                                                                             =2∗
                                                                                           |I′′|(|I′′| − 1)


                                                                     Our bigroup starts with an initial arc (MaxCSL ᵢ, ⱼ) and at
                                                                     each iteration we add an arc if and only if ACSIᵢ (I ', J')> = α
             Figure 1.BicFinder algorithm process                    otherwise we pass to the next arc.

2.1     Discretization                                               2.4    Evaluation: ASR

To compute ACSI, we must first discretize the initial matrix         The last step used is the evaluation of bigroupes generated
M (I, J), I = {1, 2, ..., n} and J = {1, 2, ..., m} Matrix M '(see   by applying ASR function.
equation 7).
  𝐴𝑆𝑅(𝐼′ , 𝐽′ )                                               (5)
              ∑𝑖∈𝐼′ ∑𝑗∈𝐼′;𝑗≥𝑖+1 𝑝𝑖𝑗 ∑𝑘∈𝐽′ ∑𝑙∈𝐽′;𝑙≥𝑘+1 𝑝𝑖𝑗
  = 2 max {                        ,                      }
                 |𝐼′ |(|𝐼′ | − 1)      |𝐽′|(|𝐽′| − 1)



 with                      6 ∑𝑚    𝑖   𝑖          𝑗
                              𝑘=1(𝑟𝑘 (𝑥𝑘 ) − 𝑟𝑘 (𝑥𝑘 ))
                                                       𝑗
                                                       2      (6)
               pᵢⱼ = 1 −
                                  𝑚(𝑚2 − 1)

   A bigroup is valid if its ASR> = β.

2.5     Clustering: K-medoids                                                 Figure 3. DAG associated with the matrix M '
After the presentation of the algorithm and the explanation
                                                                    For the first node g0 we have CSL (g0) = {(b), (a), (e), (d), (f),
of its operating principle, we describe, in this section, the
                                                                    (c)}. So we take the first two arcs "b" and "a"
BicFinder process using an illustrative example.
                                                                                          CSI(0,1,2)       3/4
So, we fix the parameter α which controls the extraction and        𝐴𝐶𝑆𝐼𝑔0 (𝑏, 𝑎) =                    =         = 0.75 We have ACSIg0 (b,
                                                                                          2(2−1)/2         1
addition of the arc and the parameter β which controls the
                                                                    a) = α so we add the arc "e"
validation of bigroupes. Let the parameters α = 0.75, β = 0.5.
                                                                                           CSI(0,1,2) + CSI(0,1,5) + CSI(0,2,5)
                                                                        𝐴𝐶𝑆𝐼𝑔0 (𝑏, 𝑎, 𝑒) =
                      Table I. Data Matrix M                                                           3(3 − 1)/2
                                                                                           3 2 2
                                                                                              + +
          C0         C1         C2        C3          C4      C5                         = 4 4 4 = 0.58 < α
                                                                                               3
 g0       13         7          5         20          10      -5
 g1       15         10         20        30          -2      15                           CSI(0,1,2) + CSI(0,1,4) + CSI(0,2,4)
                                                                        𝐴𝐶𝑆𝐼𝑔0 (𝑏, 𝑎, 𝑑) =
 g2       15         9          8         20          10      10                                       3(3 − 1)/2
 g3       3          8          10        9           15      4                            3 1 1
                                                                                              + +
 g4       13         15         17        8           3       1                          = 4 4 4 = 0.41 < α
                                                                                               3
 g5       20         8          12        25          27      1
                                                                                           CSI(0,1,2) + CSI(0,1,6) + CSI(0,2,6)
 g6       13         15         17        8           3       1         𝐴𝐶𝑆𝐼𝑔0 (𝑏, 𝑎, 𝑓) =
                                                                                                       3(3 − 1)/2
                                                                                           3 1 1
                                                                                              + +
            Table II. Matrix M' after discretization                                     = 4 4 4 = 0.41 < α
                                                                                               3
                      C0      C1     C2      C3       C4                                   CSI(0,1,2) + CSI(0,1,3) + CSI(0,2,3)
               g0     -1      -1     1       -1       -1                𝐴𝐶𝑆𝐼𝑔0 (𝑏, 𝑎, 𝑐) =
                                                                                                       3(3 − 1)/2
               g1     -1      1      1       -1       1                                     3
               g2     -1      -1     1       -1       0
                                                                                         = 4 = 0.25 < α
               g3     1       1      -1      1        -1                                    3
               g4     1       1      -1      -1       -1            We apply the same processes on the rest of the nodes and
               g5     -1      1      1       1        -1            we          obtain            as              a             result:        B=
               g6     1       1      -1      -1       -1            {(         {g0, g1, g2}; {c ′ 0, c ′ 1, c ′ 2, c ′ 3, c′4}          )        ;
                                                                    ({g3, g4, g6};  {c′0, c 1, c 2, c 3, c 4, c 5})}. Only the bigroups
                                                                                           ′    ′     ′      ′      ′

The DAG is constructed from the matrix M '. The arcs are            who have a score ASR >= β Will be selected.
sorted in decreasing order relative to the weight associated        𝐴𝑆𝑅({g0, g1, g2}; {c ′ 0, c ′ 1, c ′ 2, c ′ 3, c ′ 4}) >          β       and
with each edge (with the weight equal to the sum of true).          𝐴𝑆𝑅({g3, g4, g6}; {c ′ 0, c ′ 1, c ′ 2, c ′ 3, c ′ 4, c ′ 5})< β. Finally, we
                                                                    obtain: B= {({g0, g1, g2}; {c ′ 0, c ′ 1, c ′ 2, c ′ 3, c′4})}.

                                                                    3     BicOPT
                                                                    The BicFinder algorithm has shown better performance
                                                                    compared to other bicluster algorithms [1]. The results
                                                                    obtained prompted us to study and improve this algorithm.
3.1     Optimization                                                  7:    nbrLine:=0 ;
The BicFinder algorithm resulted in better performance                8:    Tab[0]:=arc[0] ;
                                                                      9:    int j:=1 ;
compared to other bicluster algorithms [1]. These results
                                                                      10: While (j< arc.length et x<=Ɣ) //Ɣ Is the
present a motivation for us to study and improve this
                                                                      maximum number of rows in a bigroup
algorithm.
                                                                      11:         nbrline++ ;
                                                                      12:         Tab[nbrLine]:=arc[j] ;
3.1.1     Main Program
                                                                      13:         Eval:=0
The temporal complexity of the extraction step is O (n⁵,m)            14:         For k := 0 to nbrLine do // the
[1], which is rather complex. The second and third                    maximum number of rows in a bigroup = n
equations show that for a single node the minimum                     15:              For z := k+1 to nbrLine do
complexity time for the extraction step is O (n²m) but we             16:                  A[]:=extractionEntier(Tab[k]) ;
need to browse the whole data file so we have as a time of            17:                  B[]:=extractionEntier(Tab[z]) ;
minimal complexity O (n3m). Our main algorithm is divided             18:                  C[]:=extractionEntier(Tab[0]) ;
into five steps (see algorithm 2):                                    19:
                                                                      Eval:=eval+csi(A[1],A[2],B[2])/c[0] ; //
     •        Discretization
                                                                      Calculate the CSI function and CSI complexity =
     •        Construction of DAG                                     O(m)
     •        Extraction of bigroupes                                 20:             End
     •        Evaluation of bigroupes                                 21:         End
                                                                      22:     If (eval/((nbrLine+1)*nbrline)/2