<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BicOPT:Biochips Data Clustering algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>1Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE)</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Superior School of Engineers of Tunis (ENSIT), University of Tunis, Tunisia, 2Hihger Institute of Applied Language and Computer Science of Beja, University of Jendouba</institution>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biochips present a new technology that allows to analyze the level of expression of genes, among the techniques that are applicable on this technology is the biclustering. The main objective of the latter is to extract groups of genes taking into account the coherence between all the conditions that characterize them. There are a variety of biclustering algorithms that have already been proposed in the field of biochips. Each of these algorithms differs from the others by a set of characteristics. In this paper, we focus on the BicFinder algorithm, where we propose to make improvements in order to make it faster. In the first place, we will present a fast variant of this algorithm. Then we will present our version of algorithm named BicOPT followed by a set of experiments applied to real data.</p>
      </abstract>
      <kwd-group>
        <kwd>biochips</kwd>
        <kwd>biclustering</kwd>
        <kwd>BicFinder</kwd>
        <kwd>BicOPT</kwd>
        <kwd>Evaluation functions</kwd>
        <kwd>experimental study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In a data matrix, we can find links between the set of rows
or between the set of columns, or between the set of rows
and columns simultaneously. A technique called clustering
only allows us to detect the first and second cases. So, this
technique remains too simplistic to determine the third
case. Another more interesting technique, called
simultaneous classification, cross-classification or block
classification. It is also referred to as a biclustering [8,9]
hence the objective of this approach is to extract the groups
of rows while taking into account the consistency with all
the columns. This technique can be used in several fields,
among which we mention that of Bio-chips.</p>
      <p>The input file for a biclustering algorithm of biochip data is
a data matrix, where the rows are filled by the names of the
genes and the columns are the conditions. So, let a data
•
•
•
matrix M where n is the number of rows and m is the
number of columns. A bigroupe B is a set of pairs (I, J), with
I is a subset of rows of M and J is a subset of M columns, all
of these subassemblies have a sub matrix called bigroupe.
The aim of the clustering algorithms is to produce a
coherent, stable and homogeneous bigroup. The
homogeneity criteria vary from one algorithm to another.
Generally, the biclustering problem is NP-difficult. We then
used heuristic algorithms to construct biclusters close to the
optimal. The problem of biclustering can be formulated as
following [2]:
 (  ) = max  ( )
(1)
with</p>
      <p>BBC(M)
f is an objective function measuring the
quality i.e., the degree of coherence, of a
group of bigroupes.</p>
      <p>BC(M) : is the set of all groups of possible
bigroupes associated with M
Madeira et Oliveira [9] propose to classify the algorithms of
biclustering according to the approaches used for their
construction. These approaches are classified according to
five categories [7]: IRCCC (Iterative Row and Column
Clustering Combination), DC (Divide and Conquer), GIS
(Greedy Iterative Search), EBE (Exhaustive Bicluster
Enumeration) et DPI (Distribution Parameter
Identification). BicOPT is based on the BicFinder algorithm
following the Greedy Iterative Search approach of a
polynomial complexity O(n⁴m). So, in this paper we will
present in the first place the BicFinder algorithm. In the
second place, we will detail our BicOPT contributions and
we will pass to the illustrations of the experimental study of
our approach, we will end with a conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>BicFinder</title>
      <p>BicFinder is a systematic greedy algorithm, its polynomial
complexity is equal to O(n⁵m), based on the construction of
an acyclic directed graph (DAG). BicFinder allows to extract
and produce a set of bigroupes close to what a biologist can
do by looking for the maximum homogeneous zones. The
stage of generation of bigroupes passes through 4 essential
steps first of all the discretization of matrix M in M' (see
equation 1), then the construction of DAG from M', then the
extraction by applying the function ACSI (see equation 2)
and validation using the ASR function (see equation 3).
2: Discretize M using Equation 7 to obtain M'//
nⱼ).</p>
      <sec id="sec-2-1">
        <title>Algorithm 1. BicFinder [1]</title>
        <p>1: Input: M, α, β ; Output: B</p>
      </sec>
      <sec id="sec-2-2">
        <title>Step of discretization</title>
        <p>3:
7: Sort arcs of nᵢ in decreasing order according to
9: Ic=I′ᵢ U {gᵢ,gk}; Jc=J′ᵢ ᴜ {cl,cl+1 with T(M′[i, l] =
M′[k, l]) = true};
10: If ACSIᵢ(Ic, Jc) &gt;= α then Bᵢ = (Ic, Jc)
equation 7).
1   [ ,  ] &lt;  [ ,  + 1]
(2)
 ′[ ,  ] {−1   [ ,  ] &gt;  [ ,  + 1]</p>
        <p>0   [ ,  ] =  [ ,  + 1]
With i[1,  ] and l[1. .  − 1]
The discretization allows us to know the shape of the gene
expression profile (which can be either monotonically
increasing or monotonically decreasing ...).
2.2</p>
      </sec>
      <sec id="sec-2-3">
        <title>Construction of DAG</title>
        <p>Our graph is associated with the matrix M ', where each
node nᵢ has a gene gᵢ. Two nodes nᵢ and nⱼ are connected by
an arc if and only if (i&gt; j). CSl ᵢ, ⱼ is assigned for each arc (nᵢ,</p>
        <p>ᵢ( ′,  ′)
= 2 ∗ ∑ ∈ ; ≥ +1 ∑ ∈ ; ≥ +1 
|I′′|(|I′′| − 1)
( ,  ,  )
(3)
(4)
Our bigroup starts with an initial arc (MaxCSL ᵢ, ⱼ) and at
each iteration we add an arc if and only if ACSIᵢ (I ', J')&gt; = α
otherwise we pass to the next arc.
2.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Evaluation: ASR</title>
        <p>The last step used is the evaluation of bigroupes generated
by applying ASR function.
g0
g1
g2
g3
g4
g5
g6
The DAG is constructed from the matrix M '. The arcs are
sorted in decreasing order relative to the weight associated
with each edge (with the weight equal to the sum of true).</p>
        <p>, ∑ ∈| ′′∑|( |∈ ′′|;−≥ 1+)1  }
pᵢⱼ = 1 −
6 ∑ =1(   (  ) −    (   ))2</p>
        <p>( 2 − 1)</p>
        <p>A bigroup is valid if its ASR&gt; = β.
2.5</p>
      </sec>
      <sec id="sec-2-5">
        <title>Clustering: K-medoids</title>
        <p>After the presentation of the algorithm and the explanation
of its operating principle, we describe, in this section, the
BicFinder process using an illustrative example.
So, we fix the parameter α which controls the extraction and
addition of the arc and the parameter β which controls the
validation of bigroupes. Let the parameters α = 0.75, β = 0.5.


0 ( ,  ,  ) =
0 ( ,  ,  ) =
0 ( ,  ,  ) =
0 ( ,  ,  ) =</p>
        <p>CSI(0,1,2) + CSI(0,1,5) + CSI(0,2,5)</p>
        <p>3(3 − 1)/2
3 2 2
= 4 + 4 + 4 = 0.58 &lt; α</p>
        <p>3
CSI(0,1,2) + CSI(0,1,4) + CSI(0,2,4)</p>
        <p>3(3 − 1)/2
3 1 1
= 4 + 4 + 4 = 0.41 &lt; α</p>
        <p>3
CSI(0,1,2) + CSI(0,1,6) + CSI(0,2,6)</p>
        <p>3(3 − 1)/2
3 1 1
= 4 + 4 + 4 = 0.41 &lt; α</p>
        <p>3
CSI(0,1,2) + CSI(0,1,3) + CSI(0,2,3)</p>
        <p>3(3 − 1)/2
3
= 4 = 0.25 &lt; α</p>
        <p>3
We apply the same processes on the rest of the nodes and
we obtain as a result: B=
{( {g0, g1, g2}; {c′0, c′1, c′2, c′3, c′4} ) ;
({g3, g4, g6}; {c′0, c′1, c′2, c′3, c′4, c′5})}. Only the bigroups
who have a score ASR &gt;= β Will be selected.
 ({g0, g1, g2}; {c′0, c′1, c′2, c′3, c′4}) &gt; β and
 ({g3, g4, g6}; {c′0, c′1, c′2, c′3, c′4, c′5})&lt; β. Finally, we
obtain: B= {({g0, g1, g2}; {c′0, c′1, c′2, c′3, c′4})}.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>BicOPT</title>
      <p>The BicFinder algorithm has shown better performance
compared to other bicluster algorithms [1]. The results
obtained prompted us to study and improve this algorithm.
The BicFinder algorithm resulted in better performance
compared to other bicluster algorithms [1]. These results
present a motivation for us to study and improve this
algorithm.
The temporal complexity of the extraction step is O (n⁵,m)
[1], which is rather complex. The second and third
equations show that for a single node the minimum
complexity time for the extraction step is O (n²m) but we
need to browse the whole data file so we have as a time of
minimal complexity O (n3m). Our main algorithm is divided
into five steps (see algorithm 2):
• Discretization
• Construction of DAG
• Extraction of bigroupes
• Evaluation of bigroupes
• Results Visualization
Algorithm 2. Main program
1: F : File // Initial file
2: M : Integer // Number of columns
3: N : Integer // Number of rows
4: α, β,Ɣ : Integer // parameter
5: Mat : matrix // Matrix after discretization
6: T : Table // table of DAG
7: Begin
8: Mat =Discretization(F) ;
9:T=DagTree(Mat) ;
10:B=Extraction(T,Mat) ;
11:B=Evaluation(B);
12: Visualization(B);
13: End
3.1.2</p>
      <p>Function “Extraction”
The extraction function uses equation 3. We have already
mentioned that this equation has a minimum complexity
time which is equal to O (nᶟm), so we tried to implement this
equation with a time of complexity less than O (n⁵m ) (See
algorithm 3.).</p>
      <p>Algorithm 3. Extraction(T ,Mat)
1: A,B,C,Arc : table
2: tab : table //Contains the detected arcs for a
bicluster</p>
      <sec id="sec-3-1">
        <title>3: nbrLine: integer //Number of rows for a</title>
        <p>bicluster</p>
      </sec>
      <sec id="sec-3-2">
        <title>4: Begin</title>
        <p>5: For i :=0 to n-2 do // browse all the lines of
the input file
6: Arc[]:= extractionEntier(T[i],’,’) ; // Arc[] :
Table contains all the arcs of a selected node</p>
        <p>The complexity time is calculated from the for loop nested
of the extraction part so we have O (n⁴m + nᶟ + n²m)
therefore:
O (n⁴m + nᶟ + n²m) = O (n² (n²m + n + m)) = O (n² (n (1 +
nm) + m)) 1 is negligible with respect to nm, (N² (n²m +
nm)) = O (n² (nm (1 + n)))  we also have 1 is negligible
with respect to n, so that O (n⁴m) is obtained as a time of
final complexity.
The columns of our bigroupe is calculated from the lines
obtained from the ACSI function, so in the case of decreasing
values of the threshold α, the number of rows increases and
The design of the interface of our system offers the user
in return the number of columns can decrease until no
several advantages, of which we quote that it allows to
column is obtained.</p>
        <p>Hence the risk of losing this bigroup altogether. We used
follow and to visualize all the steps of extraction of
bigroupes and also allows him to determine the position of
another parameter Ɣ that allows us to limit the number of
bigroupe in the matrix of initial data.
rows of a bigroup.
For a first test, the value of Ɣ is equal to n (where n is the
number of rows of the data file), we obtained six biclusters
with an execution time equal to 1145660 ms. We changed
the value of parameter Ɣ (number of genes generated) to 20
where we obtained ten biclusters with a run time of 6354
ms. So, the addition of this parameter allows the user to win
in the execution time and in the number of biclusters
obtained.
3.2.2</p>
      </sec>
      <sec id="sec-3-3">
        <title>Display of biclusters</title>
        <p>Unlike Command-row interface (CLI) systems that require
commands to be stored, GUI systems offer a relatively
intuitive approach. Even users without significant training
can easily learn the system and use it to achieve their goals.
1 Available on http://arep.med.harvard.edu/biclustering/
If the curves are similar, we can deduce that our bigroup has
a strong coherence.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>following format:
The data file used by our program must be structured in the
-15
…
125
The first column of the file must contain the name of the
genes. The columns are separated with spaces. Each row
contains the gene name followed by a set of gene
expressions.</p>
      <p>However, to test BicOPT's ability to extract different types
of biclusters, we used a set of real files : Human B-cell
Lymphoma dataset 1 with the size of (4026 rows, 96
columns) and Saccharomyces Cerevisiae dataset2 with the
size of (2993 rows, 173 columns).</p>
      <p>
        In order to compare the results obtained by BicOPT with the
other algorithms we used the BicAT toolbox [3], this tool
contains a set of bicluster algorithms (BiMax [
        <xref ref-type="bibr" rid="ref3">11</xref>
        ], CC [6],
ISA [5], OPSM [4], xMotives [
        <xref ref-type="bibr" rid="ref2">10</xref>
        ]). The BicOPT algorithm and
all other algorithms are run on an Intel Core I3 2.2 GHz
machine and 4 Gb of RAM.
The BicOPT settings are set to 0.8 for Alpha and 0.4 for Beta.
The test execution time of our algorithm lasted 137.15
minutes.
The first bigroup of size (8.57), which is presented by the
curve (a) (see Figure 7) and the second bigroup of size (12,
49), is presented by the curve (b) Figure 7). We have
associated a curve for each gene, the latter is created as a
function of the level of gene expression.
      </p>
      <p>The curves are grouped together in the same frame in order
to form an expression profile of a bicluster. According to
curve (a) we have a strong gene expression profile with the
presence of some noises. In the second curve (b) we also
observe a strong gene expression profile but without the
presence of noises. The presence of noise in the data file is
invaluable in the groups detected by BicOPT.
4.2</p>
      <sec id="sec-4-1">
        <title>Saccharomyces Cerevisiae dataset</title>
        <p>After testing a set of simulations, the BicOPT parameters
were set to 0.85 for Alpha and 0.7 for Beta. The test
execution time of our algorithm is 54.5 minutes. To evaluate
the biological relevance of bigroupes detected by our
algorithm, we used two web tools, GOTermFinder 3 and
FuncAssociate4, to calculate the p-value [13]. More than the
p-value is lower more than the bigroup genes are consistent.
4.2.1</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>GOTermFinder :</title>
      <p>GOTermFinder is a well-known web-tool which allows to
check the quality of each detected group and to search for
the significant terms of gene ontology shared by the
selected gene groups. To identify the characteristics that the
genes can have in common, we selected three random
groups in a random way (see Table 7).
We used the GOTermFinder tool to describe the most
significant shared terms, respectively, for the biological
process, molecular function, and cellular component (see
Table 4). For the first bigroup of size (20,101) we have for
example Cytoplasmic Translation (95.0%, 3.1E-28), so this
bigroup is involved in Cyptoplasmic translation with a
frequency of 95.0% (among 20 genes in the first bigroup 19
2 Available on http://people.ee.ethz.ch/~sop/bimax/
SupplementaryMaterial/Datasets/BiologicalValidation/data/s
accharomyces/yeast_GOEnrichment,Gasch2000,2944x173.txt
3 Available on
http://www.yeastgenome.org/cgibin/GO/goTermFinder.pl
4 Available on http://llama.mshri.on.ca/funcassociate/
belong To this process) and with a p-value equal to 3.1E-28
(very significant value).
4.2.2</p>
      <sec id="sec-5-1">
        <title>FuncAssociate</title>
        <p>FuncAssociate is a web application that helps to discover
properties enriched in lists of genes or proteins that emerge
from the experiment on a large scale [13]. The basic idea is
to select 20 biclusters and then to determine whether the
set of genes discovered by biclustering algorithms shows a
significant enrichment compared to an annotation of
genetic ontology (GO) or not (see Figure 8).
For values associated with parameter p, BicOPT surpassed
the other algorithms with a percentage of 100% followed by
Opsm with a percentage 90% for p = 0.0001. The other
algorithms also perform reasonably well. The experiments
applied to the real data set used prove that our proposed
algorithm can identify bigroups with high biological
relevance.
5</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we propose the BicOPT algorithm which
presents a new optimized version of the BicFinder
algorithm. The complexity time of our algorithm is equal to
O (n⁴m), which is less than that of BicFinder. BicOPT allows
the extraction and the production of a set of biclusters based
on the construction of an acyclic oriented graph (DAG). We
added a new GAMMA parameter to limit the gene numbers
of each generated bigroup. BicOPT has a graphical interface
allowing to manage well the obtained bigroupes We
realized different tests on real databases to evaluate the
performance of BicOPT. In the realization of this study, we
used two web tools GOTermFinder, FuncAssociate and a
BicAT application. The experimental study of our approach
to biclustering have good results.
[1]</p>
      <p>S. C. Madeira, A. L. Oliveira, Biclustering
Algorithms for Biological Data Analysis: A
Survey, IEEE Transactions on Computational</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Biology and Bioinformatics</source>
          , Vol.
          <volume>1</volume>
          , N°
          <fpage>1</fpage>
          <lpage>:</lpage>
          (
          <year>2004</year>
          ),
          <fpage>p24</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [10]
          <string-name>
            <surname>S. K. T. M. Murali</surname>
          </string-name>
          ,
          <article-title>Extracting conserved gene expression motifs from gene expression data</article-title>
          ,
          <source>Pac. Symp. Biocomput</source>
          , Vol.
          <volume>8</volume>
          : (
          <year>2003</year>
          ),
          <fpage>p77</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [11] [12] [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Prelic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bleuler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          , P.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Buhlmann</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Gruissem</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Hennig</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Thiele</surname>
            , and
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Zitzler</surname>
          </string-name>
          .
          <article-title>A systematic comparison and evaluation of biclustering methods for gene expression data</article-title>
          .
          <source>Bioinformatics</source>
          , Vol.
          <volume>22</volume>
          , N°
          <fpage>9</fpage>
          <lpage>:</lpage>
          (
          <year>2006</year>
          ),
          <fpage>p1122</fpage>
          -1129
          <string-name>
            <given-names>Y.S.</given-names>
            <surname>Son</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Baek</surname>
          </string-name>
          .
          <article-title>A modified correlation coefficient based similarity measure for clustering time-course gene expression data</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Pattern</given-names>
            <surname>Recognition</surname>
          </string-name>
          <string-name>
            <surname>Letters</surname>
          </string-name>
          , Vol.
          <volume>29</volume>
          , N°
          <fpage>3</fpage>
          <lpage>:</lpage>
          (
          <year>2008</year>
          ),
          <fpage>p232</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          70, N°
          <fpage>2</fpage>
          <lpage>:</lpage>
          (
          <year>2016</year>
          ),
          <fpage>p129</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>