Applying biclustering technique and gene ontology
analysis for gene expression data processing⋆
Sergii Babichev1,2,*,†, Maksym Korobchynskyi3,†, Myhailo Rudenko3,† and Hanna
Batenko 1,†
1 Kherson State University, University street, 27, Kherson, 73000, Ukraine
2 Jan Evangelista Purkyne University in Usti nad Labem, Pasteurova, 15, Usti nad Labem, 400 96, Czech Republic 3

Military Academy named after Eugene Bereznyak, Yria Il’enka street, 81, Kyiv, 04050, Ukraine


                 Abstract
                 This study details the biclustering methods for gene expression data, focusing on the refinement of
                 quality criteria essential for evaluating the generated bicluster structures. An internal biclustering
                 quality criterion is introduced, leveraging mutual information evaluation across both rows and
                 columns within a bicluster. Additionally, the research proposes a novel hybrid biclustering model,
                 which amalgamates the ensemble biclustering algorithm with Bayesian optimization techniques to
                 optimize the algorithm's parameters effectively. This model is grounded on a target objective function
                 derived from the newly proposed quality criterion. Simulations carried out on gene expression data
                 from patients afflicted with various cancer types demonstrate the efficacy of the model. Specifically,
                 the application of the mutual information-based criterion within the objective function results in the
                 formation of a bicluster structure comprising 18 distinct biclusters. Furthermore, the study expands
                 upon a method that employs gene ontology analysis, facilitating the identification of subsets of
                 significant gene expression data from bicluster analysis results. A comprehensive procedure for
                 identifying significant gene subsets through a combination of bicluster and gene ontology analyses
                 is executed. The evaluation of sample classification results, characterized by these significant gene
                 subsets, underscores the method's effectiveness. The classification quality criteria exhibit relatively
                 high values, even with a reduced number of genes, indicating the method's efficiency.

                 Keywords
                 gene expression data, bicluster analysis, gene ontology analysis, biclustering quality criteria,
                 convolution neural network (CNN)
1


1. Introduction
   The significance of bicluster analysis in the processing of gene expression data is determined
by its possibility to allocate the subsets of mutually coherent gene expression values, which can
improve the effectiveness of disease diagnosis systems [1,2]. Unlike traditional clustering


IntelITSIS’2024: 5th International Workshop on Intelligent Information Technologies and Systems of Information
Security, March 28, 2024, Khmelnytskyi, Ukraine
∗ Corresponding author.
† These authors contributed equally.

   sergii.babichev@ujep.cz (S.Babichev); maks_kor@ukr.net (M. Korobchynskyi); ruminik33@ukr.net (M.
Rudenko); gbatenko@ksu.ks.ua (H. Batenko)
    0000-0001-6797-1467 (S.Babichev); 0000-0001-8049-4730 (M. Korobchynskyi); 0000-0002-9180-1510 (M.
Rudenko); 0000-0001-7007-4708 (H. Batenko)
            © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
techniques, which primarily concentrate on categorizing objects (either rows or columns)
according to their likeness, thus neglecting the potential significance of interactions among
various data dimensions, biclustering introduces a more comprehensive strategy. When
applying the bicluster analysis, the concurrent grouping of both rows and columns is
performed, thereby allocating subsets of data that exhibit mutual correlations. This capability
is particularly crucial when dealing with intricate datasets, as it facilitates a deeper
understanding of the underlying patterns and relationships. Biclustering methods stand out by
offering a dual-axis analysis framework, which is fundamental to dissecting complex biological
data, such as gene expression profiles. This analysis not only identifies clusters of genes with
similar expression patterns across a subset of conditions but also pinpoints conditions under
which these genes exhibit similar behavior. The dual grouping mechanism inherent in bicluster
analysis is indispensable for exploring and interpreting the multifaceted nature of gene
expression data, revealing insights into gene functions, regulatory mechanisms, and cellular
processes that might remain obscured under traditional clustering approaches.
   In the context of information technology and bioinformatics, ontology is a formalized
version of knowledge representation that utilizes a controlled vocabulary and a set of
relationships between terms to describe the domain being considered [3,4]. Such ontology can
be used for modeling the subject area and serve for information exchange, data integration, and
the development of various computer applications, including artificial intelligence. In
bioinformatics, ontologies are used to structure and standardize information about biological
processes, protein functions, cellular components, and more. The Gene Ontology (GO) is an
example of such a system that allows for the annotation of genes and protein products in a
unified form, ensuring consistency and compatibility of biological databases. Biclustering and
data analysis based on gene ontology are linked through their common goal - understanding
the biological mechanisms and functional characteristics of genes that are revealed in
experimental gene expression data. While biclustering allows identifying groups of genes that
show similar expression patterns under different conditions or in different types of samples (in
the presence of different types of diseases), which is important for understanding which genes
are co-regulated in certain physiological states or respond to certain external stimuli, data
analysis in biclusters based on gene ontology allows determining the possible role of the
highlighted genes in the cell or organism being studied. In other words, gene ontology
represents a functional annotation of genes. Integrating biclustering results with analysis based
on GO offers the possibility to gain a deeper understanding of the biological context of gene
expression patterns and highlight groups of genes that are co-expressed in the presence of a
certain type of disease. Thus, biclustering and analysis based on GO complement each other,
providing a mechanism for the identification and functional understanding of biological
modules in large sets of gene expression data. This fact indicates the relevance of the research
topic.

2. Related works
The application of bicluster analysis for processing complex data has been the focus of a
significant number of scientific works nowadays. For instance, [1] presents a review of
metaheuristic approaches to solving biclustering problems, which effectively address complex
optimization tasks within a limited computational time and adapt to various problem
formulations. Special attention is given to optimization methods and key search elements:
representation, objective function, and variation operators, with a discussion on single- and
multi-objective approaches and highlighting new research directions. In [2], the hidden block
structure in a heterogeneous panel data model is explored, based on the assumption that
regression coefficients have group structures among individuals and structural changes over
time, where change points can affect group structures, and structural changes can vary between
subgroups. To recover the hidden block structure, the authors propose a robust bicluster
approach that uses M-estimation and concave penalty fusion, as well as developing an
algorithm based on local quadratic approximation for optimizing the objective function, which
is more compact and efficient compared to the ADMM algorithm. Furthermore, an oracle
property for penalized M-estimators is established, and it is proven that the proposed estimator
recovers the hidden block structure with high probability, which is also confirmed by positive
results in practice through simulation studies on several datasets.
    In [5], to improve the quality of biclustering and module extraction, a combination of
methods based on Adaptive Resonance Theory (ART) is utilized - Biclustering ARTMAP
(BARTMAP) and Topological ART (TopoART), which together form TopoBARTMAP. This
method inherits the ability to detect topological associations while reducing data volume.
TopoBARTMAP was tested on 35 real cancer datasets and compared with other (bi)clustering
methods, showing statistically significant improvements over other evaluated methods in
experiments with ordered and shuffled data. It also demonstrated better results in identifying
constant, additive, multiplicative, and multiplicative-additive biclusters in experiments with 12
synthetic datasets. Graphical representation was refined to display associations of gene
biclusters and evaluated on the NCBI GSE89116 dataset, which contains expression levels of
39,326 probes selected over 38 observations. In [6], a new biclustering algorithm for binary data
called the Binary Biclustering Algorithm Based on Adjacency Difference Matrix (AMBB) was
proposed, improving the balance between execution time and efficiency. The AMBB algorithm
constructs an adjacency matrix based on adjacency difference values, and the resultant
submatrix, updated using the adjacency difference matrix, is referred to as a bicluster. This
allows for grouping genes that exhibit similar reactions under different conditions, which is
important for further gene analysis. Experiments on synthetic and real datasets visually
demonstrate the high practicality of the AMBB algorithm.
    Despite the significant advancements in the field of bicluster analysis for processing complex
data, there remain unresolved challenges, including the lack of effective methods for optimizing
the hyperparameters of the relevant algorithm. This issue is particularly pertinent in the context
of new approaches, such as the combination of methods based on Adaptive Resonance Theory
for biclustering, which require precise tuning of hyperparameters for efficient operation.
Additionally, there is the problem of balancing between execution time and algorithm
efficiency, especially in situations involving binary data, where the development of new
optimization strategies is needed to ensure fast and accurate data processing.
    The goal of the paper is the development and application of the technique of gene
expression data processing based on the joint use of bicluster analysis and gene ontology
analysis methods.
3. Material and methods
3.1. Biclustering quality criterion based on an assessment of mutual information
As mentioned earlier, biclustering is the process of simultaneously clustering rows and columns
of a matrix. In the context of gene expression data analysis, experimental data are represented
as a matrix, where rows correspond to genes and columns to experimental conditions or vice
versa, and the values in the matrix reflect the level of gene expression under a certain condition,
i.e., its expression. In this case, a bicluster identifies a subset of genes that exhibit similar
expression profiles across a subset of conditions. One way to assess the quality of a bicluster is
through the application of mutual information (MI) analysis between rows and columns. MI can
indicate how much the information in the rows and columns depends on each other, and thus,
a high MI value may indicate a high coherence of the bicluster. The most common methods for
estimating mutual information include the following [7,8]:
     • Mutual Information (MI):
                                                                       𝑝𝑝(𝑥𝑥, 𝑦𝑦)
                       𝑀𝑀𝑀𝑀(𝑋𝑋, 𝑌𝑌) = � � 𝑝𝑝(𝑥𝑥, 𝑦𝑦)𝑙𝑙𝑙𝑙𝑙𝑙 �                      �                  (1)
                                                                     𝑝𝑝(𝑥𝑥)𝑝𝑝(𝑦𝑦)
                                       𝑥𝑥∈𝑋𝑋 𝑦𝑦∈𝑌𝑌

where: X, Y are vectors between which the MI is assessed; p(x, y) is the joint probability
distribution of X and Y; p(x) and p(y) are the marginal probability distributions.
   • Normalized mutual information is defined as the ratio of mutual information to the
       geometric mean of the entropies of the two vectors:
                                                           𝑀𝑀𝑀𝑀(𝑋𝑋, 𝑌𝑌)
                                𝑁𝑁𝑁𝑁𝑁𝑁(𝑋𝑋, 𝑌𝑌) =                                                     (2)
                                                     �𝐻𝐻 (𝑋𝑋), 𝐻𝐻(𝑌𝑌)
where H(X) and H(Y) are the entropies of vectors X and Y, respectively.
  • Relative entropy, or Kullback-Leibler divergence, is a measure of the distance between
     two probability distributions:
                                                                      𝑃𝑃(𝑖𝑖)
                              𝐷𝐷𝐾𝐾𝐾𝐾 (𝑃𝑃||𝑄𝑄) = � 𝑃𝑃(𝑖𝑖)𝑙𝑙𝑙𝑙𝑙𝑙                                       (3)
                                                      𝑖𝑖              𝑄𝑄(𝑖𝑖)
where P(i) is the probability of distribution P, and Q(i) is the probability of distribution Q.
   It should be noted that this distance is not symmetric, i.e., 𝐷𝐷𝐾𝐾𝐾𝐾 (𝑃𝑃||𝑄𝑄) ≠ 𝐷𝐷𝐾𝐾𝐾𝐾 (𝑄𝑄||𝑃𝑃), hence,
to enhance objectivity, it is advisable to calculate the bidirectional divergence with subsequent
averaging of the two divergences:
                                            𝐷𝐷𝐾𝐾𝐾𝐾 (𝑃𝑃||𝑄𝑄) + 𝐷𝐷𝐾𝐾𝐾𝐾 (𝑄𝑄||𝑃𝑃)
                        𝐷𝐷𝐾𝐾𝐾𝐾 (𝑃𝑃, 𝑄𝑄) =                                                            (4)
                                                            2
   Mutual information is a measure of shared information between two vectors of random
variables, but it is not in itself a distance metric. Transforming the value of mutual information
into a distance can be achieved in various ways. Within the scope of our research, a metric
based on Shannon entropy is applied:
                        𝑑𝑑(𝑋𝑋, 𝑌𝑌) = 𝐻𝐻(𝑋𝑋) + 𝐻𝐻(𝑌𝑌) − 2𝑀𝑀𝑀𝑀(𝑋𝑋, 𝑌𝑌)                                 (5)
where H(X) and H(Y) are the Shannon entropy values of vectors X and Y, respectively, and
MI(X,Y) is the mutual information between vectors X and Y. In this case, if considering two
identical data distributions, then H(X) = H(Y) = MI(X,Y) and d(X,Y) = 0. As the difference between
data distributions increases, the value of mutual information decreases, leading to an increase
in the distance between these vectors.
    Calculating the value of the internal criterion for assessing the coherence of a bicluster
involves estimating the average distance both between the rows and between the columns of
the bicluster. The step-by-step procedure for calculating this criterion includes the following
steps:
    1. Calculation of the average distance among all pairs of rows in the bicluster:
                                                        𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛−1 𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛
                                          2
                 𝑄𝑄𝑄𝑄𝑟𝑟𝑟𝑟𝑟𝑟 =                           �              � 𝑑𝑑(𝑋𝑋𝑖𝑖 , 𝑋𝑋𝑗𝑗 )    (6)
                              𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 × (𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 − 1)
                                                           𝑖𝑖=1      𝑗𝑗=𝑖𝑖+1

   2. Calculation of the average distance among all pairs of columns in the bicluster:
                                                       𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛−1 𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛
                                            2
                   𝑄𝑄𝑄𝑄𝑐𝑐𝑐𝑐𝑐𝑐 =                           �          � 𝑑𝑑(𝑌𝑌𝑖𝑖 , 𝑌𝑌𝑗𝑗 )      (7)
                                𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 × (𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 − 1)
                                                         𝑖𝑖=1     𝑗𝑗=𝑖𝑖+1

   3. Calculation of the average value of the criteria (6) and (7):
                                           𝑄𝑄𝑄𝑄𝑟𝑟𝑟𝑟𝑟𝑟 + 𝑄𝑄𝑄𝑄𝑐𝑐𝑐𝑐𝑐𝑐                           (8)
                                  𝑄𝑄𝑄𝑄 =
                                                      2
    The minimum value of criterion (8) corresponds to the maximum level of bicluster
coherence. It should be noted that when applying any clustering algorithm to gene expression
data, characterized by a large volume of data, a fairly large number of biclusters with low
coherence may emerge, which do not allow for a definitive identification of the class of samples
under investigation. Moreover, the architecture of biclustering is largely determined by the
parameters of the algorithm used to form the cluster structure. Therefore, the problem of
optimizing algorithm parameters also arises, for which a Bayesian optimization algorithm is
used within the current research, entailing the following stages:
    Stage I. Definition of the objective function.
    1.1. Selection of the biclustering algorithm that takes the values of the objective function
parameters as input. Application of the algorithm to gene expression data. Formation of the
bicluster structure.
    1.2. Selection of a bicluster and assessment of its coherence using formulas (6) – (8).
    Stage II. Definition of the parameter change range.
    2.1. Determination of the range of variation for each parameter's values.
    Stage III. Selection of the model and launch of the optimization algorithm.
    3.1. Selection of the Bayesian optimization algorithm model. A model based on Gaussian
processes was used in the research.
    3.2. Application of the Bayesian optimization algorithm using the chosen model. Formation
of the best combination of hyperparameters according to the formulated objective function.
    Stage IV. Verification of the result and formation of a compromise decision regarding the
optimal combination of hyperparameters.
   4.1. Application of the aforementioned procedure to the first five biclusters (the number of
biclusters may vary during modeling) followed by an analysis of the obtained results to form a
compromise decision regarding the optimal combination of algorithm parameters.
   Stage V. Application of the biclustering algorithm to gene expression data. Formation of the
bicluster structure. Assessment of the coherence of the identified biclusters and formation of a
subset of biclusters with a high coherence value for further research.

3.2. Biclustering quality criterion based on an assessment of mutual information
The procedure for identifying significant genes based on gene ontology analysis involves the
use of the functions from the Bioconductor module [15,16] The practical implementation of this
procedure assumes the following steps:
    1. Loading necessary packages in R. During the simulation process, for the analysis of gene
ontology and the selection of informative genes, the following packages were used: GO.db [17],
org.Hs.eg.db [18], biomaRt [19], and topGO [20].
    2. Data preparation. Creation of a list of gene identifier vectors (ENTREZ ID) contained in
the identified biclusters.
    3. Mapping genes to GO terms using functions from the org.Hs.eg.db package. Retrieval of
GO terms for all genes contained in the bicluster.
    4. Statistical analysis of gene expression values to estimate the probability (p-value) that the
differences in gene expression values corresponding to different classes to which the samples
belong could have occurred by chance. At this stage, the ANOVA (Analysis of Variance)
statistical method was used, which allows comparing the mean values of three or more groups.
In the context of gene expression analysis, ANOVA is used to determine whether there is a
statistically significant difference in gene expression levels between different sample classes.
The obtained p-values in this case indicate the probability that the observed differences could
have occurred by chance. To adjust p-values (calculation of p-adjust), the Benjamini-Hochberg
(BH) method was used to control type I errors during multiple comparisons.
    5. Creating a topGO data object, which contains all gene identifiers and their scores, GO
annotations, the hierarchical structure of GO, and all other information necessary for analyzing
gene enrichment being studied.
    6. Performing enrichment tests. Within the framework of dissertation research, two types of
statistical tests were applied: the Fisher's exact test, based on counting the number of genes
corresponding to each sample class, and the Kolmogorov-Smirnov test, which calculates
enrichment based on assessments of gene expression values. Each of these tests provides an
estimate of how differentially expressed a particular gene is, allowing genes to be categorized
by their level of differential expression.
    7. Formation of a gene ontology analysis result matrix with identifiers of genes that
correspond to significant gene ontologies as a result of the analysis.
    8. Formation of a vector of significant genes for the respective bicluster by finding matches
between gene identifiers contained in the bicluster and gene identifiers identified as a result of
the gene ontology analysis.
4. Experiment, results and discussion
4.1. Modeling to determine the optimal hyperparameters of the “ensemble”
biclustering algorithm using the Bayesian optimization method
At this stage of modeling, gene expression data from patients studied for various types of cancer
diseases were used as experimental data. The data are freely available on the website of The
Cancer Genome Atlas project – TCGA [9] and contained nine sample classes, eight of which
correspond to different types of cancer diseases, and the ninth group of gene expression data
corresponds to subjects for whom cancer disease was not detected. Initially, the data contained
3269 samples and 19947 genes. After removing non-expressed and low-expressed genes for all
samples using the method presented in [10], the number of genes was reduced to 19265. In the
next step, mutually expressed gene expression profiles were identified from the obtained data
by applying the inductive spectral clustering algorithm according to the method presented in
[10], resulting in 3444 genes contained in the third cluster of the three-cluster structure
(corresponding to the highest accuracy of sample classification). Thus, the experimental data
had the form (3269×3444).
    The modeling process was carried out in the R software environment [11] using the biclust
package [12], which contains functions for applying various biclustering algorithms.
Considering the studies presented in [13], within the current research, the biclustering process
was performed using the ensemble algorithm [14], whose effectiveness, according to the results
presented in [13], is significantly higher compared to using other biclustering algorithms. The
outcome of the ensemble algorithm is determined by two parameters: the thresholding
coefficient (thr) and the approximate ratio of the number of rows to columns in biclusters
(simthr). The modeling process involved varying the values of these parameters within a
predefined range with a certain step, calculating the values of the criterion (8) at each step of
this procedure implementation. During the simulation procedure implementation, at each
iteration, the first five biclusters were allocated, for each of which the value of the criterion was
calculated. The evaluation of the biclustering was based on the average arithmetic value of all
components of the corresponding criterion, which determines the coherence level of each of
the identified biclusters. The analysis of the obtained results allowed us to conclude that the
maximum value of the objective function (negative value of the criterion (8)) is achieved at the
10th iteration, with the following values of the "ensemble" biclustering algorithm parameters
obtained: thr = 0.549; simthr = 0.151.
    Table 1 shows the results of the “ensemble” biclustering algorithm with optimal
hyperparameters values operation. As it can be seen, 18 biclusters various sizes were allocate
in this case. The next stage is the application of gene ontology analysis to the data in the
allocated biclusters.

4.2. Forming a subset of significant gene expression data using gene ontology
analysis
The simulation process regarding the use of the gene ontology method to form a vector of
significant genes, considering the type of samples, was carried out using gene expression data
from the first bicluster. Figure 1 illustrates the result of applying the ANOVA statistical test to
gene expression data (Volcano plot). The horizontal axis (Log2 Fold Change) on the diagram
shows the level of expression value of one group of genes compared to the expression of genes
from another group. To the left of the center are genes that have lower expression in the first
group compared to the second. Genes depicted to the right of the center have higher expression
in the first group. It is evident that the further a gene is located from the center, the higher its
level of differential expression. The vertical axis displays p-values (p-adjust) in a logarithmic
scale (-log10(p-adjust)). Genes with lower p-values, indicating greater statistical significance of
the difference in expression, are positioned higher on the graph. The analysis of the obtained
results allows us to conclude that a relatively large number of genes contained in the bicluster
can be identified as insignificant (located at the center bottom of the diagram), which confirms
the need for further analysis with the aim of their removal.

Table 1
The result of the biclust analysis of gene expression data when applying ther “ensemble”
algorithm
               BC 1                               BC 2                            BC 3
 Gene            Sample            Gene             Sample             Gene          Sample
  456     gbm        lgg   norm     430     gbm        lgg    norm       399   luad    lusc    stad
            53      494       4              137       524       5               7       5      49
          BC 4                           BC 5                       BC 8               BC 9
 Gene         Sample       Gene            Sample             Gene Sample      Gene        Sample
  339     luad      lusc     123    luad     lusc     stad     484      lusc    782    kirc norm
             5        23             105      32        22                4            197       1
               BC 6                               BC 7                           BC 11
 Gene            Sample             Gene            Sample             Gene          Sample
  189     luad      lusc    stad     329    luad      lusc     stad      463   luad    lusc    stad
           127        56     60              104        33      23              68      62       1
      BC 10                BC 12                           BC 14                      BC 16
 Gene Sample Gene              Sample       Gene              Sample           Gene        Sample
  461     sarc      505      acc    sarc     348      luad     lusc    norm     612    kirc norm
            13                5       2                 14       1        4            299       1
      BC 13                     BC 15                          BC 17                  BC 18
 Gene Sample Gene                  Sample            Gene          Sample      Gene        Sample
 450      lusc      315     luad    lusc     stad      346      acc     sarc    231     acc    sarc
            14                6       3       69                 6        5              8      30

   The next step involves performing enrichment tests with the calculation of p-values, which
determine the level of significance of genes according to the respective test. As mentioned
earlier, the Fisher's test and the Kolmogorov-Smirnov test were applied during the modeling
process. The results of the modeling are shown in Figure 2.
   According to both tests, 388 significant GO terms were identified out of 704. In the depicted
diagram, the size of a dot is proportional to the number of annotated genes for the
corresponding GO term, and its color represents the number of significantly differentially
expressed genes. The thresholding parameter, which separates genes into significant and
insignificant, was chosen at the level of the median of the gene significance vector. As can be
seen, red dots contain many more genes than blue ones.
   The analysis of the diagram presented in Figure 2 also allows us to conclude that the results
of applying the Fisher's test and the Kolmogorov-Smirnov test differ from each other. Thus,
some GO terms identified as significant using the Fisher's test have less significance when
applying the Kolmogorov-Smirnov test.


Figure 1: Visualization of the distribution of p-values of genes based on their significance level
(volcano plot).
   However, in some cases, it is possible to visually identify several GO terms for which the p-
values are nearly the same when both tests are applied. The obtained results also indicate that
despite the same number of significant genes when applying both tests, using only one test to
form a subset of significant genes based on GO analysis is not objective. In this case, increasing
the objectivity of the analysis can be achieved by applying both tests with the formation of
intermediate decisions followed by their combination to select unique identifiers of significant
genes.
   In Figures 3 and 4, the results of applying GO analysis with the identification of ten
significant GO terms using the Fisher's and Kolmogorov-Smirnov tests, respectively, are
presented. Significant nodes are represented as rectangles. The color of the node represents
relative significance, varying from dark red (most significant) to bright yellow (least
significant). The analysis of the obtained graphs confirms the conclusion regarding the
inconsistency of results when applying different tests during GO analysis aimed at forming a
subset of significant genes. As evident from the figures, when identifying the ten most
significant GO terms, the results differ both in the topology of the graph and in the significance
level of the GO terms, which serve as the nodes of the graph. This fact corroborates the
hypothesis about the advisability of applying both tests for forming a subset of significant
genes.
   As the simulation results have shown, the outcome of applying GO analysis is a table of
convergence between GO terms and gene identifiers corresponding to the respective terms.
Here, a single GO term can correspond to a large number of genes.
Figure 2: Scatter plot of the distribution of p-values calculated using the classical Fisher's test
(x-axis) and the Kolmogorov-Smirnov method (y-axis).

    For instance, in the case of applying the Fisher's test, the total number of genes
corresponding to 388 significant GO terms was 26,092, whereas for the Kolmogorov-Smirnov
test, it was 24,456.
    In line with the set objective, the final step involved associating gene identifiers contained
in the bicluster with gene identifiers identified through GO analysis. As a result, 270 genes were
identified using the Fisher's test and 254 genes were identified using the Kolmogorov-Smirnov
test. The total number of genes in the bicluster at this point was 465. By combining the results
of applying both tests and identifying unique gene identifiers, the total number of significant
genes amounted to 296.
    However, it should be noted that the above type of GO analysis is effective when applied to
data containing at least two sample classes, with each class having a sufficiently large number
of samples. If these conditions are not met, the ANOVA test either will not work or may provide
unreliable results.
    For this reason, this type of GO analysis is suitably applied in the implementation of cluster
analysis of gene expression profiles, where each cluster corresponds to a complete set of sample
classes with a sufficiently large number of samples in each class.
    When applying bicluster analysis, the condition for using the ANOVA test may not be met,
as this can identify biclusters containing only one sample class, or the number of samples
corresponding to one of the classes may be quite small, which reduces the reliability of the test
results. In this case, it is appropriate to apply a statistical test based on the assessment of
whether the number of genes associated with a certain GO term in the list of genes comprising
the bicluster differs from the number expected by chance. In other words, the statistical test
compares the number of genes in the selected GO category contained in the bicluster with their
total number in the genome of the studied object.


Figure 3: The result of applying GO analysis with the identification of ten significant GO terms
using the Fisher's test.


Figure 4: The result of applying GO analysis with the identification of ten significant GO terms
using the Kolmogorov-Smirnov test.
   Within the framework of dissertation research, the statistical test is implemented in the R
programming environment by applying the function enrichGO() from the clusterProfiler
package [21]. The use of the statistical test using the enrichGO() function involves two stages:

   •    Implementing the hypergeometric test by comparing the number of genes associated
with a certain GO term to what is expected by chance. It should be noted that the GO term
database must correspond to the type of biological object being studied. The GO term database
for Homo sapiens “org.Hs.eg.db” was used in the modeling process.
   •    p-value correction. This step is necessitated by the large number of GO terms being
analyzed, which requires adjusting p-values to control for multiple comparisons. The
application of the Benjamini-Hochberg (BH) method helps to reduce the type I error.

   The result of modeling regarding the application of GO analysis based on the enrichGO()
function is a table with GO terms, which also contains p-values, adjusted p-values, and the
number of genes in each term. Table 2 presents the result of the GO analysis of gene expression
data from the first bicluster (the first 10 rows are shown). The threshold value that separates
significant and insignificant GO terms was set at 0.05. At this threshold, 118 significant GO
terms were identified. Figure 5 depicts the network of connections of the five most significant
GO terms and their corresponding genes. As can be seen, as in the previous modeling, each GO
term corresponds to a large number of genes, which confirms the need for filtering gene
identifiers at a certain stage of data processing. Based on the modeling results regarding the
application of gene ontology analysis to the bicluster structure formed in the previous step,
which included 3444 genes, 1780 significant genes were identified. Thus, as a result of applying
GO analysis, a new gene expression data matrix was formed: (3269×1780).

Table 2
Results of GO analysis using the statistical test based on the enrichGO() function with the
application of gene expression data from the first bicluster

  №            ID              GeneRatio           p-value              p.adjust          Count
  1        GO:0007409           42/428          6.361398e-15         2.358170e-11          42
  2        GO:0010975           41/428          4.477555e-14         8.299147e-11          41
  3        GO:0050771           13/428          2.408776e-09         2.813517e-06          13
  4        GO:0050770           19/428          3.035896e-09         2.813517e-06          19
  5        GO:0050890           26/428          1.789554e-08         9.847434e-06          26
  6        GO:0007411           22/428          1.859510e-08         9.847434e-06          22
  7        GO:0097485           22/428          1.859510e-08         9.847434e-06          22
  8        GO:0031345           19/428          8.314768e-08         3.852856e-05          19
  9        GO:0048675           15/428          1.020838e-07         4.204719e-05          15
  10       GO:0010977           16/428          1.213019e-07         4.496662e-05          16
  …            …                  …                   …                    …                …

    To assess the effectiveness of the proposed technology, a one-dimensional two-layer GRU
recurrent neural network was applied to the obtained data. The optimal number of neurons in
the layers was determined using the Bayesian optimization algorithm. The number of neurons
varied in the range from 20 to 100. The modeling results showed that increasing the number of
neurons is not advisable, as it led to overfitting of the network. The discrepancy in classification
results between the data used for training and model validation increased. According to the
results of the Bayesian optimization algorithm, the number of neurons in the first and second
layers was 98 and 75, respectively. Following the classical classifier application methodology,
at the first stage, the data (samples) were divided into two subsets in a ratio of 0.7/0.3 (training
subset and testing subset). At the second stage, the training subset was also divided into two
subsets in a ratio of 0.8/0.2. The smaller subset was used for model validation during the training
process. At each stage of applying the Bayesian optimization algorithm during network
training, 10-fold cross-validation was applied. Table 3 presents the classification results of
samples that make up the testing subset of significant gene expression data.

Table 3
The results of classifying the samples comprising the test subset of expression data from
significant genes selected using the criterion based on the MI evaluation
    Class                Classification quality criteria               Total  Correctly
               Precision       Recall      F1-score      Accuracy    number   identified
     acc          0.893        0.893         0.893                       28       25
    gbm           0.915        0.915         0.915                       59       54
    kirc          0.982        0.964         0.973                      169      163
    luad          0.982        0.975         0.979        0.943         166      162
     lgg          0.873        0.926         0.899                      149      138
    lusc          0.923        0.889         0.906                      135      120
   normal         0.908        0.952         0.929                       62       59
    sarc          0.962        0.949         0.955                       79       75
    stad          0.977        0.963         0.970                      134      129

   The analysis of the obtained results allows us to conclude that based on the group of
classification quality criteria, the formation of bicluster structures followed by data filtering
through the application of gene ontology analysis allows the formation of subsets of significant
and mutually correlated gene expression data. The classification quality criteria values are
consistently high in all cases, despite the limited number of genes that comprised the
experimental data at the initial stage.


Figure 5: Network of connections of the five most significant GO terms with their
corresponding genes.
5. Conclusions
    This study presents the research results regarding the improvement of the methods of
biclustering gene expression data by refining the quality criteria for biclustering, which allows
us to evaluate the bicluster structure generated during the biclustering algorithm's execution.
A novel internal quality criterion based on mutual information evaluation, both among the rows
and columns of a bicluster, was proposed. Furthermore, a hybrid biclustering model for gene
expression data processing has been proposed, integrating the ensemble biclustering algorithm
and the Bayesian optimization method to fine-tune biclustering algorithm parameters. This
model employs a target objective function based on the proposed quality criterion. Simulations
using gene expression data from patients with various types of cancer showed that the objective
function's application, using a criterion based on mutual information evaluation, formed the
bicluster structure with 18 biclusters.
    In this study also further developed a method based on gene ontology analysis in models,
allowing for the formation of a subset of significant gene expression data using the results of
the bicluster analysis.
    We proposed and implemented the stepwise procedure for forming subsets of significant
genes through the joint use of bicluster analysis and gene ontology analysis. The classification
results, obtained using allocated significant gene expression data, underscored the
methodological precision, with high scores across various metrics: precision, recall, F1-score,
and accuracy, the values of which are varied within the range from 0.873 to 0.982, 0.889 to 0.975,
0.899 to 0.979, respectively, with an overall accuracy of 0.943.
    The obtained results not only affirm the effectiveness of the joint use of biclustering and
gene ontology analysis but also highlight the potential of applying deep neural network models
to processing complex biological data.
    The further prospects of the authors' research are the application of the proposed method
within the framework of hybrid models of gene expression data processing based on the joint
use of cluster-bicluster analysis, gene ontology analysis and deep learning methods.


References
[1] A. José-García, J. Jacques, V. Sobanski, C. Dhaenens. Metaheuristic Biclustering
    Algorithms: From State-of-the-art to Future Opportunities. ACM Computing Surveys,
    2023, vol. 56(3), art. no. 69. doi: 10.1145/3617590
[2] W. Cui, Y. Li. Bicluster Analysis of Heterogeneous Panel Data via M-Estimation.
    Mathematics, 2023, vol. 11(10), art. no. 2333. doi: 10.3390/math11102333
[3] P.D. Tomas, D. Ebert, A. Muruganujan, et al. PANTHER: Making genome-scale
    phylogenetics accessible to all. Protein Science, 2021, vol. 31, pp. 8-22. doi:
    10.1002/pro.4218J. Cohen (Ed.), Special issue: Digital Libraries, volume 39, 1996.
[4] Ye. Hnatchuk, Т. Hovorushchenko. Rules and method of supporting the decision-making
    regarding the possibility of extracorporeal fertilization. Computer Systems and
    Information Technologies, 2022, №3, pp. 6-10.
[5] R. Yelugam, L.E. Brito da Silva, D.C. Wunsch. Topological biclustering ARTMAP for
     identifying within bicluster relationships. Neural Networks, 2023, vol. 160, pp. 34-
     49. doi: 10.1016/j.neunet.2022.12.010
[6] H.-M. Chu, J.-X. Liu, K. Zhang, C.-H. Zheng, J. Wang, X.-Z Kong. A binary biclustering
     algorithm based on the adjacency difference matrix for gene expression data analysis. BMC
     Bioinformatics, 2022, vol. 23(1), art. no. 381. doi: 10.1186/s12859-022-04842-4.
[7] S. Babichev, J. Krejci, J. Bicanek, V. Lytvynenko. Gene expression sequences clustering
     based on the internal and external clustering quality criteria. Proceedings of the 12th
     International Scientific and Technical Conference on Computer Sciences and Information
     Technologies, CSIT 2017, 2017, vol.1, art. no. 8098744, pp. 91-94. doi: 10.1109/STC-
     CSIT.2017.8098744.
[8] P. Schürger, V. Engel. Differential Shannon Entropies Characterizing Electron–Nuclear
     Dynamics and Correlation: Momentum-Space Versus Coordinate-Space Wave Packet
     Motion. Entropy, 2023, vol. 25(7), art. no. 970. doi: 10.3390/e25070970.
[9] The           Cancer         Genome          Atlas         Program.         Available       on:
     https://www.cancer.gov/ccg/research/genome-sequencing/tcga.
[10] S. Babichev, L. Yasinska-Damri, I. Liakh, J. Škvor. Hybrid Inductive Model of Differentially
     and Co-Expressed Gene Expression Profile Extraction Based on the Joint Use of Clustering
     Technique and Convolutional Neural Network. Applied Sciences (Switzerland), 2022, vol.
     12(22), art. no. 11795. doi: 10.3390/app122211795.
[11] The R Project for Statistical Computing. Available on: https://www.r-project.org/
[12] S. Kaiser, R. Santamaria, T. Khamiakova, et al. Biclust: Bicluster Algorithms. Available on:
     https://cran.r-project.org/web/packages/biclust/index.html.
[13] S. Babichev, V. Osypenko, V. Lytvynenko, M. Voronenko, M. Korobchynskyi. Comparison
     Analysis of Biclustering Algorithms with the use of Artificial Data and Gene Expression
     Profiles. (2018), IEEE 38th International Conference on Electronics and Nanotechnology,
     ELNANO 2018 - Proceedings, 2018, art. no. 8477439, pp. 298-304. doi:
     10.1109/ELNANO.2018.8477439.
[14] P. Foszner, W. Labaj, A. Polanski, M. Staniszewski. Consensus Algorithm for Bi-clustering
     Analysis. Lecture Notes in Computer Science (including subseries Lecture Notes in
     Artificial Intelligence and Lecture Notes in Bioinformatics), 2022, pp. 557-570. doi:
     10.1007/978-3-031-08754-7_61.
[15] Bioconductor. Available on: https://www.bioconductor.org/developers/package-
     guidelines/
[16] Bioconductor: Open source software for Bioinformatics. Available on:
     https://www.bioconductor.org/
[17] M. Carlson. GO.db: A set of annotation maps describing the entire Gene Ontology. R
     package version 3.8.2. 2019.
[18] M. Carlson. org.Hs.eg.db: Genome wide annotation for Human. R package version 3.8.2.
     2019.
[19] Bioconductor package biomaRt. Available on https://github.com/grimbough/biomaRt.
[20] A. Alexa, J. Rahnenfuhrer. topGO: Enrichment Analysis for Gene Ontology. R package
     version 2.54.0. 2023.
[21] T. Wu, E. Hu, S. Xu, et al. ClusterProfiler 4.0: A universal enrichment tool for interpreting
     omics data. Innovation, 2021, vol. 2(3), art. no. 100141. doi: 10.1016/j.xinn.2021.100141.