=Paper=
{{Paper
|id=Vol-3892/paper2
|storemode=property
|title=A Hybrid Inductive Model for Gene Expression Data Processing Using Spectral Clustering
|pdfUrl=https://ceur-ws.org/Vol-3892/paper2.pdf
|volume=Vol-3892
|authors=Sergii Babichev,Oleg Yarema,Ihor Liakh
|dblpUrl=https://dblp.org/rec/conf/iddm/BabichevYL24
}}
==A Hybrid Inductive Model for Gene Expression Data Processing Using Spectral Clustering==
Sergii Babichev1,2,*,†, Oleg Yarema3,† and Ihor Liakh4,†
1 Kherson State University, 27 University Street, 73000 Kherson, Ukraine
2 Jan Evangelista Purkyne University in Usti nad Labem, Pasteurova 15, 400 96 Usti nad Labem, Czech Republic
3 Ivan Franko National University, 1 Universytetska Street, 79000 Lviv, Ukraine
4 Uzhhorod National University, 14 University Street, 88000 Uzhhorod, Ukraine
Abstract
One of the key directions in modern bioinformatics is the development of systems for diagnosing various
diseases using gene expression data. Clustering gene expression profiles is a critical step in disease
diagnosis systems. In this study, we propose a hybrid inductive model for clustering gene expression
profiles using the spectral clustering algorithm. The implementation of this model aims to reduce
reproducibility errors by serializing the data processing flow and optimizing clustering based on both
internal and external quality criteria. The model is presented as a block diagram, and its practical
implementation has demonstrated the high effectiveness of the proposed approach. The model’s
performance was evaluated using a convolutional neural network. The experimental dataset consisted of
gene expression values assigned to the identified clusters. The simulation results indicate that the highest
classification accuracy was achieved with a three-cluster structure, which corresponded to the highest
balance between internal and external clustering quality criteria. These findings create opportunities for
enhancing existing gene expression clustering models through more precise tuning of clustering
algorithm hyperparameters, guided by the principles of inductive methods for analyzing complex systems.
Keywords
Gene expression data, spectral clustering, internal and external clustering quality criteria, convolutional neural network (CNN), classification accuracy
1. Introduction
Gene expression (GE) data are a crucial element of modern research in bioinformatics and
genomics. They enable the investigation of gene functional activity under various conditions and
developmental stages while also aiding in the discovery of molecular mechanisms underlying
biological processes. This, in turn, provides a foundation for developing and refining personalized
medicine systems through accurate analysis and processing of GE data in diagnostic models,
reconstruction, simulation, and validation of gene regulatory network (GRN) models [1]. As
demonstrated by the analysis of contemporary GE data [2], the human genome consists of tens of
thousands of genes, with around 25,000 of them active. The activity (expression) of these genes is
governed by various processes that dictate an organism's functioning. Thus, identifying the subset
of genes that directly determine the state of the organism remains one of the pressing challenges in
bioinformatics, and as of now, it does not have a definitive solution.
A significant number of scientific studies are currently focusing on processing GE data to
identify co-expressed genes through cluster analysis [3-6]. These studies aim to refine clustering
techniques to more accurately group genes with similar expression patterns, which can reveal
functional relationships and regulatory mechanisms within the genome. The results of such
IDDM’24: 7th International Conference on Informatics & Data-Driven Medicine, November 14-16, 2024, Birmingham, UK
* Corresponding author.
† These authors contributed equally.
sergii.babichev@ujep.cz (S. Babichev); oleh.yarema@lnu.edu.ua (O. Yarema); ihor.lyah@uzhnu.edu.ua (I. Liakh)
ORCID: 0000-0001-6797-1467 (S. Babichev); 0000-0003-3736-4820 (O. Yarema); 0000-0001-5417-9403 (I. Liakh)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
analyses are essential for advancing our understanding of gene networks and improving predictive
models for various biological conditions. Thus, in [3], the authors focus on improving the process
of identifying subsets of co-expressed genes by leveraging advanced cluster analysis techniques.
The proposed approach enhances the quality of GE data imputation by exploiting multiple
clustering solutions, enabling more accurate grouping of genes with similar expression patterns.
This method significantly contributes to the allocation of gene subsets with shared functional
activity, offering a robust tool for bioinformatics research. Study [4] introduces a Cluster
Decomposition-based Anomaly Detection method, known as scCAD, to improve the identification
of co-expressed genes in single-cell GE data. By iteratively refining clusters based on differential
signals, scCAD enhances the detection of rare cell types that are often missed by traditional
clustering methods. Benchmarking on 25 datasets shows scCAD's superiority in identifying rare
cell types and disease-related immune subtypes, providing valuable insights into complex
biological processes. In [5], the authors emphasize the importance of clustering in optimizing the
analysis of single-cell chromatin accessibility (scATAC-seq) and multi-omic datasets. They
benchmark eight feature engineering pipelines across various data processing stages, assessing
their ability to discover and differentiate cell types based on clustering performance. SnapATAC
and SnapATAC2 are highlighted as the most effective methods for datasets with complex cell-type
structures, proving critical in extracting meaningful insights from high-dimensional and noisy data.
Study [6] discusses the challenges of developing effective clustering algorithms for spatial
transcriptomics (ST) data, focusing on defining spatially coherent regions within tissue slices and
integrating multiple slices from different sources. The authors systematically benchmark a range of
state-of-the-art clustering, alignment, and integration methods using diverse datasets, evaluating
their performance with eight metrics related to spatial accuracy and contiguity. Based on these
results, the study provides detailed recommendations for selecting the most suitable methods for
specific datasets and offers guidance for future method development in ST data analysis.
However, it should be noted that the successful application of cluster analysis to
identify and form subsets of co-expressed GE profiles for disease diagnosis systems faces several
limitations and unresolved challenges. Despite significant progress in refining clustering
techniques to better group genes with similar expression patterns, certain obstacles persist. A key
limitation is the difficulty in identifying rare or subtle gene expressions, especially when data is
noisy or high-dimensional, such as in scRNA-seq or spatial transcriptomics. While methods like
scCAD and SnapATAC2 have advanced in this area, they still rely on iterative refinement and
sophisticated benchmarks and may overlook rare gene sets or struggle with large, complex datasets.
Another unsolved issue is the challenge of integrating multiple tissue samples or datasets,
particularly in spatial transcriptomics and multi-omics studies, where spatial coherence and
alignment are critical but difficult to achieve. Existing methods often lack scalability or struggle
with generalizing across diverse data sources. Furthermore, many studies highlight the lack of
comprehensive benchmarks, limiting the ability to systematically compare and improve clustering
algorithms.
In sum, while current research has made strides in improving gene expression clustering,
developing more robust, scalable, and generalizable methods remains a pressing need to ensure the
accurate formation of co-expressed gene subsets for reliable disease diagnosis based on GE data.
The performance of the spectral clustering algorithm for GE data clustering has shown promise
due to its ability to effectively handle complex, non-linear relationships within high-dimensional
datasets [7,8]. In this study, we continue the research presented in [9,10] and propose a hybrid
inductive model that utilizes spectral clustering to form subsets of co-expressed genes, enhancing
the ability to detect subtle patterns in gene expression profiles. Spectral clustering operates by
transforming data into a lower-dimensional space, where traditional clustering techniques can be
applied more efficiently, thus overcoming limitations of other algorithms that may struggle with
high-dimensionality and noise inherent in gene expression data.
The hybrid approach combines spectral clustering with inductive methods of complex system
analysis to further improve accuracy in grouping co-expressed genes, leveraging the algorithm's
strength in identifying clusters of varying shapes and sizes. By applying spectral clustering to gene
expression data, we achieve better delineation of gene subsets that are often difficult to separate
using standard techniques. This model has the potential to significantly enhance disease diagnosis
systems by improving the precision and scalability of clustering in complex biological datasets.
2. Materials and Methods
Spectral clustering (SC) is a modern technique that helps identify clusters with arbitrary shapes by
leveraging similarity matrices between the studied objects [11-13]. Compared to traditional
clustering methods, such as k-means, hierarchical agglomerative, and divisive approaches, spectral
clustering provides several significant advantages. It often delivers superior results in terms of
clustering quality and is also relatively easy to implement, utilizing standard linear algebra
operations efficiently. Unlike many conventional algorithms, spectral clustering does not rely on
the absolute positions of objects in space; instead, it focuses on analyzing the affinities between
them, which makes it especially effective for grouping complex structures. The typical
implementation of spectral clustering follows a sequence of key steps:
1. Constructing the Similarity Graph. A similarity graph $G = (V, E)$ is an undirected graph comprising a set of nodes $V = \{v_1, \ldots, v_n\}$ (the objects being studied) and a set of edges $E$, which connect nodes i and j and define the measure of proximity between them. Two nodes are considered connected if the similarity value $w_{ij}$ between the corresponding objects (nodes of the graph) exceeds a certain threshold, and the edge is assigned a weight $w_{ij}$. In this scenario, the clustering task can be formalized as follows: the graph structure should be constructed so that the edges between different groups (clusters) have very low weights, indicating that objects in different clusters are as dissimilar as possible. Conversely, edges between nodes within the same group should have high weights, signifying that objects within the same cluster are as similar as possible. Constructing the similarity graph involves calculating a similarity matrix using an appropriate proximity metric based on the characteristics of the objects being studied. For instance, when clustering gene expression profiles, a hybrid modified metric based on maximizing mutual information and Pearson correlation is used. In conclusion, the similarity graph is an undirected weighted graph where the strength of connection between nodes is determined by the weight of the edge connecting them. The degree of a node is defined as the sum of the weights of the edges connecting this node to its neighbors:

$$d_i = \sum_{j=1}^{m} w_{ij} \qquad (1)$$
where m is the number of nodes directly connected to node i. Note that if two nodes are not directly connected, the weight of the edge between them is zero. Based on node degrees, the degree matrix D is formed, which is a diagonal matrix with the degrees of the nodes $d_1, \ldots, d_n$ on the main diagonal. This process creates the conditions for cluster formation by initializing a threshold coefficient that limits the number of connections with non-zero weights. All components in a subset of objects A are considered connected if the weights of direct or indirect connections between all nodes in A are greater than zero, and the weights between nodes in A and those in other subsets are zero. Depending on how the set of objects and the corresponding similarity matrix are transformed into a similarity graph, the following types of graphs can be identified:
ɛ-neighborhood graph: This type connects all points (object identifiers) whose pairwise
distances are smaller than a predefined ɛ-neighborhood. Since distances between all pairs
are measured on the same scale (no larger than ɛ), the graph is typically unweighted and
does not require additional information regarding the strength (weight) of the connections.
k-nearest neighbors graph: In this type of graph, node i is connected to node j if j is one of
the k-nearest neighbors of i. The weight of the edges is initialized based on the similarity
matrix, making this a weighted graph.
Fully connected graph: This type connects all nodes with positive connection strengths
based on the similarity matrix. Local ɛ-neighborhoods are formed using appropriate
similarity functions, such as a Gaussian similarity function.
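As an illustration, the three graph constructions listed above can be sketched with NumPy and scikit-learn. The data, the Gaussian width sigma, the threshold ɛ, and the neighbor count are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))            # 30 objects in a 5-dimensional space

# Fully connected graph: Gaussian similarity w_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W_full = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(W_full, 0.0)           # no self-loops

# epsilon-neighborhood graph: unweighted, connects pairs closer than eps
eps = 2.5
W_eps = (np.sqrt(sq_dists) < eps).astype(float)
np.fill_diagonal(W_eps, 0.0)

# k-nearest neighbors graph: symmetrized connectivity, weighted by the similarity matrix
knn = kneighbors_graph(X, n_neighbors=5, mode="connectivity").toarray()
W_knn = np.maximum(knn, knn.T) * W_full

print(W_full.shape, int(W_eps.sum()), round(W_knn.max(), 3))
```

Each construction yields a symmetric weight matrix that can serve as input W for the algorithms described next.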
2. Constructing the Laplacian Matrix and Computing Eigenvectors. The Laplacian matrix, derived from the weight matrix and the degree matrix of the graph, is a central component in spectral clustering. For this process, we assume that the graph G is undirected and weighted, with its weight matrix denoted as W. The Laplacian matrix can be computed in either normalized or unnormalized form. Its eigenvalues are sorted in ascending order, and the first k eigenvectors correspond to the smallest k eigenvalues.
3. Cluster Formation in the Spectral Space. In the final step, the rows of the eigenvector matrix are treated as points in a k-dimensional space and grouped with a conventional algorithm, typically k-means, which assigns each node to a cluster based on its proximity to the nearest cluster centroid.
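The Laplacian variants referred to in step 2 follow directly from W and the degree matrix D; a minimal NumPy sketch (with illustrative random data) is:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)                    # Gaussian similarity matrix
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)                        # node degrees, eq. (1)
D = np.diag(d)

L = D - W                                # unnormalized Laplacian
L_rw = np.linalg.inv(D) @ L              # random-walk normalized Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt      # symmetric normalized Laplacian

eigvals = np.linalg.eigvalsh(L)          # ascending order; smallest is (numerically) 0
print(np.round(eigvals[:3], 6))
```

The smallest eigenvalue of any graph Laplacian is zero, and the multiplicity of that eigenvalue equals the number of connected components, which is what makes the spectrum useful for cluster detection.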
2.1. Step-by-step procedure for implementing the SC algorithm
Assume that the experimental data consists of n objects (points in an m-dimensional space), where
the distances between all pairs of points are defined by a similarity matrix. Depending on the
method used to construct the similarity graph and compute the Laplacian matrix, several step-by-
step procedures form the basis of the SC algorithm.
1. SC Algorithm based on the unnormalized Laplacian matrix.
Input: Similarity matrix $W \in R^{n \times n}$, number of clusters k.
Steps:
Build the similarity graph using the values of the similarity matrix W to initialize the weights of the corresponding edges.
Calculate the unnormalized Laplacian matrix $L = D - W$.
Calculate the first k eigenvectors of L: $u_1, \ldots, u_k$. Form matrix $U \in R^{n \times k}$, where each column represents an eigenvector $u_1, \ldots, u_k$.
For each $i = 1, \ldots, n$, extract vector $y_i \in R^k$, corresponding to the i-th row of matrix U.
Cluster the points corresponding to the vectors $y_i \in R^k$ using the k-means algorithm to form clusters $C_1, \ldots, C_k$.
Output: Clusters $A_1, \ldots, A_k$, where $A_i = \{j \mid y_j \in C_i\}$ contains the points in the i-th cluster.
2. SC Algorithm based on the normalized Laplacian using the Shi and Malik method.
Input: Similarity matrix $W \in R^{n \times n}$, number of clusters k.
Steps:
Build the similarity graph using the values of the similarity matrix W to initialize the edge weights.
Calculate the unnormalized Laplacian matrix L.
Calculate the first k generalized eigenvectors of L, corresponding to the equation $Lu = \lambda D u$, where λ is the eigenvalue corresponding to eigenvector u. Form matrix $U \in R^{n \times k}$, where each column represents an eigenvector $u_1, \ldots, u_k$.
For each $i = 1, \ldots, n$, extract vector $y_i \in R^k$, corresponding to the i-th row of matrix U.
Cluster the points corresponding to the vectors $y_i \in R^k$ using the k-means algorithm to form clusters $C_1, \ldots, C_k$.
Output: Clusters $A_1, \ldots, A_k$, where $A_i = \{j \mid y_j \in C_i\}$ contains the points in the i-th cluster.
3. SC Algorithm based on the normalized Laplacian using the Ng, Jordan, and Weiss method.
Input: Similarity matrix $W \in R^{n \times n}$, number of clusters k.
Steps:
Build the similarity graph, initializing edge weights with the values from matrix W.
Calculate the normalized Laplacian matrix $L_{sym}$.
Calculate the first k eigenvectors of $L_{sym}$: $u_1, \ldots, u_k$. Form matrix $U \in R^{n \times k}$, where each column represents an eigenvector $u_1, \ldots, u_k$.
Normalize the rows of matrix U to form matrix $T \in R^{n \times k}$, according to the equation:

$$t_{ij} = \frac{u_{ij}}{\sqrt{\sum_{l=1}^{k} u_{il}^2}} \qquad (2)$$

For each $i = 1, \ldots, n$, extract vector $y_i \in R^k$, corresponding to the i-th row of matrix T.
Cluster the points associated with the vectors $y_i \in R^k$ using the k-means algorithm to form clusters $C_1, \ldots, C_k$.
Output: Clusters $A_1, \ldots, A_k$, where $A_i = \{j \mid y_j \in C_i\}$ contains the points in the i-th cluster.
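The Ng-Jordan-Weiss procedure above can be sketched compactly in NumPy and scikit-learn. The two-group synthetic data below is illustrative only, standing in for the gene expression profiles used in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(W, k, random_state=0):
    """NJW algorithm: L_sym -> k smallest eigenvectors -> row-normalize (eq. 2) -> k-means."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    _, vecs = np.linalg.eigh(L_sym)                     # eigenvalues in ascending order
    U = vecs[:, :k]                                     # eigenvectors of the k smallest eigenvalues
    T = U / np.linalg.norm(U, axis=1, keepdims=True)    # row normalization, eq. (2)
    return KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(T)

# two well-separated synthetic groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (25, 4)), rng.normal(6, 0.5, (25, 4))])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)
np.fill_diagonal(W, 0.0)

labels = njw_spectral_clustering(W, k=2)
print(labels)
```

On clearly separated groups the affinity matrix is nearly block-diagonal, the leading eigenvectors are close to cluster indicators, and k-means in the spectral space recovers the two groups exactly.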
It is important to note that, in all cases, the results of the algorithm depend on the method used
to construct the similarity matrix (i.e., how object proximity is measured) and the desired number
of clusters. However, in many instances, the number of clusters cannot be predetermined, making
it necessary to apply various clustering methods alongside quantitative criteria to evaluate
clustering quality. The choice of proximity metric depends on the type of data. For the gene
expression profiles analyzed in the simulation, a modified hybrid metric is used, combining a
mutual information maximization criterion with Pearson's consistency criterion [14]. The number
of clusters is determined using methods based on an objective inductive clustering technique.
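The hybrid proximity metric itself is defined in [14]; as a rough illustration of the idea (not the exact formulation), a similarity score can blend the absolute Pearson correlation with a normalized mutual information estimate over binned expression values:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def hybrid_similarity(x, y, n_bins=8, alpha=0.5):
    """Illustrative blend of Pearson correlation and mutual information.
    The weight alpha and the binning scheme are illustrative choices, not from [14]."""
    r = abs(np.corrcoef(x, y)[0, 1])
    # discretize both profiles into quantile bins to estimate mutual information
    bx = np.digitize(x, np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    by = np.digitize(y, np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1]))
    mi = mutual_info_score(bx, by)
    mi_max = np.log(n_bins)            # upper bound on MI, used for normalization
    return alpha * r + (1 - alpha) * mi / mi_max

rng = np.random.default_rng(0)
x = rng.normal(size=200)
noisy_copy = x + rng.normal(scale=0.1, size=200)
unrelated = rng.normal(size=200)
print(round(hybrid_similarity(x, noisy_copy), 3), round(hybrid_similarity(x, unrelated), 3))
```

Co-expressed profiles (here, a noisy copy) score near 1 on both components, while unrelated profiles score near 0 on both, which is the property the hybrid metric exploits when building the similarity graph.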
2.2. Hybrid inductive model for clustering GE profiles using the SC
algorithm
The practical implementation of the step-by-step procedure for GE profiles clustering using the
SC algorithm comprises the following phases:
Stage I. Dataset Preparation, Model Initialization
1.1. Form the GE matrix X, where $X \in R^{m \times n}$. Here, m and n are the numbers of genes and samples, respectively.
1.2. Construct a measure to assess the similarity of GE profiles.
1.3. Develop functions to calculate various types of criteria (internal, external, balance) for evaluating the quality of GE profiles clustering.
1.4. Split the GE profiles into two comparable groups A and B.
1.5. Calculate the distance matrices for the GE profiles allocated in the comparable groups.
1.6. Set the range of possible numbers of clusters, kmin and kmax.
Stage II. Clustering of GE Data and Quality Evaluation
2.1. Initialize the number of clusters k = kmin.
2.2. Perform grouping of GE data in the subsets A and B.
2.3. Calculate the internal and corresponding external quality criteria.
2.4. When k is less than kmax , increment the cluster count by one and repeat step 2.2. If not,
proceed to compute the balance criterion using the internal and external metrics obtained.
2.5. Evaluate the results and identify the optimal clustering that maximizes the balance criterion.
Stage III. GE Data Classification
3.1. Create subsets of GE data from the identified clusters to be used as input for a convolutional
neural network (CNN).
3.2. Apply a CNN to the GE data allocated within the formed clusters and assess classification performance metrics.
3.3. Evaluate the findings and generate subsets of co-expressed GE profiles.
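Stages I and II above can be organized as a single driver loop. The sketch below uses random data, scikit-learn's spectral clustering, and stand-in criteria (silhouette for the internal measure, a simple difference-penalized average for the balance), since the concrete WB-index, PBM, and Harrington formulations are given in Section 3 and in [9,10]:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def internal_criterion(X, labels):
    """Stand-in internal quality criterion (silhouette); the paper uses WB-index and PBM."""
    return silhouette_score(X, labels)

rng = np.random.default_rng(0)
# toy 'expression matrix': 60 profiles drawn from 3 latent groups (illustrative data)
X = np.vstack([rng.normal(c, 1.0, (20, 10)) for c in (0, 5, 10)])

# Stage I: split the profiles into two comparable groups A and B
idx = rng.permutation(len(X))
A, B = X[idx[::2]], X[idx[1::2]]

# Stage II: cluster both groups for each k; balance internal and external quality
results = {}
for k in range(2, 6):
    qa = internal_criterion(A, SpectralClustering(n_clusters=k, gamma=0.05, random_state=0).fit_predict(A))
    qb = internal_criterion(B, SpectralClustering(n_clusters=k, gamma=0.05, random_state=0).fit_predict(B))
    external = abs(qa - qb) / (abs(qa) + abs(qb))   # normalized difference of internal measures
    results[k] = 0.5 * (qa + qb) - external          # simple stand-in for the balance criterion

best_k = max(results, key=results.get)
print(best_k)
```

The inductive idea is visible here: a cluster count is accepted only if it scores well inside each subset and, at the same time, behaves consistently across the two independent subsets.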
3. Simulation, Results and Discussion
The modeling was executed using GE data from the GSE19188 dataset [15], which involved
patients undergoing lung cancer research. The data, obtained from the Gene Expression Omnibus
(GEO) [16], includes DNA analysis results from 156 patients using DNA microarray technology. Of
these, 65 were determined to be healthy, whereas 91 were diagnosed with cancer. After filtering
out low-expressed genes, the dataset matrix was reduced to a size of (156×10,000). Based on
previous research [14], we used the WB-index [17] and the PBM criterion [18] as internal
clustering quality metrics. In this case, the most effective clustering occurs when the WB-index is
minimized and the PBM-index is maximized. The external quality index was determined by the
normalized difference of the respective internal measures, computed on subsets A and B. The
balance criterion was assessed using the Harrington method, in accordance with the technique described in
detail in [9,10]. Figure 2 depicts the simulation results. The modeling process involved varying the
number of clusters between 2 and 10.
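For reference, the two internal criteria can be computed from their published definitions. This sketch follows the usual formulations of the WB-index [17] (k·SSW/SSB, minimized) and the PBM index [18] (maximized), applied here to synthetic data rather than the GSE19188 profiles:

```python
import numpy as np
from sklearn.cluster import KMeans

def wb_index(X, labels):
    """WB-index [17]: k * SSW / SSB; lower values indicate better clustering."""
    k = len(np.unique(labels))
    g = X.mean(axis=0)
    ssw = ssb = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        ssw += ((pts - centroid) ** 2).sum()
        ssb += len(pts) * ((centroid - g) ** 2).sum()
    return k * ssw / ssb

def pbm_index(X, labels):
    """PBM index [18]: ((1/k) * (E1/Ek) * Dk)^2; higher values indicate better clustering."""
    k = len(np.unique(labels))
    g = X.mean(axis=0)
    e1 = np.linalg.norm(X - g, axis=1).sum()          # total dispersion to global centroid
    centroids, ek = [], 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        centroids.append(centroid)
        ek += np.linalg.norm(pts - centroid, axis=1).sum()
    dk = max(np.linalg.norm(a - b) for a in centroids for b in centroids)
    return ((e1 / ek) * dk / k) ** 2

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 4)) for c in (0, 4, 8)])
wb, pbm = {}, {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    wb[k], pbm[k] = wb_index(X, labels), pbm_index(X, labels)
print(wb, pbm)
```

On data with three well-separated groups, the WB-index reaches its minimum and the PBM index its maximum at k = 3, mirroring the opposite optimization directions of the two criteria noted above.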
As observed, the internal and external measures of clustering performance can sometimes
conflict with each other, highlighting the importance of calculating the balance measure, which
incorporates both internal and corresponding external metrics. Its maximum value is achieved
when the gene expression profiles are grouped into three clusters (Figure 2d). The internal WB-
index indicates that the best clustering solution involves three clusters for subset A and two for
subset B (Figure 2a). When using the internal PBM index, the optimal clustering for both subsets
aligns with a three-cluster structure (Figure 2b). For the external metrics, the most effective
clustering is a three-cluster configuration when applying the WB-index and a four-cluster
structure when using the PBM criterion (Figure 2c).
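The Harrington-based balancing referred to above is detailed in [9,10]; as a generic illustration of the idea (with toy criterion values and an illustrative scaling, not the paper's numbers), each criterion can be mapped to a desirability d = exp(-exp(-y')) and the balance taken as the geometric mean:

```python
import numpy as np

def desirability(values, minimize=False):
    """Harrington one-sided desirability: code values to a dimensionless y',
    then d = exp(-exp(-y')). The linear coding to [-2, 5] is an illustrative convention."""
    v = np.asarray(values, dtype=float)
    if minimize:
        v = -v                               # lower raw values should score higher
    y = -2.0 + 7.0 * (v - v.min()) / (v.max() - v.min())
    return np.exp(-np.exp(-y))

# toy internal (to maximize) and external (to minimize) criteria over k = 2..6
internal = [0.42, 0.61, 0.55, 0.40, 0.31]
external = [0.30, 0.05, 0.12, 0.25, 0.40]

d_int = desirability(internal)
d_ext = desirability(external, minimize=True)
balance = np.sqrt(d_int * d_ext)             # geometric mean of the two desirabilities
best_k = 2 + int(np.argmax(balance))
print(best_k, np.round(balance, 3))
```

Because the geometric mean collapses whenever either desirability is poor, the balance criterion can only peak where the internal and external measures agree, which is exactly the behavior observed for the three-cluster structure in Figure 2d.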
The next step in implementing the algorithm, whose structural flowchart is shown in Figure 1,
involves applying a CNN to the GE data within the identified groups. To validate the previous
findings on the effectiveness of clustering quality criteria, structures containing 2, 3, and 4 clusters
were examined. The experimental data consisted of 10,000 gene expression profiles from 156 lung
cancer patients. The modeling results are presented in Table 1.
These findings demonstrate that a three-cluster configuration offers the best performance
regarding classification accuracy and the loss function during neural network training. It's worth
mentioning that classification accuracy stays consistently high in all cases, due to the CNN's
effectiveness with this data type and its resilience to noise. The classification accuracy was
assessed on a test subset of data that was not used during the training phase of the neural network.
Notably, for the three-cluster structure, a perfect classification accuracy of 100% was attained for
the third cluster, which contains 4,964 genes, with the lowest loss function value. In the remaining
clusters of this structure, 38 out of 39 objects in the test subset were accurately classified. These
findings provide a strong foundation for improving diagnostic objectivity in complex diseases,
allowing for balanced decision-making based on classification results from different gene
expression clusters through the application of an alternative voting method.
Figure 2: The modeling results demonstrate the practical execution of the hierarchical SC
algorithm grounded in inductive approaches for analyzing complex systems
Table 1
The modeling findings for the application of the proposed model in processing GE data

Structure and results            CL 1     CL 2     CL 3     CL 4
Two-group clustering
  Gene count                     4074     4926     –        –
  Accuracy, %                    95       97       –        –
  Loss                           0.254    0.067    –        –
Three-group clustering
  Gene count                     2487     2549     4964     –
  Accuracy, %                    97       97       100      –
  Loss                           0.141    0.123    0.058    –
Four-group clustering
  Gene count                     1615     2779     4715     891
  Accuracy, %                    97       97       97       95
  Loss                           0.169    0.142    0.189    0.295
4. Conclusions
The hybrid inductive model for clustering gene expression profiles using spectral clustering has
demonstrated high effectiveness in identifying co-expressed gene subsets. Through a series of
modeling experiments, we observed that the three-cluster structure consistently provided optimal
performance, particularly in terms of classification accuracy and minimizing the loss function
during CNN training. This method allowed for the efficient handling of high-dimensional and
noisy data, which is often characteristic of gene expression datasets.
Our results validate the balance criterion as a robust metric for evaluating clustering quality, as
it harmonizes internal and external clustering measures. Furthermore, the application of CNNs to
gene expression data within clusters showed impressive accuracy, achieving perfect classification
in some cases, confirming the potential of this combined approach for disease diagnosis and gene
analysis.
This study opens new avenues for the practical application of hybrid models in the medical field,
particularly in the diagnosis of complex diseases. The model's robustness to noise and its ability to
produce reliable clustering outcomes highlight its potential for enhancing diagnostic objectivity in
clinical settings. Future research could focus on refining the model by experimenting with different
clustering techniques and expanding the approach to other disease types and datasets.
5. Declaration on Generative AI
During the preparation of this work, the authors used ChatGPT and Grammarly exclusively for
grammar and spelling checks, as well as for paraphrasing and rewording. After utilizing these
services, the authors thoroughly reviewed and edited the content as needed and take full
responsibility for the publication's final content.
References
[1] H. Lodish, A. Berk, C.A. Kaiser, et al. Molecular Cell Biology, 9th edition. W.H. Freeman, 2021.
[2] The Cancer Genome Atlas Program (TCGA). National Cancer Institute, Center for Cancer
Genomics, 2024, July, 27, URL: https://www.cancer.gov/ccg/research/genome-sequencing/tcga
[3] S. Yosboon, N. Iam-On, T. Boongoen, P. Keerin, K. Kirimasthong. Optimised multiple data
partitions for cluster-wise imputation of missing values in gene expression data, Expert
Systems with Applications 257 (2024) 125040. doi: 10.1016/j.eswa.2024.125040.
[4] Y. Xu, S. Wang, Q. Feng, et al. scCAD: Cluster decomposition-based anomaly detection for rare
cell identification in single-cell expression data, Nature Communications 15 (1) (2024) 7561.
doi: 10.1038/s41467-024-51891-9.
[5] S. Luo, P.-L. Germain, M.D. Robinson, F. von Meyenn. Benchmarking computational methods
for single-cell chromatin data analysis, Genome Biology 25(1) (2024) 225. doi: 10.1186/s13059-
024-03356-x
[6] Y. Hu, M. Xie, Y. Li, et al. Benchmarking clustering, alignment, and integration methods for
spatial transcriptomics, Genome Biology 25(1) (2024), 212. doi: 10.1186/s13059-024-03361-0.
[7] I. Sakata, Y. Kawahara. Enhancing spectral analysis in nonlinear dynamics with pseudo
eigenfunctions from continuous spectra, Scientific Reports 14(1) (2024) 19276.
doi: 10.1038/s41598-024-69837-y
[8] Y. Liu, X. Lin, Y. Chen, R. Cheng. Multi-order graph clustering with adaptive node-level
weight learning, Pattern Recognition 156 (2024) 110843. doi: 10.1016/j.patcog.2024.110843
[9] S. Babichev, L. Yasinska-Damri, I. Liakh. A Hybrid Model of Cancer Diseases Diagnosis
Based on Gene Expression Data with Joint Use of Data Mining Methods and Machine Learning
Techniques, Applied Sciences (Switzerland) 13(10) (2023) 6022. doi: 10.3390/app13106022.
[10] S. Babichev, L. Yasinska-Damri, I. Liakh, J. Škvor. Hybrid Inductive Model of Differentially and
Co-Expressed Gene Expression Profile Extraction Based on the Joint Use of Clustering
Technique and Convolutional Neural Network, Applied Sciences (Switzerland) 12(22)
(2022) 11795. doi: 10.3390/app122211795.
[11] M. Romero, O. Ramírez, J. Finke, C. Rocha. Supervised Gene Function Prediction Using
Spectral Clustering on Gene Co-expression Networks, Studies in Computational Intelligence
1016 (2022) 652–663. doi: 10.1007/978-3-030-93413-2_54.
[12] K. Yu, W. Xie, L. Wang, S. Zhang, W. Li. Determination of biomarkers from microarray data
using graph neural network and spectral clustering, Scientific Reports 11(1) (2021) 23828.
doi: 10.1038/s41598-021-03316-6.
[13] J. Liu, S. Ge, Y. Cheng, X. Wang. Multi-View Spectral Clustering Based on Multi-Smooth
Representation Fusion for Cancer Subtype Prediction, Frontiers in Genetics 12 (2021) 718915.
doi: 10.3389/fgene.2021.718915.
[14] S. Babichev, L. Yasinska-Damri, I. Liakh, B. Durnyak. Comparison analysis of gene expression
profiles proximity metrics, Symmetry 13(10) (2021) 1812. doi: 10.3390/sym13101812.
[15] J. Hou, J. Aerts, B. den Hamer, et al. Gene expression-based classification of non-small cell
lung carcinomas and survival prediction, PLoS ONE 5 (2010) e10312. doi:
10.1371/journal.pone.0010312.
[16] Gene Expression Omnibus. 2024, July, 20. URL:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
[17] Q. Zhao, P. Fränti. WB-index: A sum-of-squares based index for cluster validity, Data and
Knowledge Engineering 92 (2014) 77–89. doi: 10.1016/j.datak.2014.07.008.
[18] J. Rojas-Thomas, M. Santos, M. Mora, N. Duro. Performance analysis of clustering internal
validation indexes with asymmetric clusters, IEEE Latin America Transactions 17(5)
(2019) 8891949, 807–814. doi: 10.1109/TLA.2019.8891949