<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>to Support the Automated Alzheimer's Disease Diagnosis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio A. Ardagna</string-name>
          <email>claudio.ardagna@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronica Buttaro</string-name>
          <email>veronica.buttaro@uniba.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Ceci</string-name>
          <email>michelangelo.ceci@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Luzzara</string-name>
          <email>marco.luzzara@unimi.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Pellicani</string-name>
          <email>antonio.pellicani@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianvito Pio</string-name>
          <email>gianvito.pio@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antongiacomo Polimeno</string-name>
          <email>antongiacomo.polimeno@unimi.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Laboratory</institution>
          ,
          <addr-line>National Interuniversity Consortium for Informatics (CINI), Via Volturno 58, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Knowledge Technologies, Jožef Stefan Institute</institution>
          ,
          <addr-line>Jamova 39, Ljubljana, 1000</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Computer Science, University of Bari</institution>
          ,
          <addr-line>Via Orabona 4, Bari, 70125</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Dept. of Computer Science, University of Milan</institution>
          ,
          <addr-line>Via Celoria 18, Milan, 20133</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>This paper presents the BA-PHERD platform, a secure AI-enabled platform for automated Alzheimer's disease diagnosis using microRNA expression data. The BA-PHERD platform employs a cloud-edge architecture with secure data ingestion, governance-compliant storage, and role-based access control to ensure privacy protection. From a machine learning viewpoint, the platform also introduces a novel patient embedding method that captures regulatory relationships between miRNAs through a correlation network, integrating expression patterns with RNABERT sequence features and miRNA expression values. Unlike traditional approaches that treat miRNAs independently, our method constructs a correlation network and applies a GraphSAGE-based approach to generate comprehensive patient representations. Experimental validation on a dataset comprising 1,256 subjects (300 controls, 115 MCI, and 841 AD patients) demonstrates significant improvements over competitor methods, achieving an F1-score of 0.66 and an accuracy of 0.76.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>Secure platform</kwd>
        <kwd>Alzheimer's Disease diagnosis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Machine Learning (ML) techniques have been increasingly applied to analyze biological data, and, in
particular, microRNA (miRNA) expression values. These small non-coding RNA molecules regulate
post-transcriptional gene expression [1] and have emerged as potential disease biomarkers due to
their stability in biological fluids and distinctive expression patterns in conditions such as Alzheimer’s
disease [2], Parkinson’s disease [3], and various cancers [4, 5]. The advancement of high-throughput
sequencing technologies [6] has significantly accelerated miRNA discovery, with findings systematically
documented in public repositories, such as miRBase [7]. This large amount of data has enabled researchers
to employ ML algorithms to identify complex expression patterns for disease diagnosis, and to generate
more accurate biological hypotheses that can be validated through in-vitro experiments [8, 9].</p>
      <p>Despite these advances, applying computational approaches to miRNA data raises significant
challenges: i) the limited availability of high-quality data often results in small sample sizes, leading to
overfitting and poor generalization; ii) most datasets exhibit the “curse of
dimensionality”, since they collect expression values of numerous miRNAs (features) for relatively
few patients (instances), which calls for feature extraction/reduction techniques [10]; iii)
traditional ML approaches [11, 12] frequently treat miRNA expression levels as independent
variables, overlooking complex biological interrelationships that could potentially enhance both predictive
accuracy and biological interpretability of the resulting models.</p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>Beyond methodological aspects, the development of systems to effectively gather and manage
miRNA expression data introduces additional challenges, mostly related to data governance. Indeed,
ensuring both data quality and patient privacy protection requires frameworks that comply with specific
regulations, while maintaining the integrity of the collected data. The sensitive nature of patient-derived
miRNA data requires stringent data protection protocols, particularly when integrated with other clinical
information for disease profiling. These governance considerations become increasingly important as
miRNA-based diagnostic approaches move toward clinical implementation.</p>
      <p>To address these challenges, the project PRIN 2022 - BA-PHERD (Big Data Analytics Pipeline for
the Identification of Heterogeneous Extracellular non-coding RNAs as Disease Biomarkers) proposes a
comprehensive pipeline that enables healthcare practitioners to securely collect, manage, and analyze
patient miRNA data while providing automated diagnostic insights to support clinical decision-making.</p>
      <p>The BA-PHERD pipeline provides end-to-end functionality, from secure data collection and
compliant data storage, to advanced analytical capabilities. The BA-PHERD pipeline includes an AI-based
diagnosis module based on a novel patient embedding method that creates patient representations by
modeling their miRNA expression profiles as a correlation network, leveraging the underlying
biological relationships between these molecules. This network-based approach captures both individual
miRNA expression patterns and inter-miRNA interactions to characterize each patient within a unified
representation. The proposed embedding approach integrates data-driven statistical measures with
structural information derived from the pre-trained model RNABERT [13], providing a comprehensive
characterization of patient-specific miRNA activities in disease contexts. The constructed patient
representations can be subsequently adopted for any downstream tasks, including disease diagnosis
and biomarker discovery.</p>
      <p>We evaluate the proposed approach for diagnosis purposes, using clinical miRNA expression datasets
from the GEO repository, with a particular focus on Alzheimer’s disease (AD). This neurodegenerative
disease exhibits distinct diagnostic challenges for physicians, including the fact that initial symptoms
are often mistakenly attributed to normal cognitive aging, and the limitations of existing diagnostic
procedures, which are invasive, costly, and labor-intensive [14]. The considered dataset includes subjects
with Alzheimer’s disease (AD), Mild Cognitive Impairment (MCI), and healthy controls (CN). Particular
attention is given to the MCI category, as it represents a critical transitional phase that could provide
key insights into early detection and intervention [15].</p>
      <p>The remainder of the paper is organized as follows: Section 2 describes the BA-PHERD platform;
Section 3 presents the experimental setup and discusses the results; Section 4 draws some conclusions
and highlights possible future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The BA-PHERD platform</title>
      <p>In this section, we introduce the BA-PHERD platform. In particular, we first provide a brief overview of
the technological architecture. Subsequently, we describe our AI-based diagnosis module, including the
novel patient embedding strategy.</p>
      <sec id="sec-2-1">
        <title>2.1. Architecture Overview</title>
        <p>In the context under consideration, it is important to note that clinicians and researchers may operate
at different paces: clinicians mostly focus on immediate patient care, while researchers work on an
extended timeline, requiring longer periods for data analysis and model development. In addition,
healthcare facilities often lack the computational infrastructure necessary for complex miRNA analysis,
requiring data to be processed in external data centers or specialized facilities. This scenario creates
a critical need for secure data transfer and processing that supports strict privacy requirements, as
sensitive patient information may need to cross organizational boundaries.</p>
        <p>[Figure 1: High-level overview of the platform architecture. Diagram labels: User, Ingestion Component, Edge, Secure Queue, AI-based diagnosis module, Data Warehouse, Cloud, Access Control System.]</p>
        <p>To address these challenges, we propose a digital platform that enables secure data sharing between
clinical sites and research teams. The platform employs a flexible anonymization approach to safeguard
patient privacy throughout the entire pipeline and data life cycle. Initially, clinical sites anonymize
patients’ data, for instance, by replacing patient identifiers with persistent unique codes, ensuring that
sensitive information remains under the control of the healthcare provider. Subsequently, each access
to patients’ data is mediated by access control mechanisms that enforce anonymization policies based
on user roles and permissions. For instance, when a researcher requests access to sensitive data, the
system applies additional anonymization measures tailored to the researcher’s access rights.</p>
        <p>Figure 1 shows a high-level overview of the platform architecture, illustrating the division of
responsibilities between the edge and cloud components. The proposed solution is structured around five
essential components, designed to address specific challenges of analyzing miRNA expression data:
i) Ingestion component. It enables the ingestion of miRNA expression data from clinical sources.
ii) Secure queue. It serves as the communication backbone, ensuring reliable transfer of data and
intermediate processing results while maintaining workflow coordination. Its event-driven
architecture allows the edge and cloud components to independently scale and evolve.
iii) Data Warehouse. It provides governance-compliant management of raw and processed data,
supporting storage and structured queries for data retrieval and analysis.
iv) Access control system. It regulates data access through preconfigured roles and permissions, using
a dual approach: role-based access control (RBAC) that leverages database table structures,
and attribute-based access control (ABAC) that uses metadata tags to enforce detailed access
policies.
v) AI-based diagnosis module. It analyzes miRNA expression profiles exploiting our novel patient
embedding strategy for diagnosis purposes (see Section 2.2 for a detailed description).
The diagram in Figure 1 emphasizes the interaction flow among the five core components of the system.
It highlights how these components collaborate to support machine learning tasks, from secure data
ingestion at the edge to scalable processing and controlled access in the cloud environment.</p>
        <p>The platform is designed to accommodate distinct user roles, each with specific responsibilities and
permissions. In a representative real-world usage scenario, specialist doctors and researchers interact
within the platform as follows.</p>
        <p>Specialist doctors are responsible for uploading patients’ data to the platform, which are then ingested
and stored in the Data Warehouse. During this process, they specify the intended scope of the data,
either for training or predictive purposes. The former builds or updates a model using newly acquired
data, while the latter generates diagnostic predictions based on the latest version of the model. Data
accessed by these tasks are filtered and modified according to data protection policies, enforced through the
Access Control system. The Data Warehouse automatically applies anonymization strategies, based on
the roles/groups of the user who initiated the task. Two main mechanisms are employed: row-level
filtering, which ensures that users view only records associated with their own group, and column-level
masking, which anonymizes specific fields depending on the user’s role. For example, the Access Control
system masks personally identifiable information (PII) columns when accessed by a researcher, whereas
it grants full data access to a specialist doctor.</p>
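        <p>The two mechanisms described above can be illustrated with a minimal sketch. It is not the platform's Access Control system: the column names, role labels, group labels, mask value and sample records are all hypothetical and chosen only to show how row-level filtering and column-level masking combine.</p>

```python
# Hypothetical PII column names and mask value, chosen only for illustration.
PII_COLUMNS = {"patient_name", "birth_date"}
MASK = "***"

def apply_policies(records, user_role, user_group):
    """Return the view of `records` allowed for a given user.

    Row-level filtering: keep only records belonging to the user's group.
    Column-level masking: hide PII columns unless the user is a specialist doctor.
    """
    visible = [r for r in records if r["group"] == user_group]
    if user_role == "specialist_doctor":
        return [dict(r) for r in visible]
    return [
        {k: (MASK if k in PII_COLUMNS else v) for k, v in r.items()}
        for r in visible
    ]

# Toy records (invented values).
records = [
    {"group": "site_a", "patient_name": "Jane Doe", "birth_date": "1950-01-01", "mir_21": 7.2},
    {"group": "site_b", "patient_name": "John Roe", "birth_date": "1948-05-12", "mir_21": 5.9},
]
doctor_view = apply_policies(records, "specialist_doctor", "site_a")
researcher_view = apply_policies(records, "researcher", "site_a")
```

        <p>In this sketch both users belong to the hypothetical group site_a, so each sees a single record; only the researcher's copy has the PII fields masked, while the expression value is left untouched.</p>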
        <p>The results of both types of task are accessible through the platform for a predefined duration,
ensuring timely availability while adhering to data retention policies.</p>
        <p>The proposed approach is based on a containerized software architecture, making it compatible with
complex orchestration systems such as Kubernetes, which enables scalable computational power to
handle intensive machine learning workloads. This integrated platform bridges the gap between clinical
data collection and research analysis, while ensuring appropriate privacy safeguards and supporting
the distinct workflows of both clinical and research environments.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. AI-based diagnosis module</title>
        <p>After acquiring and managing the patient data, the BA-PHERD pipeline proceeds to perform automatic
disease diagnosis exploiting a novel patient embedding approach that leverages miRNA correlation
networks and machine learning techniques.</p>
        <p>The first step involves the construction of a correlation network that captures the regulatory
relationships among miRNAs. This network representation serves as the foundation for deriving meaningful
embeddings that reflect both individual miRNA expression patterns and their functional associations.
More specifically, given a dataset of $n$ patients and $m$ miRNAs represented as a matrix $X \in \mathbb{R}^{n \times m}$, we
compute a correlation matrix $C \in \mathbb{R}^{m \times m}$, where each element $C_{ij}$ represents the Pearson correlation
coefficient between miRNAs $i$ and $j$. We then assess the statistical significance of each correlation using
the Pearson correlation test, transforming $C$ into an adjacency matrix $A$ where only statistically
significant correlations (p-value &lt; 0.05) are retained. The resulting adjacency matrix $A$ defines our miRNA
correlation network $G = (V, E)$, where each vertex represents a miRNA and edges represent significant
correlations between pairs of miRNAs. Furthermore, to enrich each node with biological features,
we utilize RNABERT [13], a pre-trained transformer model designed for RNA sequence embedding.
Specifically, for each miRNA $i \in V$, we adopt RNABERT to generate a feature vector $\mathbf{f}_i$ that captures
complex nucleotide patterns and contextual information within the miRNA sequence.</p>
        <p>Focusing on the obtained correlation network, it is worth noting that we can find negatively correlated
miRNA pairs. These negative correlations are biologically meaningful, and may provide biological
insights into miRNA regulatory mechanisms. To better handle these correlations, we implement a
dedicated transformation phase. Specifically, we define a new matrix $A'$ starting from $A$ and adding, for
each pair $(i, j)$ with significant negative correlation $A_{ij} &lt; 0$, complementary nodes $\tilde{i}$ and $\tilde{j}$ with feature
vectors located, in the embedding space, specularly with respect to the center, namely:
$$\mathbf{f}_{\tilde{i}} = 1 - \mathbf{f}_i, \qquad \mathbf{f}_{\tilde{j}} = 1 - \mathbf{f}_j \tag{1}$$
We then establish two new positive connections in the adjacency matrix $A'$, specifically:
$$A'_{i,\tilde{j}} = |A_{i,j}|, \qquad A'_{j,\tilde{i}} = |A_{j,i}| \tag{2}$$
Notably, this transformation preserves negative relationships while maintaining a coherent network
structure that can be processed by downstream graph-based algorithms.</p>
        <p>After creating the miRNA correlation network, a key step is to obtain a suitable patient
representation that can be used for automated diagnosis. For this purpose, we start by generating informative
embeddings for each miRNA in the correlation network, employing a variant of GraphSAGE [16], an
inductive learning algorithm for node embeddings. Specifically, since we lack node-level labels for
miRNAs, we adopt a self-supervised training approach that aims to predict the existing links in the
network, formulating the objective function as:
$$\mathcal{L} = - \sum_{(i,j) \in E_{\mathrm{pos}}} \log\left(\sigma(h_i \cdot h_j)\right) - \sum_{(i,j) \in E_{\mathrm{neg}}} \log\left(1 - \sigma(h_i \cdot h_j)\right) \tag{3}$$
where $E_{\mathrm{pos}}$ is the set of node pairs that are connected by edges in $G$ or co-occur within a fixed-length
random walk, $E_{\mathrm{neg}}$ is a set of randomly sampled disconnected node pairs, $h_i$ and $h_j$ are the embeddings for
nodes $i$ and $j$, respectively, and $\sigma$ is the sigmoid function.</p>
        <p>GraphSAGE generates node embeddings through iterative neighborhood aggregation, where each
node’s representation is updated by combining information from its local neighborhood structure. For
each node $i \in V$, the $l$-th layer update is computed as:
$$h_i^{(l)} = \sigma\left(W^{(l)} \cdot \mathrm{AGG}^{(l)}\left(\{h_i^{(l-1)}\} \cup \{h_j^{(l-1)} \mid j \in N(i)\}\right)\right) \tag{4}$$
where $W^{(l)}$ is the learnable weight matrix at layer $l$, $\sigma$ is an activation function, and $N(i)$
denotes the
neighborhood of node $i$. To effectively capture the varying importance of correlations in our miRNA
network, we implement a weighted mean aggregator that accounts for correlation strengths rather than
treating all neighbors equally:
$$\mathrm{AGG}^{(l)}(i) = \frac{h_i^{(l-1)} + \sum_{j \in N(i)} w_{ij} \cdot h_j^{(l-1)}}{1 + \sum_{j \in N(i)} w_{ij}} \tag{5}$$
where $w_{ij}$ represents the correlation strength between miRNAs $i$ and $j$, ensuring that strongly correlated
miRNAs contribute more significantly to the embedding update.</p>
        <p>Starting with RNABERT features as initial node representations ($h_i^{(0)} = \mathbf{f}_i$), we apply two GraphSAGE
layers to progressively capture both direct neighborhood correlations and extended two-hop miRNA
relationships within the correlation network. Then, after completing the self-supervised training
process, we systematically discard the complementary nodes $\tilde{i}$ and $\tilde{j}$ that were introduced to handle
negative correlations, while their learned representations have already influenced the embeddings of
the original miRNA nodes through the neighborhood aggregation process. This approach ensures
that the biological insights from negative correlations are preserved in the final embeddings without
artificially expanding the feature space. This process results in an embedding matrix $H \in \mathbb{R}^{m \times d}$, where
each row $h_i$ represents the learned embedding for miRNA $i$, and $d$ is the embedding dimension that
captures both sequence-based features and network topology information.</p>
        <p>Finally, to generate patient-level embeddings that effectively capture individual disease signatures,
we first normalize each row of the original expression matrix $X$ to account for varying sequencing
depths and total RNA content across patients:
$$\tilde{X}_{pi} = \frac{X_{pi}}{\sum_{j=1}^{m} X_{pj}} \tag{6}$$
This row-wise normalization ensures that each patient’s expression profile sums to unity, effectively
controlling for technical variations in total expression magnitude. Subsequently, we project each patient
into the learned miRNA embedding space through a weighted linear combination that leverages both
the patient’s expression profile and the network-informed miRNA representations:
$$Z = \tilde{X} \cdot H \tag{7}$$
where each element $Z_{pk}$ represents the contribution of miRNA embedding dimension $k$ to patient $p$’s
representation, weighted by the patient’s normalized expression levels.</p>
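        <p>Three of the steps described above (the complementary nodes for negative correlations, the weighted mean aggregator, and the patient projection) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it omits the learnable weight matrices, the activation function and the self-supervised training. It also assumes one complementary node per miRNA (reused across pairs) and that the original negative entry is set to zero in the augmented matrix. All function names and the toy data are hypothetical.</p>

```python
def add_complement_nodes(A, F):
    """Augment the network to handle negative correlations.

    For every pair (i, j) with A[i][j] negative, complement nodes i~ and j~
    are created with features 1 - f, and the positive edges i--j~ and j--i~
    receive weight |A[i][j]|. Assumptions: one complement per miRNA, and the
    original negative entry is replaced by zero.
    """
    m = len(A)
    A_aug = [[max(w, 0.0) for w in row] for row in A]
    F_aug = [list(f) for f in F]
    complement_of = {}

    def complement(i):
        if i not in complement_of:
            complement_of[i] = len(F_aug)
            F_aug.append([1.0 - x for x in F[i]])
            for row in A_aug:
                row.append(0.0)
            A_aug.append([0.0] * len(F_aug))
        return complement_of[i]

    for i in range(m):
        for j in range(i + 1, m):
            if 0 > A[i][j]:
                w = abs(A[i][j])
                ci, cj = complement(i), complement(j)
                A_aug[i][cj] = A_aug[cj][i] = w
                A_aug[j][ci] = A_aug[ci][j] = w
    return A_aug, F_aug

def weighted_mean_aggregate(A, H, i):
    """Weighted mean aggregator: (h_i + sum_j w_ij * h_j) / (1 + sum_j w_ij)."""
    numerator = list(H[i])
    denominator = 1.0
    for j, w in enumerate(A[i]):
        if j != i and w > 0:
            denominator += w
            numerator = [a + w * b for a, b in zip(numerator, H[j])]
    return [a / denominator for a in numerator]

def patient_embeddings(X, H):
    """Row-normalize X so that each patient sums to one, then project onto H."""
    Z = []
    for row in X:
        total = sum(row)
        weights = [x / total for x in row]
        Z.append([sum(w * h[k] for w, h in zip(weights, H)) for k in range(len(H[0]))])
    return Z

# Toy network (invented values): 3 miRNAs with 2-dimensional features.
A = [[0.0, 0.8, -0.5], [0.8, 0.0, 0.0], [-0.5, 0.0, 0.0]]
F = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]
A_aug, F_aug = add_complement_nodes(A, F)
agg_0 = weighted_mean_aggregate(A_aug, F_aug, 0)
# F stands in for the trained embedding matrix, for illustration only.
Z = patient_embeddings([[1.0, 3.0, 0.0]], F)
```

        <p>In the toy network, the negative edge between the first and third miRNA is replaced by two positive edges towards the complementary nodes, so the aggregation for the first miRNA mixes its own features with those of its positively correlated neighbor and of the third miRNA's complement.</p>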
        <p>The resulting patient embedding matrix $Z \in \mathbb{R}^{n \times d}$ contains representations for all $n$ patients,
effectively integrating three key sources of information: individual patient expression patterns, miRNA
sequence characteristics from RNABERT, and topological relationships captured from the correlation
network structure. Furthermore, to enhance diagnostic accuracy and create a more comprehensive patient
representation, we integrate these embeddings with relevant clinical variables, including demographic
data (age and sex) and Apolipoprotein E (ApoE) genotype values. ApoE variants are well-established
biomarkers associated with various neurodegenerative and cardiovascular conditions, making them
particularly valuable for complete disease characterization.</p>
        <p>For the final diagnostic classification, we employ a Random Forest (RF) classifier [17] with the
Gini impurity heuristic. RF can naturally handle the integration of heterogeneous features (through
concatenation) from both embeddings and clinical variables, without requiring extensive preprocessing
or feature scaling steps. This ensemble method is particularly well-suited, as it can effectively capture
complex interactions between miRNA-derived features and clinical indicators.</p>
        <p>In summary, this module integrated in the BA-PHERD platform offers several key advantages over
conventional miRNA-based diagnostic approaches: (1) it captures the complex regulatory relationships
between miRNAs, rather than treating them as independent features, (2) it integrates expression-based
and sequence-derived information, providing a more comprehensive representation of miRNA biology,
and (3) it seamlessly combines molecular data with clinical variables for enhanced diagnostic accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>To evaluate the performance of the BA-PHERD platform, we conducted experiments using a dataset
obtained from the NCBI Gene Expression Omnibus (GEO), a comprehensive repository for microarray and
RNA-seq experimental data. The considered dataset combines different AD studies, namely GSE120584,
GSE150693 and GSE242923. The selected studies include 300 control subjects (CN), 115 patients with
mild cognitive impairment (MCI), and 841 patients with Alzheimer’s disease (AD). Inconsistencies in
microRNA identifiers across different studies required some standardization steps to guarantee a reliable
analysis: i) eliminating duplicate identifiers within each dataset to ensure unique representation of
microRNAs; ii) removing whitespaces to maintain consistent formatting; iii) truncating version-specific
suffixes (e.g., v2 from miR-123a-3p-v2); and iv) mapping all identifiers to the latest version as reported
in the miRBase repository.</p>
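      <p>The standardization steps listed above can be sketched as a small normalization routine. This is an illustrative sketch, not the authors' code: the suffix pattern (a trailing hyphen, the letter v and digits), the alias table and the sample identifiers are assumptions, and duplicates are removed after normalization so that variants of the same identifier collapse into one entry.</p>

```python
import re

def standardize_ids(ids, alias_to_current=None):
    """Normalize miRNA identifiers and drop duplicates, preserving order."""
    alias_to_current = alias_to_current or {}
    seen = set()
    result = []
    for raw in ids:
        name = re.sub(r"\s+", "", raw)           # remove whitespace
        name = re.sub(r"-v\d+$", "", name)        # drop a trailing version suffix such as -v2
        name = alias_to_current.get(name, name)   # map to the current name, if known
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result

# Hypothetical identifiers and alias table, for illustration only.
raw_ids = ["miR-123a-3p-v2", " miR-123a-3p", "miR-45 6", "miR-old-1"]
aliases = {"miR-old-1": "miR-new-1"}
clean_ids = standardize_ids(raw_ids, aliases)
```

      <p>A real alias table would have to be derived from the miRBase release used as reference; the one above is a placeholder.</p>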
      <p>We evaluated model performance through a stratified 5-fold cross-validation on the integrated dataset,
ensuring that each fold preserves the original class proportions of the diagnostic categories (CN, MCI,
AD). For each fold, we collected precision, recall, and F1-score measures for individual classes, then
computed the overall performance by averaging these metrics across folds and classes. To ensure equal
importance is given to all diagnostic categories regardless of class size, we report macro-averaged
results, which prevents the larger AD group from dominating the performance evaluation.</p>
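      <p>To make the macro-averaging explicit, the sketch below computes per-class precision, recall and F1-score for a single set of predictions and then takes their unweighted mean over classes. It is a minimal illustration with invented labels. The cross-validation loop and the averaging across folds are omitted, and the function name is hypothetical.</p>

```python
def macro_scores(y_true, y_pred, labels):
    """Macro-averaged precision, recall and F1-score.

    Each class contributes equally to the average, regardless of its size.
    """
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

# Toy predictions (invented), with AD as the majority class.
y_true = ["AD", "AD", "AD", "AD", "CN", "CN", "MCI", "MCI"]
y_pred = ["AD", "AD", "AD", "CN", "CN", "AD", "MCI", "AD"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, ["CN", "MCI", "AD"])
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

      <p>In this toy case the accuracy is driven mostly by the AD class, whereas the macro-averaged recall also reflects the errors on the smaller CN and MCI classes.</p>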
      <p>For the automated disease diagnosis, we employed a Random Forest classifier with 100 trees using
the scikit-learn 1.4 implementation. To comprehensively evaluate the contribution of each view of the
patients, we performed experiments across multiple combinations of the following feature sets:
• the original miRNA expression data (henceforth denoted as expr);
• patient metadata, which contain clinical information, namely age, sex and ApoE (henceforth
denoted as meta);
• the features identified through the embedding approach we propose in Section 2.2, Equation (7)
(henceforth denoted as emb).</p>
      <p>All the considered combinations include the emb view, since it is the core of the proposed method. On
the other hand, we compare the obtained results with those achieved by the approach proposed in [18],
which is mainly based on the adoption of a Random Forest (RF) or a Multi-Layer Perceptron (MLP) on
the concatenation of the views expr and meta.</p>
      <p>[Table 1. Column header: Method]</p>
      <sec id="sec-3-1">
        <title>3.1. Results and Discussion</title>
        <p>In Table 1 we show the obtained results. From the table, we can observe that the competitor based on
RF achieves a maximum F1-score of 0.48 with the expr view, while the MLP variant obtains the same
result when also including the meta view. Both variants suffer from low average recall (0.44-0.47 for RF,
0.40-0.50 for MLP). On the other hand, the embedding strategy implemented in the BA-PHERD platform
demonstrates substantial improvements over the competitors. Indeed, the emb view alone significantly
outperforms all the results obtained by competitors, achieving an F1-score of 0.61 and an accuracy value
of 0.73, representing a 27% improvement in F1-score compared to the best result obtained by competitors
(0.48). The combination of multiple views in our approach leads to even better performance, with
the expr-meta-emb combination achieving the highest F1-score of 0.64 and an accuracy value of 0.76,
outperforming the F1-score of the best competitor by 33%. The meta-emb combination (F1: 0.65, Acc:
0.75) demonstrates that clinical metadata and the features extracted through our embedding strategy
can already reach the best performance in the diagnosis of AD. This is somewhat expected because
expression data are exploited in the computation of the final feature set (see Equation (7)). It is important
to note that, with these two combinations (meta-emb and expr-meta-emb), performances in terms of
precision and recall appear balanced (precision in the interval 0.80-0.84, recall in the interval 0.59-0.60),
suggesting potential clinical utility.</p>
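        <p>The percentages quoted above are relative gains over the best competitor F1-score (0.48). As a quick check of the arithmetic, using only the scores reported in the text (the helper name is hypothetical):</p>

```python
def relative_improvement(score, baseline):
    """Relative gain of `score` over `baseline`, as a fraction of the baseline."""
    return (score - baseline) / baseline

emb_gain = relative_improvement(0.61, 0.48)    # emb view alone
full_gain = relative_improvement(0.64, 0.48)   # expr-meta-emb combination
```

        <p>Rounded to whole percentages, these gains correspond to the 27% and 33% figures given in the discussion.</p>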
        <p>The consistent improvement observed when including the emb view highlights the value of the
proposed patient embedding approach that captures expression patterns through i) the original miRNA
expression values, ii) regulatory relationships via correlation-based miRNA interactions, and iii)
sequence-level information through the features extracted by RNABERT.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we presented the BA-PHERD platform, a result of the PRIN 2022 project BA-PHERD
(Big Data Analytics Pipeline for the Identification of Heterogeneous Extracellular non-coding RNAs as
Disease Biomarkers). This framework addresses the critical need for accurate and early diagnosis of
Alzheimer’s disease through the integration of biological data, a novel patient embedding approach and
machine learning techniques.</p>
      <p>Our approach uniquely captures expression patterns through the original miRNA expression data,
regulatory relationships via correlation-based miRNA interactions, and sequence-level information through
the RNABERT method. The proposed methodology achieves significant performance improvements,
demonstrating a substantial enhancement in F1-score compared to existing methods.</p>
      <p>The complete framework encompasses end-to-end capabilities for gathering, managing, processing,
and analyzing both miRNA expression and clinical data, while maintaining strict compliance with
patient privacy criteria and regulatory requirements. We are currently working on validating the
whole platform, involving actual clinicians, researchers and new patients, whose data are collected and
processed directly in the hospitals. For future work, we plan to incorporate explainability approaches
to identify candidate miRNA biomarkers that contribute most significantly to Alzheimer’s disease
diagnosis, which will be subsequently validated through in-lab experiments.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the European Union - NextGenerationEU through the Italian
Ministry of University and Research, Projects PRIN 2022 “BA-PHERD: Big Data Analytics Pipeline for
the Identification of Heterogeneous Extracellular non-coding RNAs as Disease Biomarkers”, grant n.
2022XABBMA, CUP: H53D23003690006, CUP: G53D23002910006, and CN3 RNA - “National Center for
Gene Therapy and Drugs based on RNA Technology”, CUP: H93C22000430007.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors used Grammarly for grammar and spelling check. The authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
      <p>[12] N. Gilani, R. Arabi Belaghi, Y. Aftabi, E. Faramarzi, T. Edgünlü, M. H. Somi, Identifying potential
miRNA biomarkers for gastric cancer diagnosis using machine learning variable selection approach,
Frontiers in Genetics 12 (2022) 779455.</p>
      <p>[13] M. Akiyama, Y. Sakakibara, Informative RNA base embedding for RNA structural alignment and
clustering by deep representation learning, NAR Genomics and Bioinformatics 4 (2022) lqac012.</p>
      <p>[14] B. P. Leifer, Early diagnosis of Alzheimer’s disease: clinical and economic benefits, Journal of the
American Geriatrics Society 51 (2003) S281–S288.</p>
      <p>[15] M. N. Sabbagh, M. Boada, S. Borson, M. Chilukuri, B. Dubois, J. Ingram, A. Iwata, A. Porsteinsson,
K. Possin, G. Rabinovici, et al., Early detection of mild cognitive impairment (MCI) in primary
care, The Journal of Prevention of Alzheimer’s Disease 7 (2020) 165–170.</p>
      <p>[16] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, Advances in
Neural Information Processing Systems 30 (2017).</p>
      <p>[17] X. Chen, H. Ishwaran, Random forests for genomic data analysis, Genomics 99 (2012) 323–329.</p>
      <p>[18] D. Rosa, A. Pellicani, G. Pio, D. D’Elia, M. Ceci, Exploiting microRNA Expression Data for
the Diagnosis of Disease Conditions and the Discovery of Novel Biomarkers, in: International
Symposium on Methodologies for Intelligent Systems, Springer, 2024, pp. 77–86.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Overview of microrna biogenesis, mechanisms of actions, and circulation</article-title>
          ,
          <source>Frontiers in Endocrinology 9</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <article-title>Micrornas and complex diseases: from experimental results to computational models</article-title>
          ,
          <source>Briefings in bioinformatics 20</source>
          (
          <year>2019</year>
          )
          <fpage>515</fpage>
          -
          <lpage>539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Identification of a panel of five serum mirnas as a biomarker for parkinson's disease</article-title>
          ,
          <source>Parkinsonism &amp; Related Disorders</source>
          <volume>22</volume>
          (
          <year>2016</year>
          )
          <fpage>68</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Carter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Galbraith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Galandiuk</surname>
          </string-name>
          ,
          <article-title>Blood-based micrornas as biomarkers for the diagnosis of colorectal cancer: a systematic review and meta-analysis</article-title>
          ,
          <source>British journal of cancer 116</source>
          (
          <year>2017</year>
          )
          <fpage>762</fpage>
          -
          <lpage>774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Schrauder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Strick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schulz-Wendtland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Strissel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kahmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Loehberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hein</surname>
          </string-name>
          , et al.,
          <article-title>Circulating micro-rnas as potential blood-based markers for early stage breast cancer detection</article-title>
          ,
          <source>PloS one 7</source>
          (
          <year>2012</year>
          )
          <fpage>e29770</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Next-generation sequencing for microrna expression profile</article-title>
          ,
          <source>Bioinformatics in microRNA research</source>
          (
          <year>2017</year>
          )
          <fpage>169</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Griffiths-Jones</surname>
          </string-name>
          ,
          <article-title>mirbase: the microrna sequence database</article-title>
          ,
          <source>MicroRNA Protocols</source>
          (
          <year>2006</year>
          )
          <fpage>129</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vishnoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <article-title>mirna biogenesis and regulation of diseases: an updated overview</article-title>
          ,
          <source>MicroRNA profiling: methods and protocols</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Loglisci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Elia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          ,
          <article-title>Hierarchical and Overlapping Co-Clustering of mRNA: miRNA Interactions</article-title>
          , in:
          <source>ECAI 2012 - 20th European Conference on Artificial Intelligence. Including Prestigious Applications of Artificial Intelligence (PAIS-2012) System Demonstrations Track</source>
          , Montpellier, France, August 27-31,
          <year>2012</year>
          , volume
          <volume>242</volume>
          of Frontiers in Artificial Intelligence and Applications
          , IOS Press,
          <year>2012</year>
          , pp.
          <fpage>654</fpage>
          -
          <lpage>659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Çakir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Binder</surname>
          </string-name>
          ,
          <article-title>Analysis of microrna expression using machine learning</article-title>
          ,
          <source>miRNomics: MicroRNA Biology and Computational Analysis</source>
          (
          <year>2014</year>
          )
          <fpage>257</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Azari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nazari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asadnia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maftooh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nassiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Hassanian</surname>
          </string-name>
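          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghayour-Mobarhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shahidsales</surname>
          </string-name>
          ,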
          <string-name>
            <given-names>M.</given-names>
            <surname>Khazaei</surname>
          </string-name>
          , et al.,
          <article-title>Machine learning algorithms reveal potential mirnas biomarkers in gastric cancer</article-title>
          ,
          <source>Scientific reports 13</source>
          (
          <year>2023</year>
          )
          <fpage>6147</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>