
1613-0073

analysis of longitudinal data of patients with dementia through unsupervised techniques

Patrizia Ribino

Claudia Di Napoli

Giovanni Paragliola

Luca Serino

Francesca Gasparini (0, 4)

Davide Chicco (0, 1, 4), davidechicco@davidechicco.it

0 Dipartimento di Informatica Sistemistica e Comunicazione, Università di Milano-Bicocca, Milan, Italy
1 Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
2 Istituto di Calcolo e Reti ad Alte prestazioni, Consiglio Nazionale delle Ricerche (CNR), Naples, Italy
3 Istituto di Calcolo e Reti ad Alte prestazioni, Consiglio Nazionale delle Ricerche (CNR), Palermo, Italy
4 NeuroMI, Milan Center for Neuroscience, Università di Milano-Bicocca, Milan, Italy

Dementia is a set of mental diseases affecting millions of people worldwide. As with other mental health conditions, it is often difficult to forecast the course of the disease for the patients suffering from it. Data on these patients are usually collected through questionnaires and psychological and cognitive tests over several time points. The resulting longitudinal data can help identify disease trajectories and allow medical doctors to plan specific treatments. In this study, we analyze an open, unrestricted dataset of electronic health records (EHRs) of patients suffering from dementia, called OASIS-2, through several unsupervised machine learning methods (k-means, Hierarchical Clustering, Gaussian Mixture Model, and Spectral Clustering). This dataset contains demographic data and psychological test data collected over five independent visits, with 142 patients at the first visit and ten features. Our goal is to identify clusters of patients that stay stable over the first four visits (we discarded the data of the fifth visit because of its small size) and to characterize these clusters by studying their variables. We also measure the performance of the clustering methods through conventional metrics for internal and external validation. Our preliminary results show that unsupervised techniques can identify significant clusters of patients with mental health issues in this dataset and that Hierarchical Clustering outperforms the other algorithms to this end.

Keywords: dementia, mental health, clustering, unsupervised machine learning, electronic health records, older adults

∗Corresponding author.

CEUR Workshop Proceedings (ceur-ws.org)

1. Introduction

Dementia is the generic name of a set of health issues regarding patients' mental states, including Alzheimer's disease, Parkinson's disease, dementia with Lewy bodies and others. Unlike patients with other diseases of the body, such as breast cancer, patients with dementia cannot undergo a surgical operation, and therefore it is difficult to treat and even to recognize patients suffering from this disease.

Diagnosis of dementia can be done through computed tomography (CT) scans or magnetic resonance imaging (MRI), but these bioengineering techniques require expensive machines that are sometimes unavailable in hospitals. In this context, dementia is often diagnosed through cognitive tests: patients are asked to answer some questions and complete some tests, in written, oral, or computerized form, and their answers are recorded to generate a cognitive test score. The most common cognitive tests are the Clinical Dementia Rating (CDR) [1] and the Mini Mental State Examination (MMSE) [2].

Results of these cognitive tests taken only once do not say much about the mental condition of the patient; therefore, to obtain a reliable diagnosis, it is necessary to ask the patient to undergo these cognitive exams multiple times, at different time points (for example, once every 90 days, one year, or two years). This way, the cognitive decline and the patient's diagnosis can be understood more clearly.

Longitudinal data collected through these cognitive tests can be included in electronic health records (EHRs), which in turn can be used for scientific analyses. Unsupervised machine learning approaches can help identify clusters of patients having a similar trend over time.

In this study, we analyzed a public dataset derived from EHRs called OASIS-2, which contains data on patients suffering from dementia collected at 5 visits. We applied a set of clustering techniques (k-means, Hierarchical Clustering, Gaussian Mixture Models, and Spectral Clustering) to identify groups of patients following the same trends over time. We then investigated the features of these clusters and identified some significant clusters where patients share common traits. Our results show that unsupervised methods can be effective in identifying meaningful groups made of patients suffering from dementia and of healthy individuals, and can have a strong impact on medical practice: once our results are confirmed and validated on another dataset, physicians will be able to use our findings to associate a new patient with one of our clusters after collecting their cognitive test results. Medical doctors will be able to employ this information to design a better treatment.

Predicting dementia trends is useful for medical and economic reasons: patients suffering from dementia need constant assistance through home care facilities or hospitals, which can become expensive. Understanding the semantics of relevant patients' clusters is therefore especially valuable in an ageing society. We also designed this study to investigate which clinical factors best characterize each cluster, in line with the framework of the "minimal electronic health record" [3, 4, 5].

Literature review

Electronic health records (EHRs) analysis represents an unprecedented source of information for health-related applications, ranging from epidemiological monitoring of population diseases to treatment improvement and clinical research [6]. These new data sources may offer new insights into the underlying heterogeneity of dementia, of which Alzheimer's disease is one of the main causes. Nevertheless, building accurate analytic models from EHR data is challenging due to data quality, data and label availability, and heterogeneity of data types.

Traditional health analytics modelling relies on expert-defined phenotyping and ad-hoc feature engineering, leading to models that present limited generalizability across different datasets. Machine learning shifted the modelling paradigm from expert-driven feature engineering to data-driven feature construction [7]. Several works propose supervised learning methods to discover interconnections between diseases, predict the health status of patients, and prevent diseases. Considering the lack of standardized instruments for detecting dementia, it is crucial to develop methods that predict the personalized risk of dementia in advance in order to prevent it.

Deep feedforward networks, namely two Multi-Layer Perceptron models (MLP1 and MLP2), and a Convolutional Bidirectional Long Short-Term Memory network were compared in predicting the risk of dementia using the Alzheimer's Disease Neuroimaging Initiative data [8]. The proposed models identify different patterns for the Dementia, Mild Cognitive Impairment, and Cognitively Normal classes, but extensive preprocessing of the available data was required.

In [9], the authors analyse the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with supervised machine learning techniques to predict future disease states, considering only data from non-invasive measurements derived from blood tests.

In [10], logistic regression (LR), Least Absolute Shrinkage and Selection Operator (LASSO), random forest (RF), and eXtreme Gradient Boosting (XGBoost) algorithms were used to identify subphenotypes of probable Alzheimer's disease (AD) and related dementia using routinely collected data from EHRs.

Others propose unsupervised learning methods to identify new patterns in unlabelled data that can be used to predict the possible evolution of a disease. One of the most important unsupervised learning techniques is clustering, which can help discover patterns and structures in labelled and unlabelled datasets, allowing dementia patients to be divided into subtypes based on key features recorded in the EHR. Moreover, clustering algorithms can find patterns that are difficult to detect even for specialized medical doctors. The authors of [11] propose the Poisson Dirichlet Model (PDM), an unsupervised generative probabilistic model based on the Latent Dirichlet Allocation (LDA), to discover latent disease clusters and to stratify patients into subgroups with similar characteristics and risk factors. The proposed method identifies latent comorbidities that provide additional information on the risk factors of developing the disease other than those correlated to age and sex. Different clustering algorithms (k-means, kernel k-means, affinity propagation, and latent class analysis) were employed in [12] to identify subtypes of Alzheimer's disease from EHRs. Different clusters were found with each clustering method, and one particular cluster emerged in three out of the four adopted clustering methods, suggesting the plausible presence of a specific disease subtype.

A multi-layer clustering algorithm was proposed in [13] to construct clusters of subjects with late mild cognitive impairment (MCI). Clusters of slowly and rapidly declining subjects within the category of late MCI were identified, showing pathological differences that suggest the need to subclassify late MCI subjects further.

A two-stage clustering analysis [14] was applied to individuals with or without dementia to identify subsets on which profiling is carried out according to features reported in the Alzheimer's features dataset. The analysis revealed distinct patterns in the characteristics of dementia patients within the different clusters regarding their sex, socio-economic status, age, education level, and cognitive status.

Most studies differ in the dimensions used by the clustering algorithms, the dataset used, and the variables and groups included in the datasets [15]. In addition, these studies, based on EHRs obtained with cognitive tests and brain scans, do not consider the progression of dementia. Hence, a longitudinal dimension to clustering is necessary to improve the identification of risk factors for the future prognosis of the disease.

We organize the rest of the article as follows. After this Introduction, we describe the analyzed dataset in section 2 and the unsupervised machine learning methods we use in section 3. Afterwards, we report our results in subsection 3.3, and discuss them and outline some conclusions in section 4.

2. Dataset

The dataset used in this paper was derived from the Open Access Series of Imaging Studies (OASIS 2) with longitudinal MRI data [16, 17], and publicly released online [18, 19]. The people in the OASIS 2 study were chosen from a group of individuals who had taken part in magnetic resonance imaging (MRI) longitudinal studies at the Washington University Alzheimer's Disease Research Center (ADRC). The choice was made based on the requirement of having at least two separate visits where both clinical and MRI data were gathered. The project collected data from MRI scans of 150 people between the ages of 60 and 96 years over a follow-up period. Together with these magnetic resonance images, the curators of the dataset collected data on the psychological tests taken by the patients and on their socio-economic status. The mean follow-up time for the cohort is 2.91 (±0.01) years. The sample size decreases markedly as the number of visits increases. For this study, the single response variable was the Clinical Dementia Rating (CDR), developed to measure dementia severity [1]. A CDR value of 0 indicates no dementia, and CDRs of 0.5, 1, 2, and 3 represent very mild, mild, moderate, and severe dementia, respectively. The clinical diagnosis of Alzheimer's disease (AD) was determined as a CDR ≥ 0.5 [16].
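The CDR scale and the diagnostic threshold described above can be written down compactly. The following is a minimal sketch; the names `CDR_LABELS`, `cdr_label`, and `is_impaired` are hypothetical helpers introduced here for illustration and are not part of the dataset or of the original scripts.

```python
# CDR levels and their meaning, as listed in the text above.
# All names in this block are hypothetical helpers for illustration.
CDR_LABELS = {
    0.0: "no dementia",
    0.5: "very mild dementia",
    1.0: "mild dementia",
    2.0: "moderate dementia",
    3.0: "severe dementia",
}


def cdr_label(cdr):
    """Return the severity label for a CDR value; raise on unknown values."""
    try:
        return CDR_LABELS[float(cdr)]
    except KeyError:
        raise ValueError(f"unexpected CDR value: {cdr!r}") from None


def is_impaired(cdr):
    """Apply the CDR >= 0.5 criterion described in the text."""
    return float(cdr) >= 0.5
```

For example, `cdr_label(0.5)` gives "very mild dementia" and `is_impaired(0)` is false.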

Moreover, the OASIS 2 dataset provides several independent variables describing the patients' sociodemographic characteristics: Sex (Female, Male), Age (within the range [60, 96] years at the first visit), Education (within the range [6, 23] years), and Socioeconomic Status (SES; 1 = lower, 2 = lower middle, 3 = middle, 4 = upper middle, 5 = upper). Clinical predictor variables were also available: the Mini-Mental State Exam (MMSE), the Atlas scaling factor (ASF), the estimated total intracranial volume (eTIV), and the normalized whole brain volume (nWBV).

The MMSE is a 30-point questionnaire with 30 questions covering arithmetic, memory, and orientation, used to examine the cognitive condition of individuals. The Estimated Total Intracranial Volume (eTIV) estimates intracranial brain volume. The normalized Whole Brain Volume (nWBV) measures the volume of the whole brain. Finally, the ASF is a one-parameter scaling factor that allows for comparison of the estimated total intracranial volume (eTIV) based on differences in human anatomy. An explanation of the features of this dataset can be found in Table 3 of [20].

For the analysis conducted in this work, we considered the samples gathered during the first four visits, and then reported the results for the first three, since they were the most meaningful. Moreover, patients with missing values for any variable were excluded from the analysis, which reduced the number of patients analysed in this work to 142. In particular, the sample at the first visit comprises 86 healthy subjects and 56 cognitively impaired patients. In this cohort, there were 84 (59%) women and 58 (41%) men. Subjects were passed to the clustering algorithms without the subject ID, MRI ID, and visit age.

3. Methods

3.1. Our approach

We downloaded the dataset from Mendeley Data [19] as a comma-separated value (CSV) file of 27.8 kilobytes (kB). We split the dataset into 5 subsets, each corresponding to a single visit. Since the fifth visit subset contains data from only 6 patients, we discarded it: we used a minimum of 10 subjects per visit as a heuristic threshold for our analysis. We removed the MR Delay variable because it has identical values for the first visit.
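The per-visit split and the 10-subject threshold can be sketched as follows. This is only an illustration under stated assumptions: the column names `Visit` and `MR Delay` are assumed, and the helper functions are hypothetical rather than taken from the original scripts.

```python
from collections import defaultdict

MIN_SUBJECTS_PER_VISIT = 10  # heuristic threshold stated in the text


def split_by_visit(records, visit_key="Visit", min_subjects=MIN_SUBJECTS_PER_VISIT):
    """Group records by visit number and drop visits with too few subjects.

    `records` is an iterable of dict-like rows; `visit_key` is an assumed
    column name.
    """
    by_visit = defaultdict(list)
    for row in records:
        by_visit[int(row[visit_key])].append(row)
    return {
        visit: rows
        for visit, rows in sorted(by_visit.items())
        if len(rows) >= min_subjects
    }


def drop_columns(rows, columns=("MR Delay",)):
    """Return copies of the rows without the given (assumed) column names."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]
```

Rows read with `csv.DictReader` from the downloaded file could be passed directly as `records`; a visit with only 6 rows would then be absent from the returned dictionary.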

We also removed the diagnosis variables CDR and dementia group to utilize them as validation targets. After clustering patients’ data without these two variables, we study how the clusters are characterized with respect to these two factors. For each cluster, we study the average CDR value and the distribution of the dementia group among the patients.
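The validation step described here, in which clusters are characterized after the fact by the two held-out variables, could look like the sketch below. The function name `profile_clusters` and the group strings used in the example are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean


def profile_clusters(labels, cdr_values, groups):
    """Summarize each cluster by size, mean CDR, and diagnostic-group counts.

    The three sequences are aligned: item i of each describes patient i.
    CDR and group are not clustering inputs; they are only used here to
    describe the clusters afterwards.
    """
    cdr_by_cluster = defaultdict(list)
    groups_by_cluster = defaultdict(Counter)
    for label, cdr, group in zip(labels, cdr_values, groups):
        cdr_by_cluster[label].append(cdr)
        groups_by_cluster[label][group] += 1
    return {
        label: {
            "size": len(cdrs),
            "mean_cdr": mean(cdrs),
            "groups": dict(groups_by_cluster[label]),
        }
        for label, cdrs in sorted(cdr_by_cluster.items())
    }
```

A cluster whose mean CDR is close to 0 and whose group counts are dominated by non-demented patients would be read as a mostly healthy cluster.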

We then applied the k-means and Hierarchical Clustering methods to the first-visit subset to optimize the number of clusters; in both cases, the Silhouette score indicated k = 3. We therefore decided to use k = 3 clusters for all visit subsets and algorithms to keep our analyses consistent.
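To make the selection criterion concrete, the sketch below computes the mean silhouette coefficient for a given labelling in plain Python, assuming Euclidean distance. It is an illustration of the standard definition, not the original implementation; evaluating it for each candidate number of clusters and keeping the highest value is one common way to choose k.

```python
from math import dist
from statistics import mean


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Points in singleton clusters get a coefficient of 0, a common convention.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not change the sum)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)
```

Values near +1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.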

Our goal is to find clusters of patients that are stable both through the four visits and through the clustering techniques and to analyze the characteristics and traits of these clusters based on the dataset variables (Table 1).

3.2. Clustering techniques

For the aims of this paper, four types of clustering algorithms have been adopted: k-means [21], Gaussian Mixture Models [22], Spectral Clustering [23], and Hierarchical Clustering [24]. We selected the most popular clustering technique for each of four different clustering approaches [25]: partition-based (k-means), hierarchy-based (Hierarchical Clustering), based on algebraic graph theory (Spectral Clustering), and based on Gaussian distributions (Gaussian Mixture Models with posterior probability calculated through expectation-maximization).

We implemented our scripts by using the Python and R open source programming languages, employing the scikit-learn software library of Python and the NbClust, sClust, FCPS, and table1 packages in R.
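The text names scikit-learn but not the exact estimators or hyperparameters, so the following is only a hedged sketch of how the four approaches and the three internal scores could be run in Python. The choice of `KMeans`, `AgglomerativeClustering`, `GaussianMixture`, and `SpectralClustering` with mostly default settings, the feature standardization, and the function name `cluster_and_score` are all assumptions of this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def cluster_and_score(X, n_clusters=3, seed=0):
    """Fit four clustering models and return labels plus internal scores."""
    # Standardization is an assumption of this sketch, not stated in the text.
    X_std = StandardScaler().fit_transform(np.asarray(X, dtype=float))
    models = {
        "k-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        "hierarchical": AgglomerativeClustering(n_clusters=n_clusters),
        "gaussian_mixture": GaussianMixture(
            n_components=n_clusters, random_state=seed
        ),
        "spectral": SpectralClustering(n_clusters=n_clusters, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(X_std)
        results[name] = {
            "labels": labels,
            "silhouette": silhouette_score(X_std, labels),
            "davies_bouldin": davies_bouldin_score(X_std, labels),
            "calinski_harabasz": calinski_harabasz_score(X_std, labels),
        }
    return results
```

Cluster numbers returned by different models are arbitrary, so cluster 1 of one algorithm does not correspond to cluster 1 of another; clusters have to be matched by their characteristics.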

Metric [range]                    k-means     Spectral Clustering   Gaussian mixture   Hierarchical Clustering
Silhouette Score [−1, +1]         0.189937    0.180429              0.187361           0.202786
Davies-Bouldin Score [0, +∞)      1.795906    1.539607              2.041673           1.726597
Calinski-Harabasz Score [0, +∞)   33.976650   21.358736             28.684796          24.466958

Table 2 and Table 3 compare the results obtained by the different clustering algorithms from a purely analytical point of view. Both tables refer specifically to patients at their first visit, as this visit includes a larger number of patients than subsequent visits.
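For reference, the two unbounded internal scores can be written in their standard textbook form. The formulas are not stated in the text; the symbols below are introduced only for this note: $k$ clusters $C_1, \dots, C_k$ with sizes $n_i$ and centroids $c_i$, $n$ points overall with global centroid $c$, and $s_i$ the mean distance of the points of $C_i$ to $c_i$.

```latex
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
              \frac{s_i + s_j}{\lVert c_i - c_j \rVert},
\qquad
\mathrm{CH} = \frac{\displaystyle \sum_{i=1}^{k} n_i \,\lVert c_i - c \rVert^{2} \,/\, (k - 1)}
                   {\displaystyle \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^{2} \,/\, (n - k)}
```

Lower Davies-Bouldin values indicate more compact and better separated clusters, whereas higher Calinski-Harabasz values indicate a larger ratio of between-cluster to within-cluster dispersion.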

3.3. Results

The following section provides a comprehensive analysis of the clustering results achieved by each adopted algorithm. The assessment of the adopted methods in clustering subjects relies on applying the CDR and the dementia group as validation metrics.

Gaussian Mixture Model Clustering

Figure S1 reports the statistical results concerning the patients across the three separate clusters determined by applying the Gaussian Mixture model for each visit throughout the longitudinal study. By looking at the clustering results obtained on patients examined during the first visit, it is possible to observe that a cohort of 37 individuals has been assigned to Cluster 1. This group exhibits an average age of 79.38 years, an average educational status of 12.38 years, and tends to possess a moderate socio-economic status (3.27) on average. Specifically, the patients grouped within Cluster 1 exhibit the highest average age, the most elevated socio-economic status, and the lowest education level compared to patients belonging to Cluster 2 and Cluster 3. Moreover, Cluster 1 exhibits the lowest MMSE score (24.7) among the three clusters, along with the lowest average value of nWBV. It is worth mentioning that approximately 70% of the individuals in this cluster have been identified as having mild or moderate cognitive decline, as evidenced by the value of CDR score greater than zero, and they have been targeted as Demented (Figure S1). Notably, a higher prevalence of dementia is observed among males than their female counterparts.

On the contrary, Cluster 3 consists of a cohort of 79 individuals, with a notable predominance of females. This cluster is characterized by having the lowest age (73.94), the lowest eTIV (1393.34), the highest MMSE (28.86), as well as the highest nWBV (0.75) on average. The subjects within this cluster exhibit a higher education level than individuals in Cluster 1, notwithstanding a marginally lower socioeconomic status. Furthermore, it is worth mentioning that a considerable majority (74.7%) of patients in this particular cluster were initially identified as exhibiting normal cognitive functioning during their initial visit, as indicated by a Clinical Dementia Rating score of 0. In particular, the prevailing demographic trend within the non-demented patients suggests a greater prevalence of females. However, in this cohort of healthy patients, a proportion of 18.6% (n=11) of individuals experienced conversion throughout the study period, with a marked predominance of female patients (Table S1).

Finally, Cluster 2 comprises 26 individuals, mainly males. The average age of this cluster aligns closely with that of Cluster 3 (i.e., 74.38 compared to 73.94). Cluster 2 presents the highest education level and eTIV but the lowest socio-economic status. The level of MMSE is similar to Cluster 3. The distribution of healthy and demented subjects is very similar. All patients diagnosed with dementia and those who have undergone a conversion are exclusively male.

Regarding the outcomes of the second visit, it is evident that Cluster 3 exhibits analogous characteristics to those observed in Cluster 1 pertaining to the first visit. This outcome can be ascribed to the observation that a substantial majority (97%) of patients from the first visit, who were initially categorized as Cluster 1, have been assigned to Cluster 3 of the second visit. Furthermore, there has been an increase in the average age of the patients since they attended their second visit at least two years after their initial one.

Similarly, Cluster 1 observed for the second visit exhibits comparable characteristics to Cluster 3 associated with the first visit. Cluster 1 comprises a significant majority of subjects, specifically 85%, who are grouped within Cluster 3 of the first visit. Additionally, it was observed that 75.8% of the patients grouped under Cluster 2 of the second visit were also grouped under Cluster 2 of the first visit. Similar grouping characteristics are evident for the third visit.
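Percentages such as the share of a first-visit cluster that ends up in a given second-visit cluster can be obtained from a simple cross-tabulation of patients seen at both visits. The sketch below assumes each visit's result is available as a mapping from subject identifier to cluster number; the function name and data layout are hypothetical.

```python
from collections import Counter, defaultdict


def transition_shares(earlier, later):
    """Share of each earlier-visit cluster assigned to each later-visit cluster.

    `earlier` and `later` map subject identifiers to cluster numbers.
    Subjects missing from the later visit are ignored.
    """
    counts = defaultdict(Counter)
    for subject, source in earlier.items():
        if subject in later:
            counts[source][later[subject]] += 1
    return {
        source: {
            target: n / sum(targets.values()) for target, n in targets.items()
        }
        for source, targets in counts.items()
    }
```

A value close to 1 for a pair of clusters indicates that the same group of patients was recovered at both visits, even if the cluster numbers differ.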

A mere fraction of patients enrolled in the longitudinal study were observed to have attended the fourth scheduled visit. Due to the limited sample size, this grouping is deemed unfit to be considered representative.

Ultimately, the findings obtained using the Gaussian Mixture model suggest that the clusters most representative of individuals with cognitive impairment throughout the longitudinal study are Visit 1 Cluster 1 (V1C1), Visit 2 Cluster 3 (V2C3), and Visit 3 Cluster 1 (V3C1). Conversely, the clusters denoted as V1C3, V2C1, and V3C3 have been identified as the most indicative clusters within the population of healthy subjects.

k-means Clustering

The statistical data about patients categorized into three distinct clusters identified by the k-means model across multiple visits during the longitudinal study are illustrated in Figure S2. By looking at the clustering results related to the patients examined during the first visit, it can be observed that Cluster 1 comprises a total of 53 patients, with 88.68% women and only 11% men. The average age of individuals is 71.96 years, with an education level of 15.26 years and a low-middle socio-economic status (2.17) on average. Cluster 1 effectively groups the noticeably younger individuals, with the highest MMSE (29.15) and the highest nWBV (0.76). Furthermore, a significant proportion of the patients in this cluster, specifically 83%, were diagnosed as cognitively normal at their initial visit. All individuals who experienced conversion during the subsequent visits are identified as female based on the Group label in Figure S2.

Conversely, Cluster 3 comprises 47 individuals, of whom approximately 68% are men and the remaining 32% are women. Cluster 3 has the highest average age (77.85), the highest socio-economic status (3.45), and the lowest education level (12.13) compared to the other clusters of the same visit. Moreover, Cluster 3 shows the lowest MMSE score (26.13) among the three clusters and an average nWBV that is lower than that of Cluster 1 but equal to that of Cluster 2. By looking at CDR, almost 64% of this cluster's patients have been diagnosed with mild or moderate cognitive decline (i.e., CDR > 0) and have been labelled as Demented. The majority of patients who do not exhibit symptoms of dementia are female.

Finally, Cluster 2 encompasses 42 subjects, of which 88% are male and only 12% are female. This cluster comprises individuals whose average age closely aligns with that of Cluster 3. The results show that, on average, individuals in this cluster possess the highest level of education (16.69) and the highest eTIV value (1676.33). However, they exhibit the lowest socioeconomic status (1.83). The cognitive performance as measured by the MMSE is marginally superior compared to Cluster 3. Individuals with well-preserved cognitive abilities and those afflicted by dementia are present in nearly equal proportions.

The findings from the second visit reveal that Cluster 2 exhibits characteristics analogous to those observed for Cluster 1 of the first visit. In 92% of the cases, patients in Cluster 1 at the first visit are assigned to Cluster 2 at the second visit. Similarly, the statistics of Cluster 3 of the second visit are similar to those of Cluster 3 of the first visit. Comparable grouping characteristics are also evident for the third visit.

Hence, the clusters V1C3, V2C3, and V3C3 can potentially be identified as the most representative cohorts exhibiting cognitive impairment. On the contrary, when assessing individuals in good health, results suggest that the clusters V1C1, V2C2, and V3C1 exhibit the utmost degree of representativeness.

Hierarchical Clustering

The table in Figure S3 presents the statistical information about the patients categorized into three distinct clusters based on the Hierarchical Clustering model, as observed during each visit within the longitudinal study. An analysis of the clustering outcomes of the individuals evaluated during their initial visit reveals that Cluster 1 encompasses a cohort of 60 patients, out of which 91.67% are female, and a mere 8% are male. The average age of individuals is 74.58 years, with an average education level of 15.4 years and a low-middle socio-economic status (2.12). Cluster 1 groups individuals with the highest MMSE (29.12) as well as the highest nWBV (0.76). Individuals in Cluster 1 exhibit a comparatively higher education level than individuals within Cluster 3, while their socio-economic status demonstrates a slight decrease. Moreover, the findings indicate that a substantial majority of individuals (86.7%) within this cluster showed a healthy diagnosis during their initial visit, as evidenced by a CDR score of 0. Most individuals diagnosed as non-demented within the cluster exhibit female sex. All converted individuals exclusively pertain to the female sex (group label in Figure S2).

On the contrary, Cluster 3 comprises 54 individuals, with women accounting for 54% and men representing 46% of the group. The average age of this cluster is 76.89 years, with an education level of 12.56 years. Additionally, the average socio-economic status of this cluster is middle-class, with a value of 3.41. Notably, the patients in this cluster have the highest average age and the lowest education level (12.56) compared to the other clusters. Moreover, Cluster 3 shows the lowest MMSE score (26.04) and the lowest nWBV value (0.72) among the three clusters. An analysis of the CDR reveals that 66.7% of the patients within this cluster exhibit mild to moderate cognitive decline, as indicated by a CDR > 0. The Group label in Figure S2 indicates these individuals as Demented.

Lastly, Cluster 2 comprises a cohort of 28 male patients. This grouping comprises individuals with a comparable average age with Cluster 1. The members within this cluster exhibit the most elevated average level of education and the highest eTIV, whereas their socioeconomic status (1.54) is relatively low. The MMSE level (27.5) exhibits a marginal increment compared to Cluster 3 (26.3). The demographic distribution of individuals classified as healthy and those diagnosed with dementia exhibits striking similarity.

Turning to the outcomes of the second visit, Cluster 3 exhibits characteristics analogous to those of Cluster 1 of the first visit. Likewise, Cluster 1 of the second visit has statistical characteristics analogous to those of the first visit's Cluster 3. Comparable grouping characteristics are also observed for the third visit.

Therefore, results suggest that V1C3, V2C1, and V3C1 may be regarded as the most indicative clusters of subjects with cognitive impairment. Similarly, V1C1, V2C3, and V3C3 can be regarded as the most representative clusters of individuals who exhibit sound cognitive function.

Spectral Clustering

Table S1, Table S2, Table S3, and Table S4 present the statistical characteristics of the patients assigned to the three distinct clusters identified by the Spectral Clustering model for every visit conducted in the course of the longitudinal study.

An analysis of the clustering results for the patients who undertook the first visit shows that Cluster 1 includes a total of 83 female patients. The average age of individuals in this cluster is 75.7 years, with an average education level of 14.5 years and a middle socio-economic status (2.53). Cluster 1 encompasses females with the highest MMSE (28.2), as well as the highest nWBV (0.744). Moreover, 72.7% of patients in this cluster were diagnosed as healthy at the time of the first visit (CDR=0). Among them, 12% converted during the longitudinal study.

On the other hand, Cluster 2 consists of a cohort of 30 individuals, primarily comprised of males, with an average age of 75.9 years. Their level of education is measured at an average of 12.6 years, alongside a moderate socio-economic status of 3.33. Mainly, the patients in such a cluster have the highest average age and the lowest education level (12.6) compared to the other clusters. Moreover, Cluster 2 highlights the lowest MMSE score (i.e., 26) and the lowest nWBV value (0.724) among the three clusters. By looking at Group label, 66.7% of the patients belonging to this cluster have been diagnosed as Demented.

Finally, Cluster 3 comprises a cohort of 29 male participants. This cluster comprises individuals with an average age comparable to Cluster 1. The members comprising this particular cluster exhibit the most elevated average education level (17.3) along with the highest eTIV (1700) but the lowest socio-economic status (1.52). The level of MMSE (27.6) exhibits a marginal increase compared to Cluster 2. The prevalence of healthy and demented subjects is nearly equivalent in their distribution.

Regarding the findings of the second visit, it emerges that Cluster 1 exhibits analogous attributes compared to Cluster 2 of the first visit. Similarly, Cluster 3 in the second visit exhibits comparable statistical patterns to those observed in Cluster 1 of the first visit. Similar grouping characteristics are presented for the third visit.

Thus, V1C2, V2C1, and V3C1 can be regarded as the most indicative clusters of cognitively impaired individuals, while V1C1, V2C3, and V3C3 can be regarded as the most representative clusters of healthy participants.

4. Discussion & Conclusions

In this work, we examined a publicly available dataset, namely OASIS-2, which contains information about individuals diagnosed with dementia, collected over the course of five distinct visits. The primary aim of our study was to use a set of different clustering techniques (k-means, Hierarchical Clustering, Gaussian Mixture Models, and Spectral Clustering) to discern patient cohorts exhibiting similar patterns over time.

To assess the effectiveness of the different clustering methods, we analysed the clusters produced by each algorithm, with a specific focus on the first visit. In particular, by referring to the data presented in Table 1, we can see that out of the total population of 142 patients, 86 individuals have no cognitive impairment, and 56 patients have been diagnosed with mild or moderate cognitive impairment. Clustering results show that the k-means algorithm successfully groups together 52.3% (45 out of 86) of healthy patients, while the hierarchical algorithm achieves a higher percentage of 61.6% (53 out of 86).

Notably, the number of healthy individuals at the initial visit is determined by summing the number of non-demented individuals and converted patients. Converted patients refer to individuals who were identified as healthy during the initial visit but later experienced a change in their health status over the follow-up period.

Moreover, the Gaussian model groups together 69.8% (60 out of 86) of healthy patients, and the Spectral Clustering algorithm reaches the same 69.8% (60 out of 86). As concerns the cognitively impaired individuals, the k-means algorithm successfully groups together 53.6% (30 out of 56), the hierarchical algorithm achieves 64.3% (36 out of 56), the Gaussian model 46.4% (26 out of 56), and Spectral Clustering 35.7% (20 out of 56). As we can see, Spectral Clustering and the Gaussian model perform better in clustering healthy patients. On the other hand, Spectral Clustering is the worst in clustering patients with dementia, while the hierarchical model is the best. We can note that the hierarchical model reaches comparable efficacy in clustering healthy and demented patients.
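The percentages above follow directly from the reported counts and the first-visit totals of 86 healthy and 56 impaired patients. The short sketch below reproduces this arithmetic; the variable and function names are hypothetical, and the counts are copied from the text.

```python
HEALTHY_TOTAL = 86   # healthy patients at the first visit
IMPAIRED_TOTAL = 56  # cognitively impaired patients at the first visit

# (healthy patients in the mostly healthy cluster,
#  impaired patients in the mostly impaired cluster), as reported in the text
REPORTED_COUNTS = {
    "k-means": (45, 30),
    "Hierarchical Clustering": (53, 36),
    "Gaussian Mixture Model": (60, 26),
    "Spectral Clustering": (60, 20),
}


def grouped_percentages(counts=REPORTED_COUNTS):
    """Percentage of each diagnostic class captured by its dominant cluster."""
    return {
        name: (
            round(100 * healthy / HEALTHY_TOTAL, 1),
            round(100 * impaired / IMPAIRED_TOTAL, 1),
        )
        for name, (healthy, impaired) in counts.items()
    }
```

For the hierarchical model the two values are 61.6 and 64.3, the smallest gap between the two classes among the four algorithms.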

Furthermore, we analysed the clusters from a different angle, by looking at their cardinality. The k-means algorithm identifies two distinct clusters: Cluster 1, in which 84.9% (45 out of 53) of individuals are diagnosed as healthy, and Cluster 3, in which 63.8% (30 out of 47) of individuals are diagnosed as cognitively impaired. The Gaussian model yields Cluster 3, in which 75.9% (60 out of 79) of individuals are diagnosed as healthy, and Cluster 1, in which 70.3% (26 out of 37) of individuals are cognitively impaired. The hierarchical model yields two particular clusters: Cluster 1, in which 88.3% (53 out of 60) of individuals are healthy, and Cluster 3, in which 66.7% (36 out of 54) of individuals are cognitively impaired.

The Spectral Clustering technique identifies two distinct clusters: Cluster 1, in which approximately 72.3% (60 out of 83) of individuals are classified as healthy, and Cluster 2, in which approximately 66.7% (20 out of 30) of individuals are diagnosed with cognitive impairment.

We can also note in this case that the hierarchical model reaches better results in finding more homogeneous clusters of healthy and demented patients.
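The two views used above (the share of a diagnostic group captured by a cluster, and the share of a cluster belonging to that group) can be recomputed from the reported counts. The sketch below does only that arithmetic; the names `coverage` and `purity` are labels chosen here for illustration and are not terms used in the study.

```python
# Visit-1 counts as reported in the text:
# (patients of the group found in the cluster, total size of that cluster).
GROUP_TOTALS = {"healthy": 86, "impaired": 56}

reported = {
    "k-means":          {"healthy": (45, 53), "impaired": (30, 47)},
    "hierarchical":     {"healthy": (53, 60), "impaired": (36, 54)},
    "gaussian_mixture": {"healthy": (60, 79), "impaired": (26, 37)},
    "spectral":         {"healthy": (60, 83), "impaired": (20, 30)},
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def summarize(counts, totals):
    """Compute, per method and group, coverage of the group and purity of the cluster."""
    summary = {}
    for method, groups in counts.items():
        summary[method] = {
            group: {
                "coverage": percent(hits, totals[group]),  # share of the group captured
                "purity": percent(hits, size),             # share of the cluster in the group
            }
            for group, (hits, size) in groups.items()
        }
    return summary


summary = summarize(reported, GROUP_TOTALS)
for method, groups in summary.items():
    print(method, groups)
```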

Finally, we analysed the features of these clusters and identified some significant clusters where patients share common traits. Mainly, we note that although the cardinality of the clusters differs among algorithms, the features of the clusters are very similar. As reported above, k-means’s V1C1, the Gaussian Mixture Model’s V1C3, Hierarchical Clustering’s V1C1 and Spectral Clustering’s V1C1 can be considered the most representative clusters of healthy subjects for the four algorithms.

In each of these clusters, individuals who have an MMSE score higher than 28 and nWBV greater than 0.744, along with a moderate level of education (roughly 15 years), might have a lower likelihood of being diagnosed with dementia, regardless of their socio-economic status falling within the lower-middle level. Note that here we refer to likelihood and risk in sociological terms, not in statistical terms.

Conversely, as we can see from k-means’s V1C3, the Gaussian Mixture Model’s V1C1, Hierarchical Clustering’s V1C3, and Spectral Clustering’s V1C2, individuals characterized by an MMSE lower than 26 and nWBV lower than 0.72, with a lower-middle educational level (around 12 years), have a higher risk of being diagnosed as demented, although they show a middle socio-economic status (> 3).

It is also worth noting that, differently from the other techniques, the clusters determined by Spectral Clustering are differentiated mostly based on sex.
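Cluster profiles of the kind discussed above (typical MMSE, nWBV, education, and sex composition per cluster) can be obtained by grouping patients by cluster label and averaging each feature. The sketch below is only illustrative: the records are made-up toy values, not OASIS-2 data, and the function name `cluster_profiles` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy records: (cluster label, MMSE, nWBV, years of education, sex).
# These values are invented for illustration and are NOT taken from the dataset.
records = [
    ("C1", 29, 0.750, 16, "F"),
    ("C1", 30, 0.760, 14, "F"),
    ("C1", 29, 0.745, 15, "M"),
    ("C3", 24, 0.700, 12, "M"),
    ("C3", 25, 0.710, 12, "F"),
    ("C3", 22, 0.690, 11, "M"),
]


def cluster_profiles(rows):
    """Group rows by cluster label and summarize each feature."""
    grouped = defaultdict(list)
    for cluster, mmse, nwbv, educ, sex in rows:
        grouped[cluster].append((mmse, nwbv, educ, sex))

    profiles = {}
    for cluster, members in grouped.items():
        profiles[cluster] = {
            "n": len(members),
            "mmse_mean": mean(m[0] for m in members),
            "nwbv_mean": mean(m[1] for m in members),
            "educ_mean": mean(m[2] for m in members),
            # Share of women, useful to spot clusters split mostly by sex.
            "female_share": sum(m[3] == "F" for m in members) / len(members),
        }
    return profiles


profiles = cluster_profiles(records)
for cluster, profile in sorted(profiles.items()):
    print(cluster, profile)
```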

In summary, this study employed various clustering methods to group cohorts of patients exhibiting comparable characteristics, using data collected in a longitudinal investigation. The clustering analysis revealed discernible patterns within the distinct clusters obtained with each technique. Each clustering algorithm identified the same patterns, albeit with varying performance levels. Finally, we derived the following initial insights from these findings, which will serve as a foundation for future investigations on more extensive datasets related to dementia:

• First of all, although age is not a primary cause of dementia, the risk of developing dementia increases with ageing.
• Female subjects with higher values of nWBV and higher values of MMSE present a lower risk of a dementia diagnosis.
• Conversely, subjects (both male and female) with lower nWBV values, a lower MMSE, and a lower education level are more likely to be diagnosed as demented.
• A lower SES value, a higher education level, and a higher MMSE are not good predictors for men: demented and non-demented subjects cannot be effectively distinguished from these features.
• SES values of cognitively impaired patients are distributed across a higher range of values than the SES values of normal patients.
• Further investigation is required to understand the variations between sexes.

Limitations and future developments

Regarding limitations and aspects to improve, we performed our clustering analysis on a single dataset with a limited number of subjects. We therefore cannot claim that our results generalize to most patients suffering from dementia: they are limited to the subjects of this dataset. Moreover, we focused our analysis on the change of the feature values, and not on the moves of the patients between clusters. In the future, we plan to extend the patient-oriented side of this study by considering how many patients move through the clusters for each method. We also plan to apply our approach to a validation cohort dataset, even though we know that it is difficult to find a suitable one with a similar set of features. We also foresee using longitudinal clustering algorithms such as longitudinal k-means (kml [26]) and trajectory analysis (traj [27]), two-step clustering, and dynamic clustering methods [28]. Moreover, we plan to collaborate with medical doctors and collect their insights about the medical significance of the clusters. Finally, we plan to use alternative clustering techniques such as DBSCAN or fuzzy clustering.
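The patient-oriented analysis mentioned among the future developments, counting how many patients move between clusters from one visit to the next, could be sketched as follows. This is a minimal illustration with hypothetical patient identifiers and assignments; it assumes that cluster labels have already been aligned across visits, which clustering algorithms do not guarantee on their own.

```python
from collections import Counter

# Hypothetical cluster assignments per patient at two consecutive visits.
visit1 = {"p01": "C1", "p02": "C1", "p03": "C2", "p04": "C3", "p05": "C3"}
visit2 = {"p01": "C1", "p02": "C2", "p03": "C2", "p04": "C3"}  # p05 has no second visit


def cluster_transitions(before, after):
    """Count (cluster at earlier visit, cluster at later visit) pairs
    for the patients who are present at both visits."""
    shared = sorted(before.keys() & after.keys())
    return Counter((before[p], after[p]) for p in shared)


transitions = cluster_transitions(visit1, visit2)

# Patients whose cluster changed between the two visits.
moved = sum(n for (src, dst), n in transitions.items() if src != dst)

for (src, dst), n in sorted(transitions.items()):
    print(f"{src} -> {dst}: {n}")
print("patients who changed cluster:", moved)
```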

Additional sections

Ethics approval and consent to participate

The authorization to collect the data from patients and to release them publicly was obtained by the original dataset curators [18, 19].

Data availability

The analyzed dataset is openly available under the CC BY NC 3.0 licence on Mendeley Data at https://doi.org/10.17632/tsy6rbc5d4.1

Acknowledgments

The authors thank Luca Cufaro (Università di Milano-Bicocca) for his help.

Funding

This work was funded by the European Union – Next Generation EU programme, in the context of The National Recovery and Resilience Plan, Investment Partenariato Esteso PE8 “Conseguenze e sfide dell’invecchiamento”, Project Age-It (Ageing Well in an Ageing Society). This work was also partially supported by Ministero dell’Università e della Ricerca of Italy under the “Dipartimenti di Eccellenza 2023-2027” ReGAInS grant assigned to Dipartimento di Informatica Sistemistica e Comunicazione at Università di Milano-Bicocca. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[9] J. F. Beltrán, B. M. Wahba, N. Hose, D. Shasha, R. P. Kline, the Alzheimer’s Disease Neuroimaging Initiative, Inexpensive, non-invasive biomarkers predict Alzheimer transition using machine learning analysis of the Alzheimer’s disease neuroimaging (ADNI) database, PLOS One 15 (2020) 1–26. doi:10.1371/journal.pone.0235663.
[10] J. Xu, F. Wang, Z. Xu, P. Adekkanattu, P. Brandt, G. Jiang, R. C. Kiefer, Y. Luo, C. Mao, J. A. Pacheco, L. V. Rasmussen, Y. Zhang, R. Isaacson, J. Pathak, Data-driven discovery of probable Alzheimer’s disease and related dementia subphenotypes using electronic health records, Learning Health Systems 4 (2020). doi:10.1002/lrh2.10246.
[11] Y. Wang, Y. Zhao, T. M. Therneau, E. J. Atkinson, A. P. Tafti, N. Zhang, S. Amin, A. H. Limper, S. Khosla, H. Liu, Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records, Journal of Biomedical Informatics 102 (2020) 103364. doi:10.1016/j.jbi.2019.103364.
[12] N. Alexander, D. Alexander, F. Barkhof, S. Denaxas, Identifying and evaluating clinical subtypes of Alzheimer’s disease in care electronic health records using unsupervised machine learning, BMC Medical Informatics and Decision Making 21 (2021) 1–13. doi:10.1186/s12911-021-01693-6.
[13] D. Gamberger, N. Lavrač, S. Srivatsa, R. E. Tanzi, P. M. Doraiswamy, Identification of clusters of rapid and slow decliners among subjects at risk for Alzheimer’s disease, Scientific Reports 7 (2017) 6763.
[14] G. Turcan, S. Peker, Profiling individuals with dementia using cluster analysis, in: Proceedings of CISTI 2023 – the 18th Iberian Conference on Information Systems and Technologies, 2023, pp. 1–7. doi:10.23919/CISTI58278.2023.10211598.
[15] H. Alashwal, M. El Halaby, J. J. Crouse, A. Abdalla, A. A. Moustafa, The application of unsupervised clustering methods to Alzheimer’s disease, Frontiers in Computational Neuroscience 13 (2019). doi:10.3389/fncom.2019.00031.
[16] D. S. Marcus, A. F. Fotenos, J. G. Csernansky, J. C. Morris, R. L. Buckner, Open access series of imaging studies: longitudinal MRI data in nondemented and demented older adults, Journal of Cognitive Neuroscience 22 (2010) 2677–2684.
[17] OASIS, Open Access Series of Imaging Studies (OASIS), URL: https://www.oasis-brains.org/ (URL visited on 20th September 2023), 2023.
[18] G. Battineni, N. Chintalapudi, F. Amenta, Machine learning in medicine: performance calculation of dementia prediction by support vector machines (SVM), Informatics in Medicine Unlocked 16 (2019) 100200.
[19] G. Battineni, N. Chintalapudi, F. Amenta, Data for: Machine learning in medicine: performance calculation of dementia prediction by support vector machines (SVM), Mendeley Data (2019). doi:10.17632/tsy6rbc5d4.1.
[20] C. Kavitha, V. Mani, S. Srividhya, O. I. Khalaf, C. A. Tavera Romero, Early-stage Alzheimer’s disease prediction using machine learning models, Frontiers in Public Health 10 (2022) 853294.
[21] A. Likas, N. Vlassis, J. J. Verbeek, The global k-means clustering algorithm, Pattern Recognition 36 (2003) 451–461.
[22] D. A. Reynolds, Gaussian mixture models, Encyclopedia of Biometrics 741 (2009).
[23] U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (2007) 395–416.
[24] F. Nielsen, Hierarchical clustering, Introduction to HPC with MPI for Data Science (2016) 195–211.
[25] D. Xu, Y. Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2 (2015) 165–193.
[26] C. Genolini, X. Alacoque, M. Sentenac, C. Arnaud, kml and kml3d: R packages to cluster longitudinal data, Journal of Statistical Software 65 (2015) 1–34.
[27] K. Leffondré, M. Abrahamowicz, A. Regeasse, G. A. Hawker, E. M. Badley, J. McCusker, E. Belzile, Statistical measures were proposed for identifying longitudinal patterns of change in quantitative health indicators, Journal of Clinical Epidemiology 57 (2004) 1049–1062.
[28] J. Diaz-Rozo, C. Bielza, P. Larrañaga, Machine-tool condition monitoring with Gaussian mixture models-based dynamic probabilistic clustering, Engineering Applications of Artificial Intelligence 89 (2020) 103434.

S1. Supplementary information

Figure S2: Clustering results of the k-means model, descriptive tables for each visit. [Table contents not recoverable; the surviving labels are eTIV (mean, std, range) and group counts (Demented, NonDemented, Converted) by sex.]

Table S2: Descriptive statistics of the results of Spectral Clustering applied to the visit #2 subset. [Table contents not recoverable; the surviving labels are age and sex.]

[1] J. C. Morris, The Clinical Dementia Rating (CDR): current version and scoring rules, Neurology (1993).
[2] J. R. Cockrell, M. F. Folstein, Mini-mental state examination, Principles and Practice of Geriatric Psychiatry (2002) 140–141.
[3] Desautels, Calvert, Hoffman, Jay, Kerem, Shieh, Shimabukuro, Chettipally, M. D. Feldman, Barton, D. J. Wales, R. Das, Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach, JMIR Medical Informatics 4 (2016) e5909.
[4] Radhachandran, Garikipati, N. S. Zelin, Pellegrini, Ghandian, Calvert, Hoffman, Mao, R. Das, Prediction of short-term mortality in acute heart failure patients using minimal electronic health record data, BioData Mining 14 (2021) 1–15.
[5] C.-W. Liang, H.-C. Yang, M. M. Islam, P. A. A. Nguyen, Y.-T. Feng, Z. Y. Hou, C.-W. Huang, T. N. Poly, Y.-C. J. Li, Predicting hepatocellular carcinoma with minimal features from electronic health records: development of a deep learning model, JMIR Cancer 7 (2021) e19812.
[6] Xiao, Choi, Sun, Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review, Journal of the American Medical Informatics Association 25 (2018) 1419–1428. doi:10.1093/jamia/ocy068.
[7] Ben Miled, Haas, C. M. Black, R. K. Khandker, Chandrasekaran, Lipton, M. A. Boustani, Predicting dementia with routine care EMR data, Artificial Intelligence in Medicine 102 (2020) 101771. doi:10.1016/j.artmed.2019.101771.
[8] Stamate, Smith, Tsygancov, Vorobev, Langham, Stahl, Reeves, Applying deep learning to predicting dementia and mild cognitive impairment, in: Artificial Intelligence Applications and Innovations: Proceedings of AIAI 2020 – the 16th IFIP WG 12.5 International Conference, Neos Marmaras, Greece, June 5-7, 2020, Part 16, Springer, 2020, pp. 308–319.

Spectral Clustering applied to the 3rd visit subset: cluster 1 (N=4), cluster 2 (N=19), cluster 3 (N=32).