<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Metaverse for Intelligent Healthcare: Opportunities and Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valerio Guarrasi</string-name>
          <email>valerio.guarrasi@unicampus.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Tronchin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camillo Maria Caruso</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aurora Rofena</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Manni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatih Aksu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Paolo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulio Iannello</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosa Sicilia</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ermanno Cordelli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Soda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biomedical Sciences, Humanitas University</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Radiation Sciences, Radiation Physics, Biomedical Engineering, Umeå University</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Unit of Computer Systems and Bioinformatics, Department of Engineering, University Campus Bio-Medico of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>This paper discusses the development of a metaverse for intelligent healthcare, which involves creating a virtual environment where healthcare professionals, patients, and researchers can interact and collaborate using digital technologies. The metaverse can improve the efficiency and effectiveness of healthcare services and provide new opportunities for research and innovation. AI models are necessary for analyzing patient data and providing personalized healthcare recommendations, but the data in a metaverse setting is inherently multimodal, unstructured, noisy, incomplete, limited, or partially inconsistent, which poses a challenge for AI models. Moreover, the integration of AI models becomes necessary for the development of virtual scanners to simulate image modalities, and of robotics to simulate surgical procedures within a virtual environment. The ultimate goal is to leverage the power of AI to enhance the quality of healthcare in a metaverse for intelligent healthcare, which has the potential to transform the way healthcare services are delivered and improve health outcomes for patients worldwide.</p>
      </abstract>
      <kwd-group>
        <kwd>Challenges</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The development of a metaverse for intelligent healthcare involves the creation of a virtual environment where healthcare professionals, patients, and researchers can interact and collaborate using digital technologies [1]. The metaverse can be thought of as a virtual world where users can engage with each other in real-time, create and manipulate digital objects, and perform complex tasks in a simulated environment. In the context of healthcare, the metaverse can be used to improve the efficiency and effectiveness of healthcare services, as well as provide new opportunities for research and innovation. For example, healthcare professionals can use the metaverse to perform virtual consultations with patients, monitor their health remotely, and collaborate with other healthcare providers in real-time.</p>
      <p>The development of a metaverse for intelligent healthcare requires a range of technologies and skills, including artificial intelligence (AI), virtual and augmented reality, natural language processing, and data analytics. These technologies can be used to create realistic simulations of healthcare scenarios, analyze patient data, and provide personalized healthcare recommendations. However, the data in such a setting is inherently multimodal, unstructured, noisy, incomplete, limited, or partially inconsistent, which poses a challenge for AI models. Therefore, there is a need for resilient AI, and it becomes necessary to integrate AI models into virtual scanners to generate virtual medical images for patients. These needs motivate the research directions overviewed in the remainder of this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Multimodal Learning</title>
      <sec id="sec-2-1">
        <title>The healthcare metaverse can use multimodal data in var</title>
      </sec>
      <sec id="sec-2-2">
        <title>Multimodal data refers to data that is generated from</title>
        <p>new opportunities for research and innovation. For ex- limited, or partially inconsistent, which poses a challenge</p>
        <p>The development of a metaverse for intelligent health- ners to generate virtual medical images for patients. This
multiple sources [2].</p>
        <p>By combining data from diferent sources, such as
medical images, genetic information, medical history
available within the electronic health records, and lifestyle
data, the healthcare metaverse can exploit multimodal
data to create personalized patient health profiles that,
in turn, ofer the chance to tailor treatments and
interventions. This would help improve the accuracy and
eficiency of medical diagnoses and prognoses.</p>
        <p>Next sections overview research ongoing in our lab on
this topic.
2.1. When, which and how?
end-to-end model. It exploits Pareto multi-objective
optimization working with a performance metric and the
diversity score of multiple candidate unimodal neural
networks to be fused. We attain state-of-the-art results,
not only outperforming the baseline performance but
also being robust to external validation. Via this method,
we automatically understand “which” are the most suited
modalities and models for the task, “when” the fusion of
the modalities should occur, and “how” to optimally fuse
them.</p>
        <sec id="sec-2-2-1">
          <title>2.2. Multimodal Ensemble for Overall</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Survival</title>
          <p>Deep multimodal learning has shown promising perfor- In the context of lung cancer, multimodal learning is
mance also encompassing those attained by traditional becoming an increasingly important research area as
machine learning approaches. This happens because it can provide insights into the optimal treatment for
deep neural networks permit us to fuse the learners ex- aggressive tumors. In this particular work [5], we utilized
ploiting the loss backpropagation at diferent depths. the CLARO dataset [7], which includes CT images and
However, understanding “when”, “which”, and “how” clinical data from non-small-cell lung cancer patients, to
to fuse the modalities is the main open methodological investigate the use of multimodal learning for predicting
research question now opened. overall survival.</p>
          <p>The “when” question involves determining the optimal We employed a late fusion approach, which involves
depth in the network architecture to combine diferent training unimodal models on each modality separately
types of data, such as imaging and clinical data, to im- and then combining the results using an ensemble
prove accuracy and reliability. To determine the ideal method. We selected the optimal set of classifiers for
point of fusion among the diferent modalities in a joint the ensemble by solving a multi-objective optimization
fusion scenario, we have developed an iterative algorithm problem that maximizes both performance and diversity
that increases the number of fusion connections among of the unimodal models. The results of this study
demonconvolutional networks, enabling us to obtain optimal strate the potential of multimodal learning in improving
results. Our findings indicate that the gradual fusion the prediction of overall survival in lung cancer patients.
mechanism among modalities is highly efective, result- The proposed ensemble outperformed models trained on
ing in superior performance compared to traditional fu- a single modality and achieves state-of-the-art results on
sion techniques. the task at hand.</p>
          <p>The “which” question involves determining which
modalities and which models to fuse. Indeed, given a 2.3. Multimodal XAI
task characterized by a specific source of data available,
practitioners and researchers often need to find out the Explainable Artificial Intelligence (XAI) methods are
bemost relevant and useful models. In this respect in [3] coming increasingly important in the development of
we present a multi-objective optimized ensemble search intelligent healthcare systems, particularly in the context
that selects the best ensemble of networks to satisfy a of multimodal learning [8].
classification task, outperforming individual models. Our The use of multimodal learning can also make it more
method optimizes the search of this optimal ensemble by challenging to understand how the model arrived at its
deexploiting both evaluation and diversity metrics and it cision, particularly when the model is making decisions
was applied to various contexts like COVID-19 diagno- based on multiple inputs. This is where XAI methods
sis [4] and lung cancer overall survival [5] (deepened in become essential. XAI methods can help healthcare
prothe next subsection). fessionals and patients understand how the model arrived</p>
          <p>The “how” question deals with determining how to at its decision by providing interpretable and transparent
fuse the modalities. This involves determining the best explanations. This can help increase trust in the system
method for integrating diferent types of data, such as and ultimately improve patient outcomes.
using convolutional neural networks to analyze images In the context of a metaverse for intelligent healthcare,
and recurrent neural networks to analyze time-series XAI methods could be used to explain how the system
data. is diagnosing a patient based on their symptoms and</p>
          <p>To cope with these three questions in [6] we present medical history. This explanation could be provided in a
a novel approach optimizing the setup of a multimodal
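        <p>To make the selection step concrete, the snippet below is a minimal Python sketch of Pareto-front filtering over a performance metric and a diversity score; the candidate fusions and their scores are hypothetical placeholders, not results from [6]:</p>
        <preformat>
# Minimal sketch: Pareto selection over (performance, diversity).
# Candidate names and scores are hypothetical placeholders.
candidates = [
    ("cnn_ct + mlp_clinical", 0.86, 0.42),
    ("cnn_ct + cnn_xray", 0.84, 0.55),
    ("mlp_clinical + cnn_xray", 0.81, 0.61),
    ("cnn_ct + mlp_clinical + cnn_xray", 0.87, 0.38),
]

def dominates(a, b):
    """a dominates b if it is at least as good on both objectives
    and strictly better on at least one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])

# keep only candidates that no other candidate dominates
pareto_front = [c for c in candidates
                if not any(dominates(o, c) for o in candidates if o is not c)]
print([name for name, _, _ in pareto_front])
        </preformat>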
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multimodal Ensemble for Overall Survival</title>
        <p>In the context of lung cancer, multimodal learning is becoming an increasingly important research area, as it can provide insights into the optimal treatment for aggressive tumors. In this particular work [5], we utilized the CLARO dataset [7], which includes CT images and clinical data from non-small-cell lung cancer patients, to investigate the use of multimodal learning for predicting overall survival.</p>
        <p>We employed a late fusion approach, which involves training unimodal models on each modality separately and then combining the results using an ensemble method. We selected the optimal set of classifiers for the ensemble by solving a multi-objective optimization problem that maximizes both the performance and the diversity of the unimodal models. The results of this study demonstrate the potential of multimodal learning in improving the prediction of overall survival in lung cancer patients. The proposed ensemble outperformed models trained on a single modality and achieves state-of-the-art results on the task at hand.</p>
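        <p>The following minimal sketch illustrates the late-fusion idea, combining the class-probability outputs of separately trained unimodal models by weighted averaging; the two stand-in predictors and their outputs are hypothetical, not the CLARO pipeline:</p>
        <preformat>
# Minimal late-fusion sketch: unimodal models are trained separately
# and their predicted probabilities are combined at inference time.
import numpy as np

def predict_imaging(ct_features):       # placeholder unimodal model
    return np.array([0.3, 0.7])         # P(class 0), P(class 1)

def predict_clinical(clinical_feats):   # placeholder unimodal model
    return np.array([0.45, 0.55])

def late_fusion(prob_list, weights=None):
    """Combine unimodal probability vectors by (weighted) averaging."""
    probs = np.stack(prob_list)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ probs

fused = late_fusion([predict_imaging(None), predict_clinical(None)])
print(fused, fused.argmax())            # ensemble probabilities and decision
        </preformat>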
      <sec id="sec-2-3">
        <title>Missing data is a common problem in healthcare datasets,</title>
        <p>occurring when some information is not available for
some patients or variables in a dataset. Missing data not
only could bias the results, but it often contrasts with the
needs of AI models, which often require complete data
to function properly.</p>
        <p>There are several methods for dealing with missing
data in healthcare datasets, including imputation or
deletion techniques. However, each approach has its own
strengths and limitations, and the appropriate method
depends on the specifics of the dataset and the research
2.4. Federated Learning question being addressed.</p>
        <p>To address such issues, we developed a Transformer
It is common practice in AI research to collect data from model that masks the missing features during
trainmultiple sources and send it to a central server for compu- ing [12], so that the model ignores any missing data
tation. However, this approach poses several challenges during training. This approach eliminates the need for
in healthcare where sensitive patient data must be pro- imputation or deletion techniques, as the model simply
tected to avoid privacy violations. does not consider the missing data. In practice, the
pro</p>
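        <p>A minimal sketch of the latent-shift idea follows, assuming simple linear stand-ins for the encoders, decoders, and classifier (the actual architectures of [9] differ): the joint embedding is shifted along the classifier gradient to simulate a counterfactual, and each modality is scored by how much its reconstruction changes.</p>
        <preformat>
# Minimal latent-shift sketch with stand-in modules (illustrative only).
import torch
import torch.nn as nn

def latent_shift_importance(enc_img, enc_tab, dec_img, dec_tab,
                            classifier, x_img, x_tab, lam=1.0):
    z = torch.cat([enc_img(x_img), enc_tab(x_tab)], dim=1)
    z = z.detach().requires_grad_(True)       # leaf tensor for autograd
    grad = torch.autograd.grad(classifier(z).sum(), z)[0]
    z_shift = z - lam * grad                  # counterfactual embedding
    d_img = (dec_img(z_shift) - dec_img(z)).abs().mean()
    d_tab = (dec_tab(z_shift) - dec_tab(z)).abs().mean()
    return d_img.item(), d_tab.item()         # per-modality importance

# toy usage with linear stand-ins for the autoencoders and the MLP head
enc_i, enc_t = nn.Linear(32, 8), nn.Linear(10, 8)
dec_i, dec_t = nn.Linear(16, 32), nn.Linear(16, 10)
clf = nn.Linear(16, 1)
print(latent_shift_importance(enc_i, enc_t, dec_i, dec_t, clf,
                              torch.randn(4, 32), torch.randn(4, 10)))
        </preformat>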
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Federated Learning</title>
        <p>It is common practice in AI research to collect data from multiple sources and send it to a central server for computation. However, this approach poses several challenges in healthcare, where sensitive patient data must be protected to avoid privacy violations.</p>
        <p>To address this issue, federated learning has emerged as a potential solution [11]. This is a critical requirement for the development of an efficient metaverse of intelligent healthcare, where patient data is protected and used to enhance patient care. Federated learning allows for the training of a shared global model with a central server while keeping the data private in local institutions, thereby promoting a greener and more secure practice.</p>
        <p>In this area, we propose a new paradigm as a variant of the most widely used approach. Instead of training individual client models independently for a number of epochs each turn and then sending their weights back to the server, which aggregates them into a single set of weights and redistributes them to the models, the concept of a token is introduced, passed at each epoch sequentially or randomly among the clients; it allows the weights to be sent to the server only by its owner, who redistributes them directly to all models. The absence of local training epochs and the immediate broadcast allow the entire paradigm to be restructured by building a single model passed between clients, even eliminating the role of the server and thus halving the number of parameters sent in each round.</p>
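        <p>The snippet below is a toy sketch of one possible reading of this token-passing scheme in its server-less form: a single shared model travels among clients and only the current token holder trains it each round, so no aggregation step is needed. Clients and data are random placeholders.</p>
        <preformat>
# Toy sketch of token-passing, server-less federated training.
import random
import torch
import torch.nn as nn

clients = [(torch.randn(16, 4), torch.randn(16, 1)) for _ in range(3)]
model = nn.Linear(4, 1)              # the one model passed between clients
opt = torch.optim.SGD(model.parameters(), lr=0.01)

order = list(range(len(clients)))
for rnd in range(5):
    random.shuffle(order)            # token handed randomly among clients
    for holder in order:             # only the token holder trains
        x, y = clients[holder]
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    print(f"round {rnd}: last holder loss {loss.item():.4f}")
        </preformat>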
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Resilient AI</title>
      <p>The problem of unstructured, noisy, incomplete, limited in number, or partially inconsistent data is a significant challenge for many areas of data-driven research, especially in healthcare. In AI, such situations could impact models’ accuracy and reliability, leading to incorrect or biased outcomes. Hence, developing resilient AI systems able to handle such types of data is crucial to ensure the ethical and responsible use of AI in various applications.</p>
      <sec id="sec-3-1">
        <title>3.1. Missing Features</title>
        <p>Missing data is a common problem in healthcare datasets, occurring when some information is not available for some patients or variables. Missing data not only could bias the results, but it often contrasts with the needs of AI models, which often require complete data to function properly.</p>
        <p>There are several methods for dealing with missing data in healthcare datasets, including imputation or deletion techniques. However, each approach has its own strengths and limitations, and the appropriate method depends on the specifics of the dataset and the research question being addressed.</p>
        <p>To address such issues, we developed a Transformer model that masks the missing features during training [12], so that the model ignores any missing data. This approach eliminates the need for imputation or deletion techniques, as the model simply does not consider the missing data. In practice, the proposed method leverages the idea of the mask inside the self-attention module to learn from incomplete input data, which signals to the model only the positional encoding, eliminating the input embeddings. Furthermore, to shift to the use of tabular data with a Transformer model, we introduce a novel type of positional encoding that identifies the feature and not the position.</p>
        <p>We experimentally validated the method on a classification task that aims to predict the overall survival of patients affected by non-small-cell lung cancer using clinical data from the CLARO [7] project. The results show that this approach overcomes the limitations of traditional imputation methods, reduces bias, and improves the accuracy of the final analysis.</p>
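        <p>A minimal sketch of the masking idea follows (an illustration under our assumptions, not the exact architecture of [12]): each tabular feature becomes a token, a learned per-feature embedding plays the role of the positional encoding, and missing features are excluded through the attention mask (src_key_padding_mask in PyTorch).</p>
        <preformat>
# Minimal sketch: Transformer over tabular tokens with missing features
# removed via the attention mask; per-feature embeddings identify the
# feature rather than the position. Illustrative, not the model of [12].
import torch
import torch.nn as nn

class MaskedTabularTransformer(nn.Module):
    def __init__(self, n_features, d_model=32):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)
        # encodes WHICH feature a token is, not its position
        self.feature_id = nn.Embedding(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x, missing_mask):
        # x: (batch, n_features); missing_mask: True where value is missing
        ids = torch.arange(x.shape[1], device=x.device)
        tokens = self.value_proj(x.nan_to_num().unsqueeze(-1)) + self.feature_id(ids)
        h = self.encoder(tokens, src_key_padding_mask=missing_mask)
        kept = (~missing_mask).unsqueeze(-1)           # pool observed tokens only
        pooled = (h * kept).sum(1) / kept.sum(1).clamp(min=1)
        return self.head(pooled)

x = torch.tensor([[0.5, float("nan"), 1.2, 0.0]])
model = MaskedTabularTransformer(n_features=4)
print(model(x, missing_mask=x.isnan()))
        </preformat>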
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Siamese Networks</title>
        <p>It is well known that AI’s power of analyzing vast amounts of data is an element lying behind models’ performance. However, data availability is a major barrier in many domains, healthcare and the metaverse included. To overcome this limitation, several works in the literature have studied how to learn in case of limited training data and, among them, Siamese networks are a viable alternative. They consist of two or more identical networks working in parallel, and triplet networks are an established subtype of this approach, where three identical networks are used. During training, two of the inputs of these three networks belong to the same class, while the third belongs to a different class. The main goal is to learn a feature space where each class forms a cluster, used in the inference step to classify new instances. Since triplet networks utilize inter-class diversities in addition to intra-class similarities, and the number of possible triplets is much higher than the number of instances, they can be a convenient option for applications where the data is limited.</p>
        <p>On a private dataset with 86 patients, representing a real case with limited data, we demonstrated that triplet networks outperform deep networks using the softmax classifier to predict histological NSCLC subtypes from CT scans.</p>
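        <p>The snippet below gives a minimal, self-contained triplet-network training loop with random placeholder data; the encoder and the margin are illustrative choices, not the configuration used in our study.</p>
        <preformat>
# Minimal triplet-network sketch: one weight-shared encoder applied three
# times, trained so same-class pairs end up closer than different-class
# pairs by a margin. Data are random placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))
loss_fn = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

anchor, positive = torch.randn(8, 64), torch.randn(8, 64)  # same class
negative = torch.randn(8, 64)                              # different class

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
    loss.backward()
    opt.step()
# At inference, a new instance is assigned to the nearest class cluster
# in the learned embedding space (e.g., by k-NN on class centroids).
        </preformat>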
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Named-entity Recognition</title>
        <p>The use of electronic health records (EHRs) can provide valuable data for medical research, since physicians register information that describes symptoms, the diagnosis, treatments, and the evolution of the patient over time, which represents an invaluable source of data to build a patient metaverse. However, the analysis of EHRs can be difficult due to the presence of a large amount of unstructured data and the complex clinical language used by physicians. Clinical named-entity recognition (NER) is the natural language processing task of identifying and categorizing medical information (entities) in clinical text.</p>
        <p>We investigate the use of clinical NER to extract relevant clinical information from Italian-language EHRs of NSCLC patients included in the CLARO project, investigating a transformer-based approach. In particular, we aim to fine-tune bioBIT, an Italian biomedical pretrained model that derives from BERT, for the task of clinical NER. To this end, we first identified a set of relevant entities, which are then used to annotate the data cohort.</p>
        <p>Class imbalance is a common issue for clinical NER, since some entities may appear much less frequently than others. To overcome it, we propose to use the Focal Loss, making the model more focused on rarer entities. On a cohort of patients affected by non-small-cell lung cancer, we obtained promising F1 score results in the recognition of clinical entities.</p>
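        <p>For concreteness, a minimal focal-loss sketch for token classification follows; the number of entity classes and the focusing parameter gamma are illustrative choices, not our exact setup.</p>
        <preformat>
# Minimal focal-loss sketch for token classification: down-weights easy,
# frequent entities so training focuses on rarer ones.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (n_tokens, n_entity_types); targets: (n_tokens,)
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                       # probability of the true class
    return (-((1 - pt) ** gamma) * log_pt).mean()

logits = torch.randn(10, 5, requires_grad=True)   # 5 entity classes
targets = torch.randint(0, 5, (10,))
loss = focal_loss(logits, targets)
loss.backward()
print(loss.item())
        </preformat>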
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Virtual Scanner</title>
      <p>In a metaverse for intelligent healthcare, a virtual scanner refers to a computer-generated imaging device that uses virtual reality technology to create medical images of a patient’s body. Virtual scanners can be used to create detailed 3D images of internal organs, bones, tissues, and other structures without the need for invasive procedures, also minimizing patient discomfort as well as allowing medical professionals to view and manipulate images in ways that would not be possible with traditional imaging techniques. AI image translation models can be particularly useful in this context. These models use deep learning (DL) algorithms to translate medical images from one modality to another. In this context, we are working in three directions.</p>
      <p>In the first [13], we aim to develop and validate DL models to perform Virtual Contrast Enhancement (VCE) in Contrast Enhanced Spectral Mammography (CESM). CESM is a diagnostic imaging technique for breast cancer that, unlike standard mammography, involves the injection of an iodinated contrast medium, which diffuses into the tumor tissue and enhances lesion visibility. Although this results in improved diagnostic accuracy, especially in patients with dense parenchymal tissue, and although it is a more affordable and accessible alternative to contrast-enhanced MRI, CESM has two main weaknesses. One is a biological risk due to ionizing radiation, whilst the other refers to possible reactions to iodinated contrast agents, such as nephropathy. CESM acquires a low-energy (LE) and a high-energy (HE) image in quick succession for each breast: the LE image is equivalent to standard mammography, whereas the HE image is post-processed with the corresponding LE image to compute the recombined image, which suppresses parenchymal tissue so that only areas of contrast enhancement are visible. Our VCE task asks AI models to output a virtually recombined image having only the LE image as input. To this end, we compared three DL models: an autoencoder, a CycleGAN, and the Pix2Pix. Results on a public [14] and on a private dataset, with 1003 and 105 patients’ scans respectively, show that the CycleGAN is able to effectively perform the image-to-image translation task in the lesion areas, suggesting VCE is a viable alternative that deserves more research efforts. Therefore, VCE can be useful in a metaverse for intelligent healthcare, since it can reduce the need for the injection of an iodinated contrast medium and it can eliminate the need for undergoing double radiation. This would make imaging procedures safer, and it would reduce the time and cost associated with obtaining diagnostic images.</p>
      <p>Turning our attention to the second direction, we aim to develop and validate DL models to perform MR-to-CT image translation, i.e., to generate synthetic CT (sCT) images from real MR images. While CT is required for treatment planning, due to its excellent anatomic localisation, it is often complemented with MR imaging because of its superior soft-tissue contrast. However, taking multiple images can be cost-prohibitive, burdensome to the patient, and problematic in light of CT ionization risk. For these reasons, MR-only treatment planning has become an attractive alternative, and one of its main tasks is MR-to-CT image translation. Many DL models have been developed for this task, using convolutional neural networks to generate sCT images by minimizing pixel-wise differences to reference CT volumes. More recently, GAN models have been used to solve this task: the pix2pix model has obtained very promising results when co-registered MR and CT images are available, and the CycleGAN model has been used to handle the case with unpaired MR and CT images. One of the main issues, however, has consistently been the limited access to large datasets: most ML tasks in medical imaging suffer from a lack of data and/or a lack of manually annotated data. To manage this issue, data augmentation has been widely utilized to improve generalization and robustness when training deep neural networks. We developed a novel strategy for latent data augmentation, exploiting the learned internal latent representation of a StyleGAN2 model, to generate medical images of sufficiently high resolution and quality that can be used to augment the real patient data available to develop a model that generates sCT images from MR images. The StyleGAN2 model allows us to generate multiple imaging modalities (i.e., MR and CT) of the same underlying synthetic patient. A number of augmented/generated images are used together with a number of real images (the fraction decided by the analyst) to train the MR-to-CT translation model. The StyleGAN2 model is able to generate highly realistic synthetic images, but it does not provide any control over the generated images. We therefore developed a novel generative procedure that has two main components: 1) an improved inverse generator, which embeds real MR and CT images in the StyleGAN2 model’s learnt latent space by enforcing the resulting regenerated images to retain the semantic properties of the original real images at multiple levels; 2) a regularisation term controlling the distance in the latent space of a new purely generated image to the real images, which allows the analyst to tune the variability in the generated images. In particular, the proposed approach allows us to maximize the diversity of the data sampled from the GAN manifold while retaining their realness, i.e., the synthesized images will be plausible datapoints from the learned real manifold (of real CT-MR images). A comparative analysis between the performance of the pix2pix model trained with and without the data augmentation procedure shows a reduction of both scores when using latent augmentation. Thus, latent data augmentation can be used to develop better MR-to-CT translation models, allowing the doctor to inspect multiple image modalities without the need for invasive procedures.</p>
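      <p>As a simplified stand-in for the variability-control component (the inverse generator and our actual regularisation term are not shown, and interpolation toward the nearest real latent is swapped in purely for illustration), the sketch below draws a new latent and pulls it toward the latents of embedded real patients, with a coefficient tuning the distance to the real manifold.</p>
      <preformat>
# Illustrative sketch of regularised latent sampling for augmentation.
# The generator call is a placeholder for a pretrained StyleGAN2 that
# emits paired MR/CT images of one synthetic patient.
import torch

def sample_augmented_latent(real_w, alpha=0.3):
    """real_w: (n_real, w_dim) latents of real images embedded by an
    inverse generator; alpha in [0, 1] tunes proximity to the real
    manifold: 0 keeps the random sample, 1 snaps to the nearest real latent."""
    w_new = torch.randn(1, real_w.shape[1])
    nearest = real_w[torch.cdist(w_new, real_w).argmin()]
    return (1 - alpha) * w_new + alpha * nearest

real_w = torch.randn(100, 512)                # placeholder embedded latents
w_aug = sample_augmented_latent(real_w, alpha=0.5)
# mr_img, ct_img = stylegan2_generator(w_aug)  # hypothetical paired output
      </preformat>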
      <p>In the context of sCT generation, another pillar of our research focuses on developing denoising techniques with the aim of reducing the dose while maintaining the quality of the virtual scanner provided to the clinician. Indeed, under the ALARA (As Low As Reasonably Achievable) principle, the use of low-dose (LD) acquisition protocols has become the clinical practice to be preferred over high-dose (HD) protocols. However, if on the one side a reduction of radiation can be achieved, on the other, the overall quality of the reconstructed CT decreases. Thereby, a trade-off between dose and noise level must be tackled to ensure sufficient diagnostic image quality. This is the reason why a lot of effort has been put into investigating denoising strategies, trying to obtain high-quality CT images at the lowest cost in terms of radiation.</p>
      <p>To tackle this task, we developed a texture-based loss function to be included in the objective of the CycleGAN during training from LDCT to HDCT images. We move from the hypothesis that the noise due to LD protocols has a textural nature; thus, a texture-based loss is beneficial during training, allowing a better denoising quality and faster training.</p>
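      <p>Since the exact formulation is not spelled out here, the snippet below sketches one common way to encode texture statistics, via Gram matrices of feature maps, as a loss term that could be added to a CycleGAN objective; this is an assumption for illustration, not our loss.</p>
      <preformat>
# Sketch of a Gram-matrix texture loss: matching second-order feature
# statistics between denoised and high-dose images penalises residual
# textural noise. Feature maps would come from any fixed CNN.
import torch

def gram_matrix(feat):
    # feat: (batch, channels, h, w) feature maps
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def texture_loss(feat_denoised, feat_hd):
    return (gram_matrix(feat_denoised) - gram_matrix(feat_hd)).pow(2).mean()

fake_hd_feats = torch.randn(2, 16, 32, 32, requires_grad=True)
real_hd_feats = torch.randn(2, 16, 32, 32)
loss = texture_loss(fake_hd_feats, real_hd_feats)
loss.backward()
      </preformat>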
      <p>Overall, virtual scanners and AI image translation models have the potential to revolutionize healthcare by making it more efficient, less invasive, and more personalized.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Virtual Surgery</title>
      <p>Robotic surgery has revolutionized complex medical
applications such as minimally invasive, orthopedic, brain,
and radiotherapy surgeries. With the adoption of robotic
surgery, automation in the surgical domain has become a
crucial topic, presenting potential benefits like improved
consistency, increased dexterity, and access to standard
techniques [15].</p>
      <p>Path planning of surgical robots’ end effectors is a promising area for automation, which can automate surgical tasks or provide suggestions to the surgeon. However, classical path planning algorithms have limitations in dealing with dynamic environments like the surgical one. In this respect, early attempts in this field have proven the feasibility and the potential impact of automation, but also the limitations of classical path planning algorithms. Reinforcement learning, even exploiting deep architectures, could be a viable alternative for path planning, helping to overcome such limitations.</p>
      <p>For this reason, we are developing a deep reinforcement learning (DRL) framework that can plan an optimal path from the entry point to the target point within the patient, and update it during surgery using the endoscopic camera feedback. Our study is divided into two subproblems: global and local. The global subproblem involves planning the optimal trajectory, given information about the entry point, the target point, and a digital twin of the patient obtained from CT/MRI scans. Conversely, the local subproblem focuses on updating the path based on the endoscopic video feed.</p>
      <p>To tackle both subproblems, we developed a simulation environment that generates a digital twin of the patient from CT/MRI scans and reconstructs a 3D model of the surgical scene from monocular endoscopic images. The DRL agent is then trained to plan a globally optimal trajectory and to update the local trajectory in real-time from the video feedback.</p>
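      <p>As an illustration of how the global subproblem can be cast as a reinforcement-learning task (a toy sketch, not our framework), the snippet below wraps a voxelized digital twin as an environment with collision penalties and a target-reaching reward; any DRL agent can then be plugged in.</p>
      <preformat>
# Toy environment: a CT/MRI-derived digital twin becomes a 3D occupancy
# grid, the agent steps between voxels, and the reward favours reaching
# the target while avoiding anatomy marked as forbidden.
import numpy as np

class DigitalTwinEnv:
    def __init__(self, occupancy, entry, target):
        self.grid, self.entry, self.target = occupancy, entry, target
        self.pos = np.array(entry)

    def reset(self):
        self.pos = np.array(self.entry)
        return self.pos.copy()

    def step(self, action):
        # action: one of 6 axis-aligned voxel moves
        moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                 (0, -1, 0), (0, 0, 1), (0, 0, -1)]
        nxt = np.clip(self.pos + moves[action], 0, np.array(self.grid.shape) - 1)
        if self.grid[tuple(nxt)]:          # collision with anatomy
            return self.pos.copy(), -1.0, False
        self.pos = nxt
        done = bool((self.pos == self.target).all())
        return self.pos.copy(), 10.0 if done else -0.01, done

occ = np.zeros((16, 16, 16), dtype=bool)   # placeholder digital twin
env = DigitalTwinEnv(occ, entry=(0, 0, 0), target=(15, 15, 15))
obs = env.reset()
obs, reward, done = env.step(0)
      </preformat>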
        <p>Potential future applications of this DRL framework
into the metaverse are straightforward. For example, in
augmented reality, the DRL agent could be utilized to
provide suggestions to the surgeon on how to plan the
next move in the surgical environment, by overlaying the
DRL-generated path onto the surgeon’s field of vision.</p>
      <p>Furthermore, the simulation environment and the DRL agent can also be used to train surgeons on different cases in a safe and controlled environment.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge: FONDO PER LA CRESCITA SOSTENIBILE (F.C.S.) Bando Accordo Innovazione DM 24/5/2017 (Ministero delle Imprese e del Made in Italy), under the project entitled “Piattaforma per la Medicina di Precisione. Intelligenza Artificiale e Diagnostica Clinica Integrata” (CUP B89J23000580005); the PNRR MUR project PE0000013-FAIR; the project n. F/130096/01-05/X38, Fondo per la Crescita Sostenibile - ACCORDI PER L’INNOVAZIONE DI CUI AL D.M. 24 MAGGIO 2017, Ministero dello Sviluppo Economico (Italy); Programma Operativo Nazionale (PON) “Ricerca e Innovazione” 2014-2020 CCI2014IT16M2OP005 Azione IV.4; Regione Lazio PO FSE 2014-2020 Avviso Pubblico “Contributi per la permanenza nel mondo accademico delle eccellenze” Asse III - Istruzione e formazione - Priorità di investimento 10 ii) - Obiettivo specifico 10.5 Azione Cardine 21.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><label>[1]</label><mixed-citation>G. Wang, A. Badal, X. Jia, J. S. Maltz, K. Mueller, K. J. Myers, C. Niu, M. Vannier, P. Yan, Z. Yu, et al., Development of metaverse for intelligent healthcare, Nature Machine Intelligence (2022) 1-8.</mixed-citation></ref>
      <ref id="ref2"><label>[2]</label><mixed-citation>T. Baltrušaitis, et al., Multimodal machine learning: A survey and taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018) 423-443.</mixed-citation></ref>
      <ref id="ref3"><label>[3]</label><mixed-citation>V. Guarrasi, N. C. D’Amico, R. Sicilia, E. Cordelli, P. Soda, Pareto optimization of deep networks for covid-19 diagnosis from chest x-rays, Pattern Recognition 121 (2022) 108242.</mixed-citation></ref>
      <ref id="ref4"><label>[4]</label><mixed-citation>V. Guarrasi, N. C. D’Amico, R. Sicilia, E. Cordelli, P. Soda, A multi-expert system to detect covid-19 cases in x-ray images, in: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2021, pp. 395-400.</mixed-citation></ref>
      <ref id="ref5"><label>[5]</label><mixed-citation>C. M. Caruso, V. Guarrasi, E. Cordelli, R. Sicilia, S. Gentile, L. Messina, M. Fiore, C. Piccolo, B. Beomonte Zobel, G. Iannello, et al., A multimodal ensemble driven by multiobjective optimisation to predict overall survival in non-small-cell lung cancer, Journal of Imaging 8 (2022) 298.</mixed-citation></ref>
      <ref id="ref6"><label>[6]</label><mixed-citation>V. Guarrasi, P. Soda, Multi-objective optimization determines when, which and how to fuse deep networks: An application to predict covid-19 outcomes, Computers in Biology and Medicine 154 (2023) 106625.</mixed-citation></ref>
      <ref id="ref7"><label>[7]</label><mixed-citation>N. C. D’Amico, R. Sicilia, E. Cordelli, L. Tronchin, C. Greco, M. Fiore, A. Carnevale, G. Iannello, S. Ramella, P. Soda, Radiomics-based prediction of overall survival in lung cancer using different volumes-of-interest, Applied Sciences 10 (2020) 6425.</mixed-citation></ref>
      <ref id="ref8"><label>[8]</label><mixed-citation>G. Joshi, et al., A review on explainability in multimodal deep neural nets, IEEE Access 9 (2021) 59800-59821.</mixed-citation></ref>
      <ref id="ref9"><label>[9]</label><mixed-citation>V. Guarrasi, L. Tronchin, D. Albano, E. Faiella, D. Fazzini, D. Santucci, P. Soda, Multimodal explainability via latent shift applied to covid-19 stratification, arXiv preprint arXiv:2212.14084 (2022).</mixed-citation></ref>
      <ref id="ref10"><label>[10]</label><mixed-citation>P. Soda, N. C. D’Amico, J. Tessadori, G. Valbusa, V. Guarrasi, C. Bortolotto, M. U. Akbar, R. Sicilia, E. Cordelli, D. Fazzini, et al., AIforCOVID: Predicting the clinical outcomes in patients with covid-19 applying AI to chest-x-rays. An Italian multicentre study, Medical Image Analysis 74 (2021) 102216.</mixed-citation></ref>
      <ref id="ref11"><label>[11]</label><mixed-citation>N. Truong, et al., Privacy preservation in federated learning: An insightful survey from the GDPR perspective, Computers &amp; Security 110 (2021) 102402.</mixed-citation></ref>
      <ref id="ref12"><label>[12]</label><mixed-citation>C. M. Caruso, P. Soda, V. Guarrasi, A deep learning approach for overall survival analysis with missing data, submitted to: 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS) (2023).</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>A. Rofena, P. Soda, V. Guarrasi, A deep learning approach for virtual contrast enhancement in contrast enhanced spectral mammography (CESM), submitted to: 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS) (2023).</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>R. Khaled, et al., Categorized contrast enhanced mammography dataset for diagnostic and artificial intelligence research, Scientific Data 9 (2022) 122.</mixed-citation></ref>
      <ref id="ref15"><label>[15]</label><mixed-citation>P. Fiorini, Automation and autonomy in robotic surgery, 2021.</mixed-citation></ref>
    </ref-list>
  </back>
</article>