Achieving data FAIRification in a distributed analytics
research platform for rare diseases
Anna Bernasconi1,∗ , Cinzia Cappiello1 , Stefano Ceri1 and Pietro Pinoli1
1
    Department of Electronics, Information and Bioengineering – Politecnico di Milano, Milan, Italy


                                        Abstract
                                        Data-driven medicine is fundamental to improving the accessibility and quality of the healthcare system.
                                        The availability of data is crucial for this purpose. In the context of a distributed analytics platform for
                                        analyzing healthcare data – employing the Personal Health Train paradigm – we propose to implement
                                        a solid data FAIRification infrastructure. This will allow us to achieve findability, accessibility, inter-
                                        operability, and reusability of data, metadata, and results within a network of several medical centers
                                        participating in the BETTER Horizon Europe project, where the study of rare diseases (such as intellectual
                                        disability and inherited retinal dystrophies) will be targeted. Impacts will be visible to a large population
                                        of healthcare practitioners, prospectively influencing health policymakers.

                                        Keywords
                                        FAIR principles, healthcare, distributed analytics, rare diseases


Data-driven medicine is a crucial research area for the achievement of a more high-quality
accessible healthcare system. Typically, the more data available for the intended analysis, the
higher the chance to achieve accurate results [1]. However, the amount of available patient data
is critical, especially in the context of rare diseases; here, even more predominantly than in other
diseases, data sets are available and usable only at single medical centers. Reasons for the lack
of data sharing are connected to ethical, legal, and privacy aspects and rules. Data centralization
is not a viable option due to privacy concerns, particularly within the European Union, where
the General Data Protection Regulation (GDPR) imposes stringent privacy standards.
   The Horizon Europe project BETTER (Better rEal-world healTh-daTa distributEd analytics
Research platform), started Dec. 1st, 2023, proposes the design and implementation of a
decentralized infrastructure that will allow us to exploit the full potential of large sets of multi-
source health data. This will be achieved by using customized AI tools to compare, integrate,
and analyze datasets in a secure as well as cost-effective fashion. The project will target
various use cases involving 7 European medical centers; they will provide sensitive patient data,
including, possibly, clinical reports, medical images, genomic data (whole-exome, whole-genome
sequences), biological data (cellular and molecular pathways), metabolic, environmental and
demographic data, patient interviews, forms, and therapies details. Only the secure information
will be made available and analyzed with a GDPR-compliant mechanism via a Distributed

The 15th Int. Semantic Web Applications and Tools for Health Care and Life Science conference (SWAT4HCLS 2024)
∗
    Corresponding author.
Envelope-Open anna.bernasconi@polimi.it (A. Bernasconi); cinzia.cappiello@polimi.it (C. Cappiello); stefano.ceri@polimi.it
(S. Ceri); pietro.pinoli@polimi.it (P. Pinoli)
Orcid 0000-0001-8016-5750 (A. Bernasconi); 0000-0001-6062-5174 (C. Cappiello); 0000-0003-0671-2415 (S. Ceri);
0000-0001-9786-2851 (P. Pinoli)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
Analytics paradigm called the Personal Health Train (PHT) [2]. PHT can be explained via
a railway system analogy that includes trains, stations, and train depots. The trains use the
network to visit different stations to transport several goods, which in this analogy correspond
to analytical tasks. By adapting this concept to BETTER, the analytical task is brought to the
data provider (i.e., a medical center), whereas the data instances remain in their original location
(called station).
   As a technical partner of the project, Politecnico di Milano (i.e., the authors) will particularly
focus on BETTER’s objective to guide medical centers in collecting patients’ data following
a common schema in order to promote interoperability and re-use of datasets in scope. This
includes legal/ethical data protection authorizations as well as data FAIRification (data docu-
mentation, cataloging, and mapping to well-established ontologies [3]). We will design a unified
schema repository for medical centers’ (meta)data integration, keeping a high abstraction level
to encourage maximum interoperability (see [4, 5]).
   Importantly, the project will aim at the integration of external sources such as European
Health Data Space (EHDS), the 1+Million Genomes initiative (1+MG), and the European Open
Science Cloud. Legal and ethical implications will be duly considered, and data access and
re-use procedures will be proposed. Data pseudonymization will be performed as a default
preprocessing step, mitigating the risk of personal data leaks. A real-world large-scale data
integration framework (based on well-established ontologies) will be demonstrated taking into
account heterogeneous datasets. The platform will be tested primarily on two rare disease use
cases, inherent to pediatric intellectual disability and inherited retinal dystrophies.
   In conclusion, the BETTER project relies on “bringing computation to data” via incremental
and federated learning, which avoids unnecessary data moving across medical centers while
exploiting much of the information encoded in such data. The project will enable EU medical
centers and beyond to make full use of the potential offered by a safe and secure exchange, use,
and reuse of health data fostered by robust data FAIRification. In the context of intellectual
disability and inherited retinal dystrophies – with the potential of expanding the same paradigm
to other diseases – the generated analytical tools will help healthcare professionals become
more proficient in cutting-edge digital technologies, data-driven decision support, health risk
surveillance, and control activities, monitoring and management of healthcare quality levels.
Acknowledgements. The work is supported by BETTER, Grant agreement 101136262.
References
[1] S. Welten, et al., DAMS: A distributed analytics metadata schema, Data Intelligence 3 (2021)
    528–547.
[2] O. Beyan, et al., Distributed analytics on sensitive medical data: the personal health train,
    Data Intelligence 2 (2020) 96–107.
[3] A. Bernasconi, et al., Ontology-driven metadata enrichment for genomic datasets, in:
    SWAT4HCLS 2018, volume 2275 of CEUR Workshop Proceedings, 2018.
[4] A. Bernasconi, et al., Conceptual modeling for genomics: building an integrated repository
    of open data, in: ER 2017, Springer, 2017, pp. 325–339.
[5] A. Bernasconi, et al., A review on viral data sources and search systems for perspective
    mitigation of COVID-19, Briefings in Bioinformatics 22 (2021) 664–675.