=Paper= {{Paper |id=Vol-3890/paper-31 |storemode=property |title=DATOS-CAT: Methodologies for the standardization, integration and analysis of population-based biomedical data using semantic technologies |pdfUrl=https://ceur-ws.org/Vol-3890/paper-31.pdf |volume=Vol-3890 }} ==DATOS-CAT: Methodologies for the standardization, integration and analysis of population-based biomedical data using semantic technologies== https://ceur-ws.org/Vol-3890/paper-31.pdf
DATOS-CAT: Methodologies for the standardization,
integration and analysis of population-based
biomedical data using semantic technologies
Judith Martinez-Gonzalez1,2,*,† , Guillem Bracons Cucó1,† , Aikaterini Lymperidou1,3 ,
Xavier Escribà-Montagut4 , Marta Huertas1,5 , Ramon Mateo-Navarro1,4 ,
David Sarrat-González4 , Santiago Frid6 , Rafael de Cid3 , Juan R González4 and
Alberto Labarga2
1
  Institute for Bioengineering of Catalonia, Barcelona, Spain
2
  Barcelona Supercomputing Center, Barcelona, Spain
3
  Genomes for Life- GCAT lab- Germans Trias i Pujol Research Institute, Badalona, Spain
4
  Barcelona Institute for Global Health, Barcelona, Spain
5
  Centre for Genomic Regulation, Barcelona, Spain
6
  Hospital Clínic de Barcelona, Barcelona, Spain



1. Introduction
In the context of personalized medicine, long-term data collection allows researchers to track
how diseases progress over time, identify patterns of environmental and genetic risk, and
assess the impact of different treatment strategies. The project DATOS-CAT aims to increase
the visibility and scientific impact of the population-based cohorts developed in Catalonia,
GCAT|Genomes for life and the COVICAT-CONTENT sub-cohort, and to contribute to the
development of procedures applicable to other cohorts, improving the level of interoperability
of their data in the context of the FAIR data ecosystem principles (Findable, Accessible, Interop-
erable, Reusables) to facilitate their exploitation and scientific use. As a basis for development,
the European Genome-Phenome Archive (EGA) infrastructure will be used for the genomic
data, and the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM)
compatible standards will be used for working with structured clinical data.


SWAT4HCLS ’24: The 15th International SWAT4HCLS conference, February 26-29, 2024, Leiden, The Netherlands.
*
  Corresponding author.
†
  These authors contributed equally.
$ judith.martinez@bsc.es (J. Martinez-Gonzalez); bracons@recerca.clinic.cat (G. Bracons Cucó);
alymperidou@igtp.cat (A. Lymperidou); xavier.escriba@isglobal.org (X. Escribà-Montagut); marta.huertas@crg.eu
(M. Huertas); ramon.mateo@isglobal.org (R. Mateo-Navarro); david.sarrat@isglobal.org (D. Sarrat-González);
frid@clinic.cat (S. Frid); rdecid@igtp.cat (R. d. Cid); juanr.gonzalez@isglobal.org (J. R. González); alabarga@bsc.es
(A. Labarga)
 0009-0009-9441-5971 (J. Martinez-Gonzalez); 0000-0003-1274-7403 (G. Bracons Cucó); 0009-0002-3425-2188
(A. Lymperidou); 0000-0003-2888-8948 (X. Escribà-Montagut); 0009-0000-9334-108X (R. Mateo-Navarro);
0000-0002-9064-3303 (D. Sarrat-González); 0000-0001-8400-5770 (S. Frid); 0000-0003-3579-6777 (R. d. Cid);
0000-0003-3267-2146 (J. R. González); 0000-0001-6781-893X (A. Labarga)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
2. Cohorts
The GCAT project has been following 20.000 individuals living in Catalunya since 2014. Exten-
sive clinical, genomic and lifestyle data has been collected, set up to integrate and assess the role
of the environmental and genetic factors (i.e, genomic, metabolomic, proteomic, epigenomic) in
the development of chronic diseases. The project aims to assess the prevalence of risk factors and
their association with disease incidence over time. The COVICAT-CONTENT is a sub-cohort
involving 11.833 participants, mostly derived from the matrix GCAT cohort. Created in 2020
with the specific aim of monitoring COVID-19, the final objective was to evaluate determinants,
ultimately contributing to the definition of predictive models and control policies for similar
situations in the future.


3. Methods
The OMOP-CDM is intended to be used as a standardized framework for data representation and
analysis. The OMOP-CDM is a person-centric model, which establishes a common vocabulary
and data format, allowing for seamless integration across disparate heterogeneous sources.
   Building an OMOP-CDM database requires Extract-Transform-Load (ETL) processes that
facilitate the data conversion to a standard model and terminology. ETL plays an important role
in ensuring data consistency and integrity through the integration process, enabling a smooth
transition from heterogeneous datasets to a coherent and standardized structure. DATOS-CAT
project evaluates two different ETL methodologies: a traditional ETL and a semantic ETL.
   On one hand, the traditional ETL framework involves sequential steps. First, the Extract phase
allows us to gather raw data from multiple sources, such as EHR. General information from
the database can be extracted using the White Rabbit software. Once extracted, the Transform
phase allows the standardization and harmonization of diverse data formats, terminologies
and structures into the OMOP-CDM, where the data is restructured in tables and mapped to
Standardized Vocabularies. In this second phase, Rabbit-in-a-Hat software can be useful for
mapping variables to the corresponding tables and columns of the OMOP-CDM. Based on the
resulting mapping, a specific SQL transformation process can be implemented to transform raw
data into the desired format. This last point usually involves mapping source-specific codes
(e.g., ICD-10, RxNorm, etc.) to OMOP standard terminologies. After the transformation, the
Load phase allows to populate the unified database to a OMOP-CDM compliant format. Each
of the steps is crucial for effectively converting heterogeneous raw data into a standardized
OMOP format.
   On the other hand, having standardized data with their semantic meaning and syntactic
structure is key for conducting scalable and interoperable secondary data studies. To achieve
such data normalization, the Ontobridge tool has been developed. Ontobridge is an ontology-
based tool, created by the Hospital Clínic de Barcelona, that transforms local databases to CDMs.
It uses semantic mappings of data between an ontology representing the local data model, and
one or more ontologies, representing data standards to transform local databases into standards
(OMOP-CDM on our use case), in a secure, fast, scalable manner, that maintains the semantics
and syntactic structure of the data. Specifically, Ontobridge uses Ontop to replicate relational
data into RDFs through R2RML mappings, and then inserts the triples into an ontology that
represent the local data model. Local concepts are mapped to standard ones by means of the
owl:sameAs property to allow for semantic interoperability, while the syntactic equivalence is
performed by defining that local properties to be instances of metaclasses that model attributes
of the common data models. These ontologies are loaded into a Jena Fuseki server, and the
corresponding SPARQL queries are executed to generate the OMOP-CDM tables.
   On this project, we examine the adaptability, efficiency and scalability of each ETL approach
in achieving data consistency and integrity.


4. Results
The standardization of both cohorts is indispensable, as it offers more robust pathways for
collaborative researchers and facilitates their integration to global genomic and clinical data
repositories. During this process, ETL plays a crucial role since it enables the transformation of
raw heterogeneous data from GCAT and COVICAT-CONTENT into standardized and consistent
formats. At the end, the integration empowers researchers to access direct sources of information,
enabling a deeper understanding of how genetic profiles relate to clinical outcomes, unlocking
potential breakthrough discoveries and providing new knowledge that can significantly influence
health care practices around the world. DATOS-CAT seeks to contribute to making these cohorts
more competitive at a global level, by connecting them with direct sources of genomic and
clinical information, ultimately resulting in a greater benefit for society. The proposal also
contains the elements to align and collaborate with the efforts carried out by the IMPaCT
programme and also with other European parallel developments.