FAIR for automatic federated omics analysis
                                Daphne Wijnbergen1,∗ , Georgios Malamas1 , Marco Roos1 and Eleni Mina1
                                1
                                    Human Genetics, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands


                                              Abstract
                                              In this work, we create a workflow to apply federated gene expression meta-analysis in the Virtual
                                              Platform of the EJP RD. Based on this workflow, we identify which metadata is needed to make the data
                                              machine actionable. We then present a metadata schema that is based on the EJP RD metadata schema
                                              and consists of scientific, biological and file metadata.

                                              Keywords
                                              FAIR, Federated analysis, metadata, machine actionability


                                1. Introduction
                                In the analysis of biomedical data, a large amount of time and effort is spent on finding datasets,
                                mapping identifiers, and data munging. An initiative that can help mitigate these issues is FAIR
                                [1]. With FAIR, machines can increasingly perform actions on data without human intervention,
                                if machine-actionable metadata is provided.
                                   Another factor that hinders the application of data analysis in biomedical research is privacy.
                                Human data, such as genomic data, is privacy sensitive and can not be fully anonymized.
                                Consequently, data often cannot be accessed and analyzed from outside the institute where
                                it was generated. Multiple efforts are ongoing to create infrastructures that enable federated
                                analysis of data. In this paradigm, an analysis method can be sent from one institute to the data
                                of another and executed, if approved. The results are then sent back to the first institute. This
                                ensures that the analysis can be performed, while privacy is preserved. One such effort is the
                                development of the “Virtual Platform” (VP) network of FAIR resources by the European Joint
                                Programme on Rare Diseases (EJP RD). Currently, various resources relevant for rare diseases
                                are FAIRified and connected within the VP. One goal of the EJP RD is to enable automated,
                                federated analysis over the resources in the VP.
                                   In our project, we created a workflow to apply federated analysis on omics data for rare
                                diseases. To achieve this, we have identified what metadata is needed to perform this analysis
                                by machines in an automated way.


                                SWAT4HCLS: 15th International SWAT4HCLS Conference, February 26–29, 2024, Leiden, The Netherlands

                                Envelope-Open d.wijnbergen@lumc.nl (D. Wijnbergen); yiorgos.malamas@student.uva.nl (G. Malamas); m.roos@lumc.nl
                                (M. Roos); e.mina@lumc.nl (E. Mina)
                                Orcid 0000-0002-7449-6657 (D. Wijnbergen); 0009-0001-9556-2158 (G. Malamas); 0000-0002-8691-772X (M. Roos);
                                0000-0002-8972-9206 (E. Mina)
                                            © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                                                CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
2. Methods
We implemented a workflow for gene expression analysis in Inclusion Body Myositis to serve
as the basis of our use case. This workflow consists of four main steps: (1) Identifying transcrip-
tomics datasets of interest (2) Applying differential gene expression analysis on these datasets (3)
Mapping identifiers between datasets for data integration, (4) applying meta-analysis (analysis
of multiple analysis results) to determine which genes are differentially expressed in multiple
studies in Inclusion Body Myositis.
   We identified what metadata is necessary for the data to be machine actionable for the purpose
of this use case. The VP metadata schema and an extension of the Data Catalog Vocabulary
(DCAT) [2] were analyzed and extended with metadata elements needed for our use case.


3. Results
We defined a metadata schema that extends the EJP RD and DCAT metadata schemas. This
schema contains metadata in three categories: 1.Scientific metadata such as the measurement
type, measurement device, and study design, that help find datasets that are measuring variables
of interest E.g. measurements of gene expression in case vs control. 2. Biological metadata such
as the disease, species and tissue, that are needed to select datasets that are biologically relevant
for the research question; e.g. Selecting datasets for Inclusion Body Myositis. 3. Metadata about
the data file itself, such as the download URL, the format, the media type, and a domain specific
file specification, that are needed for the machine to understand how to use the data.


4. Discussion
In this work, we created a workflow for detecting differential gene expression in various
transcriptomics datasets together with a metadata schema to make these datasets machine
actionable. Our work enables a machine to automatically run this workflow in a federated
manner on (privacy-sensitive) omics datasets for various rare diseases in the VP.


Acknowledgments
We would like to thank Luiz Bonino, Mark Wilkinson, Andra Waagmeester, Henriette Harmse,
Sunil Rodger, Alberto Camara Ballesteros, Wolmar Akerstrom, Eric Prud’Hommeaux and
Alexandra Tataru for helpful discussions. This initiative received funding from the EU Horizon
2020, grant agreements 825575 (EJP RD) and 824087 (EOSC-Life), and ELIXIR.


References
[1] M. D. Wilkinson et al., Comment: The FAIR guiding principles for scientific data manage-
    ment and stewardship, Scientific Data 3 (2016) 1–9. doi:10.1038/sdata.2016.18 .
[2] R. Albertoni et al., Data catalog vocabulary (DCAT) - version 2, 2020. URL: https://www.w3.
    org/TR/vocab-dcat-2/.