=Paper=
{{Paper
|id=Vol-3890/paper-29
|storemode=property
|title=Updating the SynthDNASim tool to create diverse synthetic DNA datasets
|pdfUrl=https://ceur-ws.org/Vol-3890/paper-29.pdf
|volume=Vol-3890
}}
==Updating the SynthDNASim tool to create diverse synthetic DNA datasets==
Caitlin Jenster (a, b, c), Rick Overkleeft (a, c) and Núria Queralt-Rosinach (c)
a 4MedBox Nederland B.V., Kanaalpark 157, Leiden, 2321 JW, Netherlands
b University of Applied Sciences Leiden, Zernikedreef 11, Leiden, 2333 CK, Netherlands
c Leiden University Medical Center, Albinusdreef 2, Leiden, 2333 ZA, Netherlands
Abstract
In biomedical research, it is common to perform numerous analyses of genomic data, for example,
to understand the cause of a particular disease. Regulatory laws protect the privacy of individuals
but hinder access to genomic data. One solution to this is the development of bioinformatic tools to
create synthetic DNA data. One of the challenges is to capture genomic diversity representative of
differences within and between populations, especially for rare genetic diseases. In this study, we
present SynthDNASim, a tool for creating diverse synthetic DNA datasets. Our approach is to
create diverse DNA datasets taking into account factors of genetic evolution and ancestry, with
Huntington’s disease (HD) as a use case, in particular with HD variants from European, African,
and Middle Eastern populations. We show our tool and our future plans for applying semantic
methods and tools to make SynthDNASim more FAIR (Findable, Accessible, Interoperable,
Reusable).
Keywords
Diverse synthetic DNA dataset, privacy, Huntington’s Disease, evolution, ancestry, FAIR,
semantics
1. Introduction
In biomedical research, it is common to perform numerous analyses of genomic data, for example, to
understand the cause of a particular disease or a genetic process, or to identify gene variants. One
difficulty in these analyses is the collection, storage, use, and reuse of genomic data, because an
individual's genomic data is private. For an individual with a rare disease such as Huntington’s
disease (HD), it is even theoretically possible to trace the DNA back to that individual. Regulatory
laws therefore protect the privacy of these individuals, but they also hinder access to genomic data.
One solution is to develop bioinformatic tools that create diverse synthetic DNA data, so that
biomedical researchers can generate the data they need and make research faster and
more reproducible [1]. A challenge in creating a synthetic DNA dataset is that it needs to be
diverse enough to be representative of different populations. A single disease can have many different
genetic characteristics because of differences within and between populations and because genetic
diseases are characterized by their phenotype (symptoms). Thus, factors of genetic evolution and
ancestry need to be taken into account when creating a diverse DNA dataset [2]. In this study, we
present SynthDNASim, a tool for creating diverse synthetic DNA datasets. Our approach is to create
diverse DNA datasets with HD as a use case, in particular with HD variants from European, African, and
Middle Eastern populations. We show our workflow and future plans for synthetic DNA dataset
validation. The FAIR principles and semantics will be applied within this project to make the tool
understandable, reusable, reviewable, and open-source. HD is a rare hereditary disease that
causes degeneration of nerve cells in the brain. As a result, HD has a great impact on the
functional abilities of an individual, resulting in movement, cognitive, and mental disorders. HD is
caused by an expanded CAG repeat within the Huntingtin gene (HTT) [3].
________________________
SWAT4HCLS 2024: The 15th International Conference on Semantic Web Applications and Tools for Health Care and Life Science,
February 26–29, 2024, Leiden, Netherlands
EMAIL: caitlin@4medbox.eu (A. 1); rick@4medbox.eu (A. 2); n.queralt_rosinach@lumc.nl (A. 3)
ORCID: 0009-0009-7863-0962 (A. 1); 0009-0004-3529-1159 (A. 2); 0000-0003-0169-8159 (A. 3)
©️ 2024 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
2. SynthDNASim tool
Figure 1 illustrates the SynthDNASim pipeline. The first step is the retrieval and pre-processing
of genomic information from the different data sources: the National Center for Biotechnology
Information (NCBI) for the SNP variants and for the sequence of chromosome 4, and the user input.
Next comes a sequence of steps that creates the synthetic DNA sequences per population. Python is
used to handle the user input and to create the configuration file (a JSON file). Each sequence has
its own metadata, including haplotype, genetic variants, CAG repeats, gene, and chromosome, among others.
Figure 1: SynthDNASim pipeline.
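The pipeline steps described above can be illustrated with a minimal Python sketch. It is not the SynthDNASim implementation: the function names, the layout of the JSON configuration, the 0-based SNP positions, and the toy reference string are all assumptions made for illustration, and a real run would start from the chromosome 4 sequence and SNP variants retrieved from NCBI.

```python
import json


def build_config(population, haplotype, cag_repeats, snps):
    """Collect the user input into a JSON-serialisable config (hypothetical layout)."""
    return {
        "population": population,
        "haplotype": haplotype,      # free-text label chosen by the user (assumption)
        "gene": "HTT",
        "chromosome": "4",
        "cag_repeats": cag_repeats,  # number of CAG units to place in the sequence
        "snps": snps,                # list of {"pos": int, "ref": str, "alt": str}
    }


def apply_snps(reference, snps):
    """Substitute each SNP's alternative allele at its 0-based reference position."""
    bases = list(reference)
    for snp in snps:
        pos = snp["pos"]
        if bases[pos] != snp["ref"]:
            raise ValueError(f"reference allele mismatch at position {pos}")
        bases[pos] = snp["alt"]
    return "".join(bases)


def insert_cag_tract(sequence, tract_start, n_repeats):
    """Insert n_repeats CAG units into the sequence at index tract_start."""
    return sequence[:tract_start] + "CAG" * n_repeats + sequence[tract_start:]


def generate(reference, tract_start, config):
    """Create one synthetic sequence and its metadata record from a config."""
    # SNPs are applied first so that their positions still refer to the reference.
    with_snps = apply_snps(reference, config["snps"])
    sequence = insert_cag_tract(with_snps, tract_start, config["cag_repeats"])
    metadata = {
        "population": config["population"],
        "haplotype": config["haplotype"],
        "genetic_variants": config["snps"],
        "cag_repeats": config["cag_repeats"],
        "gene": config["gene"],
        "chromosome": config["chromosome"],
    }
    return sequence, metadata


if __name__ == "__main__":
    # Toy values only; none of them are real HD variants or real coordinates.
    config = build_config("European", "example-haplotype", 3,
                          [{"pos": 4, "ref": "C", "alt": "T"}])
    config = json.loads(json.dumps(config))  # round-trip, as if read from a config file
    sequence, metadata = generate("ATGGCGACCCTG", 6, config)
    print(sequence)
    print(json.dumps(metadata, indent=2))
```

Keeping the metadata as a plain dictionary next to each sequence makes it straightforward to serialise later into a richer metadata format.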
3. Future work
The remaining work of this project is to validate the generated data and to use semantic
methods and tools to make the project more FAIR. One option is to create metadata for both the
output data and the tool. For the output metadata, the Data Catalog Vocabulary (DCAT) can
be used [4].
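As a rough idea of what DCAT output metadata could look like, the following Python sketch builds a JSON-LD record for one generated dataset using only the standard library. The DCAT and Dublin Core terms are taken from the vocabulary itself, but the function name, the choice of properties, and all example values (including the example.org download URL) are assumptions for illustration and are not part of SynthDNASim.

```python
import json


def dcat_dataset(title, description, keywords, download_url, media_type, license_url):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": keywords,
        "dct:license": {"@id": license_url},
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url},
            "dcat:mediaType": {"@id": media_type},
        },
    }


if __name__ == "__main__":
    # Hypothetical values; the download URL is a placeholder, not a real location.
    record = dcat_dataset(
        title="Synthetic HD DNA dataset (European population)",
        description="Synthetic DNA sequences generated for Huntington's disease.",
        keywords=["synthetic DNA", "Huntington's disease", "HTT"],
        download_url="https://example.org/synthdnasim/hd_european.fasta",
        media_type="https://www.iana.org/assignments/media-types/text/plain",
        license_url="https://creativecommons.org/licenses/by/4.0/",
    )
    print(json.dumps(record, indent=2))
```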
4. Acknowledgements
We want to thank Alex Stikkelman and the 4MedBox team for all their support and help. We also
want to thank Ivo Fokkema and the Biosemantics group at the LUMC for their input and help in
this project. This project received funding from 4MedBox. N. Queralt-Rosinach is supported by
funding from the European Union’s Horizon 2020 research and innovation program under the EJP RD
COFUND-EJP N° 825575 and by a grant from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 847826 (Brain Involvement iN Dystrophinopathies
(BIND)). We would like to thank the EJP RD and BIND for supporting research on generating
synthetic health data for rare disease research.
5. References
[1] J. Walonoski, M. Kramer, J. Nichols, A. Quina, C. Moesel, D. Hall, C. Duffett, K. Dube, T.
Gallagher, S. McLachlan, Synthea: An approach, method, and software mechanism for generating
synthetic patients and the synthetic electronic health care record, J. Am. Med. Inform. Assoc. 25.3
(2017) 230–238. doi:10.1093/jamia/ocx079.
[2] F. Squitieri, T. Mazza, S. Maffi, A. De Luca, Q. AlSalmi, S. AlHarasi, J. A. Collins, C. Kay, F.
Baine-Savanhu, B. G. Landwehrmeyer, et al., Tracing the mutated HTT and haplotype of the African
ancestor who spread Huntington disease into the Middle East, Genet. Med. 22.11 (2020) 1903–1908.
doi:10.1038/s41436-020-0895-1.
[3] A. B. Young, Huntingtin in health and disease, J. Clin. Investig. 111.3 (2003) 299–302.
doi:10.1172/jci17742.
[4] R. Albertoni, D. Browning, S. Cox, et al., Data Catalog Vocabulary (DCAT) - Version 3, W3.org,
published January 18, 2024, accessed February 7, 2024. URL: https://www.w3.org/TR/vocab-dcat-3/