<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Updating the SynthDNASim tool to create diverse synthetic DNA datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Caitlin Jenster</string-name>
          <email>caitlin@4medbox.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rick Overkleeft</string-name>
          <email>rick@4medbox.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Núria Queralt-Rosinach</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>4MedBox Nederland B.V.</institution>
          ,
          <addr-line>Kanaalpark 157, Leiden, 2321 JW</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leiden University Medical Center</institution>
          ,
          <addr-line>Albinusdreef 2, Leiden, 2333 ZA, Netherlands</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Applied Sciences Leiden</institution>
          ,
          <addr-line>Zernikedreef 11, Leiden, 2333 CK</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In biomedical research, it is common to perform numerous analyses of genomic data, for example, to understand the cause of a particular disease. Regulations protect the privacy of individuals but also hinder access to genomic data. One solution is to develop bioinformatic tools that create synthetic DNA data. A key challenge is to capture genomic diversity that is representative of differences within and between populations, especially for rare genetic diseases. In this study, we present SynthDNASim, a tool for creating diverse synthetic DNA datasets. Our approach takes factors of genetic evolution and ancestry into account, with Huntington's disease (HD) as a use case, in particular HD variants from European, African, and Middle Eastern populations. We show our tool and our future plans for applying semantic methods and tools to make SynthDNASim more FAIR (Findable, Accessible, Interoperable, Reusable).</p>
      </abstract>
      <kwd-group>
        <kwd>Diverse synthetic DNA dataset</kwd>
        <kwd>privacy</kwd>
        <kwd>Huntington's Disease</kwd>
        <kwd>evolution</kwd>
        <kwd>ancestry</kwd>
        <kwd>FAIR</kwd>
        <kwd>semantics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In biomedical research, it is common to perform numerous analyses of genomic data, for example, to
understand the cause of a particular disease or genetic process, or to identify gene variants. One of
the difficulties in these analyses is the collection, storage, use, and reuse of genomic data, because an
individual's genomic data is private. In particular, if an individual has a rare disease such as Huntington’s
disease (HD), it is theoretically possible to trace the DNA back to that individual. Regulations therefore
protect the privacy of these individuals, but they also hinder access to genomic data. One
solution is to develop bioinformatic tools that create diverse synthetic DNA data, so that
biomedical researchers can generate synthetic DNA data and make research faster and
more reproducible. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] A possible issue with creating a synthetic DNA dataset is that it needs to be
diverse enough to be representative of different populations. A single disease can have many different
genetic characteristics because of differences within and between populations and because genetic diseases
are characterized by their phenotype (symptoms). Thus, factors of genetic evolution and ancestry need
to be taken into account when creating a diverse DNA dataset. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] In this study, we present
SynthDNASim, a tool for creating diverse synthetic DNA datasets. Our approach is to create diverse
DNA datasets with HD as a use case, in particular with HD variants from European, African, and
Middle Eastern populations. We show our workflow and our future plans for validating the synthetic DNA
datasets. The FAIR principles and semantics will be applied within this project to make the tool
understandable, reusable, reviewable, and open source. HD is a rare hereditary disease that
causes degeneration of nerve cells in the brain. As a result, HD has a great impact on the
functional abilities of an individual, resulting in movement, cognitive, and mental disorders. HD is
caused by an expanded CAG repeat within the huntingtin gene (HTT). [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
      </p>
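      <p>The CAG repeat mentioned above can be made concrete with a short example. The following is a minimal Python sketch and not part of SynthDNASim itself: the function name <monospace>longest_cag_run</monospace> and the toy sequence are hypothetical and serve only to illustrate how the length of an uninterrupted CAG tract could be counted in a DNA string.</p>
      <preformat>
```python
import re


def longest_cag_run(sequence: str) -> int:
    """Return the number of repeats in the longest uninterrupted CAG tract."""
    # Find every maximal run of back-to-back CAG triplets (case-insensitive input).
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    # Each triplet is three bases long; an empty list yields zero repeats.
    return max((len(run) // 3 for run in runs), default=0)


# Toy sequence (hypothetical, not real HTT data): five CAG repeats between flanks.
toy_sequence = "ATGGCG" + "CAG" * 5 + "CCGCCA"
print(longest_cag_run(toy_sequence))
```
</preformat>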
    </sec>
    <sec id="sec-2">
      <title>2. SynthDNASim tool</title>
      <p>Figure 1 illustrates the SynthDNASim pipeline. The first step is the retrieval and pre-processing
of genomic information from different data sources: the National Center for Biotechnology Information
(NCBI) for the SNP variants and for the sequence of chromosome 4, and the user input. Next,
a sequence of steps creates the synthetic DNA sequences per population. Python is used to collect the
user input and to create the configuration file (a JSON file). Each sequence has its own metadata, including
haplotype, genetic variants, CAG repeats, gene, and chromosome.</p>
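      <p>The steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the SynthDNASim implementation: the function names, the config keys, the tuple layout of a SNP (zero-based position, alternate base, alternate allele frequency), and the toy reference sequence are all hypothetical. Haplotype and CAG repeat handling are omitted for brevity.</p>
      <preformat>
```python
import json
import os
import random
import tempfile


def write_config(path, populations, n_sequences, seed):
    """Write hypothetical user input to a JSON config file and return it."""
    config = {"populations": populations, "n_sequences": n_sequences, "seed": seed}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config


def apply_snps(reference, snps, rng):
    """Return a copy of the reference with alternate alleles drawn per SNP.

    Each SNP is (zero-based position, alternate base, alternate allele frequency).
    """
    bases = list(reference)
    carried = []
    for position, alt, frequency in snps:
        # Carry the alternate allele with probability equal to its frequency.
        if frequency > rng.random():
            bases[position] = alt
            carried.append({"position": position, "alt": alt})
    return "".join(bases), carried


def simulate(config, reference, snps_by_population):
    """Create synthetic sequences per population, each with its own metadata."""
    rng = random.Random(config["seed"])  # seeded for reproducible output
    records = []
    for population in config["populations"]:
        for index in range(config["n_sequences"]):
            sequence, carried = apply_snps(reference, snps_by_population[population], rng)
            records.append({
                "id": f"{population}-{index}",
                "population": population,
                "chromosome": "4",
                "gene": "HTT",
                "variants": carried,
                "sequence": sequence,
            })
    return records


# Toy inputs (assumptions): a 10-base reference and made-up SNPs per population.
REFERENCE = "ACGTACGTAC"
SNPS = {
    "EUR": [(2, "A", 1.0), (5, "T", 0.0)],
    "AFR": [(7, "C", 1.0)],
}

with tempfile.TemporaryDirectory() as folder:
    config_path = os.path.join(folder, "config.json")
    write_config(config_path, ["EUR", "AFR"], n_sequences=2, seed=42)
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)

records = simulate(config, REFERENCE, SNPS)
print(records[0]["id"], records[0]["sequence"])
```
</preformat>
      <p>Seeding the random number generator from the config file is one way to keep a generated dataset reproducible. In a real run, the allele frequencies would come from the retrieved SNP data rather than from hand-written values.</p>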
    </sec>
    <sec id="sec-3">
      <title>3. Future work</title>
      <p>
        The remaining work in this project is to validate the generated data and to use semantic
methods and tools to make the project more FAIR. One option is to create metadata for both the
output data and the tool. The Data Catalog Vocabulary (DCAT) can be used to create the output
metadata. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
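      <p>As an illustration of what DCAT output metadata might look like, the sketch below builds a minimal dataset description as JSON-LD using only the Python standard library. It is a hedged example rather than the project's planned metadata model: the function name, the identifier, the title, and the example.org URLs are placeholders, and only a small subset of DCAT and Dublin Core terms is used.</p>
      <preformat>
```python
import json


def dcat_dataset(identifier, title, description, keywords, download_url):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": identifier,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": keywords,
        # One distribution pointing at the file that holds the generated data.
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url},
        },
    }


# Placeholder values (assumptions): none of these identifiers or URLs exist.
record = dcat_dataset(
    identifier="https://example.org/datasets/synthetic-hd-demo",
    title="Synthetic HD sequences (demo)",
    description="Synthetic DNA sequences generated for illustration.",
    keywords=["synthetic DNA", "Huntington's disease"],
    download_url="https://example.org/files/synthetic-hd-demo.fasta",
)
print(json.dumps(record, indent=2))
```
</preformat>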
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgements</title>
      <p>We want to thank Alex Stikkelman and the 4MedBox team for all their support and help. We also
want to thank Ivo Fokkema and the Biosemantics group at the LUMC for their input and help in
this project. This project received funding from 4MedBox. N. Queralt-Rosinach is supported by
funding from the European Union’s Horizon 2020 research and innovation programme under the EJP RD
COFUND-EJP N° 825575 and by a grant from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 847826 (Brain Involvement iN Dystrophinopathies
(BIND)). We would like to thank the EJP RD and BIND for supporting research on generating
synthetic health data for rare disease research.</p>
    </sec>
    <sec id="sec-5">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Walonoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nichols</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Quina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Moesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Duffett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dube</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McLachlan</surname>
          </string-name>
          ,
          <article-title>Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record</article-title>
          ,
          <source>J. Am. Med. Inform. Assoc.</source>
          <volume>25</volume>
          .3 (
          <year>2017</year>
          )
          <fpage>230</fpage>
          -
          <lpage>238</lpage>
          . doi:10.1093/jamia/ocx079.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Squitieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mazza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maffi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>De Luca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>AlSalmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>AlHarasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Baine-Savanhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Landwhermeyer</surname>
          </string-name>
          , et al.,
          <article-title>Tracing the mutated HTT and haplotype of the African ancestor who spread Huntington disease into the Middle East</article-title>
          ,
          <source>Genet. Med.</source>
          <volume>22</volume>
          .11 (
          <year>2020</year>
          )
          <fpage>1903</fpage>
          -
          <lpage>1908</lpage>
          . doi:10.1038/s41436-020-0895-1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <article-title>Huntingtin in health and disease</article-title>
          ,
          <source>J. Clin. Investig.</source>
          <volume>111</volume>
          .3 (
          <year>2003</year>
          )
          <fpage>299</fpage>
          -
          <lpage>302</lpage>
          . doi:10.1172/jci17742.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Albertoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Browning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cox</surname>
          </string-name>
          , et al.,
          <article-title>Data Catalog Vocabulary (DCAT) - Version 3</article-title>
          ,
          <source>W3.org</source>
          , published January 18,
          <year>2024</year>
          , accessed February 7, 2024. https://www.w3.org/TR/vocab-dcat-3/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>