<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Linked Data Modeling Language (LinkML): A General-Purpose Data Modeling Framework Grounded in Machine-Readable Semantics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sierra Moxon</string-name>
          <email>smoxon@lbl.gov</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harold Solbrig</string-name>
          <email>solbrig@jhu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deepak Unni</string-name>
          <email>deepak.unni3@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dazhi Jiao</string-name>
          <email>djiao@jhu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Bruskiewich</string-name>
          <email>richard.bruskiewich@delphinai.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Balhoff</string-name>
          <email>balhoff@renci.org</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaurav Vaidya</string-name>
          <email>gaurav@ggvaidya.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Duncan</string-name>
          <email>wdduncan@lbl.gov</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshad Hegde</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Miller</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Brush</string-name>
          <email>matt@tislab.org</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nomi Harris</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Melissa Haendel</string-name>
          <email>melissa@tislab.org</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Mungall</string-name>
          <email>cjmungall@lbl.gov</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Molecular Biology Laboratory</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Johns Hopkins University</institution>
          ,
          <addr-line>Baltimore, MD</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lawrence Berkeley National Laboratory</institution>
          ,
          <addr-line>Berkeley, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>RENCI</institution>
          ,
          <addr-line>Chapel Hill, NC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Star Informatics</institution>
          ,
          <addr-line>Victoria, BC</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Colorado</institution>
          ,
          <addr-line>Denver, CO</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data integration is a major challenge in the life sciences because of heterogeneity, complexity, the proliferation of ad hoc formats and data structures, and poor compliance with FAIR guidelines. The Linked Data Modeling Language (LinkML, https://linkml.github.io) is an object-oriented data modeling framework that aims to bring semantic web standards to the masses, simplifying the production of FAIR, ontology-ready data. It can be used to schematize many kinds of data, ranging from simple, flat, checklist-style standards to complex, interrelated, normalized data that use polymorphism and inheritance. Although it is still a young and evolving standard, LinkML is already in use across a wide variety of projects, with applications including cancer data harmonization, environmental genomics, and knowledge graph integration.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology</kwd>
        <kwd>semantic web</kwd>
        <kwd>RDF</kwd>
        <kwd>JSON-schema</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data integration is a major challenge in the life sciences. In principle, ontologies and semantic web
formats can help address this problem, but these technologies are not sufficient in
themselves. Having an ontology for a domain does not guarantee that data can be exchanged robustly,
and semantic web standards are built on the open-world assumption, whereas most database use
cases require closed-world constraints.</p>
      <p>
        The Linked Data Modeling Language (LinkML [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], https://linkml.github.io) is an object-oriented
data modeling framework that aims to bring semantic web standards to the masses, simplifying the
production of FAIR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] ontology-ready data. It is intended for schematizing many kinds
of data, ranging from simple, flat, checklist-style standards to complex, interrelated, normalized data
that use polymorphism and inheritance. Although it is still a young and evolving standard, it is already in
use across a wide variety of projects, with applications including cancer data harmonization,
environmental genomics, and knowledge graph integration.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. LinkML Structure</title>
      <p>LinkML is designed to fit in well with frameworks familiar to most developers and database
engineers (JSON files, relational databases, document stores, Python object models) while at the same
time providing a solid semantic underpinning by mapping all elements to RDF URIs. LinkML’s formal
RDF-based framework allows semantics to hide in plain sight, while also making it easy for both
domain and technical experts to design schemas on a shared platform.</p>
      <p>An example of a simple schema represented in LinkML is shown in Figure 1.</p>
      <p>
        The basic structure is a schema with associated metadata (including a namespace-to-URI mapping) and a
set of classes with their attributes. Classes follow object-oriented semantics rather than OWL
semantics, and classes can be metaclasses; that is, a LinkML schema can be used to model the design
patterns in an ontology, with instances being OWL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] classes. Each element in the schema can be
assigned URIs from existing vocabularies, allowing for increased integration via semantic web
standards.
      </p>
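      <p>As a rough illustration of the structure described above, the following Python sketch holds a toy schema in a plain dictionary and resolves each class and attribute to a full URI through the namespace-to-URI mapping. It is only a sketch under stated assumptions: the dictionary layout, the key names, the example.org namespace, and the helpers expand_curie and element_uris are hypothetical and chosen for this example; they are not LinkML's YAML syntax or its runtime API.</p>
      <preformat>
```python
"""Toy illustration: schema elements resolving to URIs through a prefix map.

Hypothetical structure for illustration only; not the LinkML metamodel or API.
"""

# Hypothetical schema: metadata (prefixes), classes, and their attributes.
TOY_SCHEMA = {
    "prefixes": {
        "schema": "http://schema.org/",
        "ex": "https://example.org/toy/",
    },
    "classes": {
        "Person": {
            "class_uri": "schema:Person",
            "attributes": {
                "name": {"slot_uri": "schema:name"},
                "age": {"slot_uri": "ex:age"},
            },
        },
    },
}


def expand_curie(curie: str, prefixes: dict) -> str:
    """Expand a compact URI such as 'schema:name' using the prefix map."""
    prefix, _, local = curie.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


def element_uris(schema: dict) -> dict:
    """Return a mapping from each class or attribute name to its full URI."""
    prefixes = schema["prefixes"]
    uris = {}
    for class_name, class_def in schema["classes"].items():
        uris[class_name] = expand_curie(class_def["class_uri"], prefixes)
        for attr_name, attr_def in class_def["attributes"].items():
            key = f"{class_name}.{attr_name}"
            uris[key] = expand_curie(attr_def["slot_uri"], prefixes)
    return uris


if __name__ == "__main__":
    for element, uri in element_uris(TOY_SCHEMA).items():
        print(element, "->", uri)
```
      </preformat>
      <p>Keeping the prefix map in the schema metadata means that short, readable names can be used throughout a model while every element still has an unambiguous global identifier.</p>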
      <p>
        LinkML favors ontologies over free text and gives information meaning by establishing identity via
resolvable URIs. The framework lets the modeler work under either open-world or closed-world assumptions
and, when operating in a closed world, provides ways to validate and constrain schema instances and
their relations (in a variety of modeling paradigms such as JSON-Schema and SQL DDL). In
addition, the LinkML language itself reuses existing semantic standards. For example, it provides
modelers with a variety of mapping terms from the Simple Knowledge Organization System
Namespace (SKOS) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (e.g., the broad_mapping relation,
https://linkml.github.io/linkmlmodel/docs/broad_mappings.html, implements the broadMatch predicate,
https://www.w3.org/2009/08/skos-reference/skos.html#broadMat). These formalisms allow the
flexibility to extend or reuse existing object definitions while at the same time easily mapping data to
existing standards where appropriate (e.g., a ‘gene’ object in one LinkML schema can be mapped
directly to another LinkML schema’s representation of a ‘gene’ via the ‘skos:exactMatch’ predicate).
      </p>
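      <p>To make the closed-world idea concrete, the sketch below checks an instance against a toy class definition and reports undeclared, missing, or wrongly typed attributes. It is a minimal illustration only: PERSON_ATTRIBUTES and validate_closed_world are hypothetical names invented for this example, and the code does not reproduce how LinkML's own validators work.</p>
      <preformat>
```python
"""Toy closed-world check: unknown or missing attributes are reported as errors.

Hypothetical helper for illustration; LinkML's real validators are not shown.
"""

# Hypothetical class definition: attribute name -> (Python type, required?)
PERSON_ATTRIBUTES = {
    "id": (str, True),
    "name": (str, True),
    "age": (int, False),
}


def validate_closed_world(instance: dict, attributes: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    # Closed world: anything not declared in the schema is an error.
    for key in instance:
        if key not in attributes:
            problems.append(f"undeclared attribute: {key}")
    for name, (expected_type, required) in attributes.items():
        if name not in instance:
            if required:
                problems.append(f"missing required attribute: {name}")
            continue
        if not isinstance(instance[name], expected_type):
            problems.append(f"wrong type for attribute: {name}")
    return problems


if __name__ == "__main__":
    good = {"id": "ex:P1", "name": "Ada", "age": 36}
    bad = {"id": "ex:P2", "nickname": "Bob", "age": "unknown"}
    print(validate_closed_world(good, PERSON_ATTRIBUTES))
    print(validate_closed_world(bad, PERSON_ATTRIBUTES))
```
      </preformat>
      <p>Under an open-world reading, the undeclared attribute in the second instance would simply be additional information; under a closed-world reading, as here, it is a validation error.</p>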
      <p>
        LinkML tooling is another important piece of this framework. LinkML generators provide automatic
translations from the schema YAML to a growing number of other formats, including:
● JSON-schema[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
● JSON-LD/RDF[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
● SQL DDL
● ShEx[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
● GraphQL[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
● Python data classes
● Markdown[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
● UML diagrams[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
      </p>
      <p>This automated translation allows tooling from these frameworks to be easily reused and combined.
For example, JSON-Schema provides robust validators, and these can be used for any LinkML schema.</p>
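      <p>The kind of translation described above can be sketched as follows. This is a simplified, hypothetical example that turns a toy class definition into a JSON-Schema document using only standard JSON-Schema keywords; PERSON_CLASS and to_json_schema are names assumed for the illustration, and the output is not claimed to match what the LinkML JSON-Schema generator produces.</p>
      <preformat>
```python
"""Toy translation of a class definition into a JSON-Schema document.

Hypothetical sketch; this is not the output or code of LinkML's generator.
"""
import json

# Hypothetical class definition: attribute name -> (range, required?)
PERSON_CLASS = {
    "id": ("string", True),
    "name": ("string", True),
    "age": ("integer", False),
}


def to_json_schema(title: str, attributes: dict) -> dict:
    """Build a JSON-Schema object describing instances of one class."""
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": title,
        "type": "object",
        "properties": {
            name: {"type": range_} for name, (range_, _) in attributes.items()
        },
        "required": [name for name, (_, req) in attributes.items() if req],
        # Closed-world behavior: reject properties the schema does not declare.
        "additionalProperties": False,
    }


if __name__ == "__main__":
    print(json.dumps(to_json_schema("Person", PERSON_CLASS), indent=2))
```
      </preformat>
      <p>Once a schema is available in this form, any standards-compliant JSON-Schema validator can be applied to instance data without further custom code.</p>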
      <p>The LinkML runtime (https://github.com/linkml/linkml-runtime) provides loaders and dumpers
to convert instances of the schema between these formats.
Because LinkML also generates Python classes, it provides a clear path to distributing
data (via an API or one of the formats native to LinkML, such as JSON or TSV) in the same well-defined
format. LinkML tooling also auto-generates Markdown documentation and UML diagrams from the
schema YAML. The growing collection of LinkML schemas can be found at the LinkML schema
registry (https://github.com/linkml/linkml-registry).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Use Cases</title>
      <p>LinkML is already being used in a range of projects, including:
● National Microbiome Data Collaborative (https://microbiomedata.org/,
https://github.com/microbiomedata/nmdc-schema), for storing environmental microbiome
studies, associated samples, biogeochemical and environmental parameters, and associated
omics datasets and function predictions
● Center for Cancer Data Harmonization
(https://datascience.cancer.gov/data-commons/centercancer-data-harmonization-ccdh, https://github.com/cancerDHC/ccdhmodel), for human
patient and cancer sample data plus associated omics and imaging data
● The NCATS Biomedical Data Translator (https://ncats.nih.gov/translator,
https://github.com/biolink/biolink-model), for integrating multiple knowledge graphs through
the LinkML-authored Biolink schema
● The Alliance of Genome Resources (https://alliancegenome.org,
https://github.com/alliancegenome/agr_curation_schema), for modeling complex model organism data for a persistent
curation store
● The https://github.com/biodatamodels project, collecting schemas for core bioinformatics data
formats, including GFF3</p>
      <p>In summary, LinkML is a modeling framework that allows computers and people to work
cooperatively: it is platform agnostic, compilable down to RDF, easy to use by both domain and
technical experts, and self-documenting, and it allows modelers to map common concepts to other
well-defined resources and models. Most importantly, LinkML makes it easy
to store, validate, and distribute data that is reusable and interoperable.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgments</title>
      <p>This work is supported in part by the Genomic Science Program in the U.S. Department of Energy,
Office of Science, Office of Biological and Environmental Research (BER) under contract number
DE-AC02-05CH11231 (LBNL). Additional support was provided by NIH OD R24 OD011883, NHGRI
Center of Excellence in Genome Sciences RM1 HG010860, NHGRI 5U01HG009453-03, and NCI IAA
#ACO21007-001-00000.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>5. References</title>
      <ref id="ref1">
        <mixed-citation>[1] URL: https://github.com/linkml/linkml</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aalbersberg</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          et al.
          <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title>
          .
          <source>Sci Data</source>
          <volume>3</volume>
          ,
          <issue>160018</issue>
          (
          <year>2016</year>
          ). https://doi.org/10.1038/sdata.2016.18
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] https://www.w3.org/TR/owl2-manchester-syntax/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] https://www.w3.org/2009/08/skos-reference/skos.html</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] https://json-schema.org/</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] https://json-ld.org/</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] https://shex.io/</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] https://graphql.org/</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] https://www.markdownguide.org/</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] https://www.uml-diagrams.org/</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>