=Paper= {{Paper |id=Vol-3073/paper24 |storemode=property |title=The Linked Data Modeling Language (LinkML): A General-Purpose Data Modeling Framework Grounded in Machine-Readable Semantics |pdfUrl=https://ceur-ws.org/Vol-3073/paper24.pdf |volume=Vol-3073 |authors=Sierra Moxon,Harold Solbrig,Deepak Unni,Dazhi Jiao,Richard Bruskiewich,James Balhoff,Gaurav Vaidya,William D. Duncan,Harshad Hegde,Mark Miller,Matthew Brush,Nomi Harris,Melissa Haendel,Christopher J. Mungall |dblpUrl=https://dblp.org/rec/conf/icbo/MoxonSUJBBVDHMB21 }} ==The Linked Data Modeling Language (LinkML): A General-Purpose Data Modeling Framework Grounded in Machine-Readable Semantics== https://ceur-ws.org/Vol-3073/paper24.pdf
The Linked Data Modeling Language (LinkML): A General-
Purpose Data Modeling Framework Grounded in Machine-
Readable Semantics
Sierra Moxon1, Harold Solbrig2, Deepak Unni3, Dazhi Jiao2, Richard Bruskiewich4, James
Balhoff5, Gaurav Vaidya5, William Duncan1, Harshad Hegde1, Mark Miller1, Matthew
Brush6, Nomi Harris1, Melissa Haendel6, and Christopher Mungall1

1
  Lawrence Berkeley National Laboratory, Berkeley, CA, USA
2
  Johns Hopkins University, Baltimore, MD, USA
3
  European Molecular Biology Laboratory, Heidelberg, Germany
4
  Star Informatics, Victoria, BC, Canada
5
  RENCI, Chapel Hill, NC, USA
6
  University of Colorado, Denver, CO, USA


                                  Abstract
                                  Data integration is a major challenge in the life sciences, due to heterogeneity, complexity, the
                                  proliferation of ad-hoc formats and data structures, and poor compliance with FAIR guidelines.
                                  The Linked data Modeling Language (LinkML, https://linkml.github.io) is an object-oriented
                                  data modeling framework that aims to bring semantic web standards to the masses, simplifying
                                  the production of FAIR ontology-ready data. It can be used for schematizing a variety of kinds
                                  of data, ranging from simple flat checklist-style standards to complex interrelated normalized
                                  data utilizing polymorphism/inheritance. Although it is still a young and evolving standard,
                                  LinkML is already in use across a wide variety of projects with different applications including
                                  cancer data harmonization, environmental genomics, and knowledge graph integration.

                                  Keywords 1
                                  Ontology, semantic web, RDF, JSON-schema



1. Introduction
     Data integration is a major challenge in the life sciences. In principle ontologies and semantic web
formats can help address the problem of data integration, but these technologies are not sufficient in
themselves. Having an ontology for a domain does not guarantee that data can be exchanged robustly,
and semantic web standards are built on the open-world assumption, whereas for most database use
cases closed-world constraints are required.
    The Linked data Modeling Language (LinkML [1], https://linkml.github.io) is an object-oriented
data modeling framework that aims to bring semantic web standards to the masses, simplifying the
production of FAIR [2] ontology-ready data. It is intended to be used for schematizing a variety of kinds
of data, ranging from simple flat checklist-style standards to complex interrelated normalized data
utilizing polymorphism/inheritance. Although it is still a young and evolving standard, it is already in

International Conference on Biomedical Ontologies 2021, September 16–18, 2021, Bozen-Bolzano, Italy
EMAIL: smoxon@lbl.gov (A. 1); solbrig@jhu.edu (A. 2);                 deepak.unni3@gmail.com (A. 3); djiao@jhu.edu (A. 4),
richard.bruskiewich@delphinai.com (A. 5); balhoff@renci.org (A. 6); gaurav@ggvaidya.com (A. 7); wdduncan@lbl.gov (A. 8);
hhedge@lbl.gov (A. 9); mam@lbl.gov (A. 10); matt@tislab.org (A. 11); melissa@tislab.org (A. 12); cjmungall@lbl.gov (A. 13)
ORCID: 0000-0002-8719-7760 (A. 1); 0000-0002-5928-3071 (A. 2); 0000-0002-3583-7340 (A. 3); 0000-0001-5052-3836 (A. 4); 0000-
0002-4447-5978 (A. 5); 0000-0002-8688-6599 (A. 6); 0000-0003-0587-0454 (A. 7); 0000-0001-9625-1899 (A. 8); 0000-0002-2411-565X
(A. 9); 0000-0001-9076-6066 (A. 10); 0000-0002-1048-5019 (A. 11); 0000-0001-9114-8737 (A. 12); 0000-0002-6601-2165 (A. 13)
                               © 2020 Copyright for this paper by its authors.
                               Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                               CEUR Workshop Proceedings (CEUR-WS.org)
    CEUR
                ht
                 tp:
                   //
                    ceur
                       -ws
                         .or
                           g
    Works
        hop     I
                SSN1613-
                       0073
    Pr
     oceedi
          ngs
use across a wide variety of projects with different applications including cancer data harmonization,
environmental genomics, and knowledge graph integration.


2. LinkML Structure

   LinkML is designed to fit in well with frameworks familiar to most developers and database
engineers -- JSON files, relational databases, document stores, Python object models -- and at the same
time provide a solid semantic underpinning by mapping all elements to RDF URIs. LinkML’s formal
RDF-based framework allows semantics to hide in plain sight, while also making it easy for both
domain and technical experts to design schemas in a shared platform.
   An example of a simple schema represented in LinkML is shown in Figure 1.




Figure 1: An example of LinkML syntax.

   The basic structure is a schema plus associated metadata (including namespace to URI mapping), a
set of classes, plus their attributes. Classes follow object-oriented semantics rather than OWL
semantics, and classes can be metaclasses -- i.e., a LinkML schema can be used to model the design
patterns in an ontology, with instances being OWL [3] classes. Each element in the schema can be
assigned URIs from existing vocabularies, allowing for increased integration via semantic web
standards.
   LinkML favors ontologies over free text and gives information meaning by establishing identity via
resolvable URIs. The framework allows the modeler to model both open and closed world assumptions,
and when operating in a closed world, provides ways to validate and constrain schema instances and
their relations (in a variety of different modeling paradigms like JSON-Schema, SQL-DDL, etc.). In
addition, the LinkML language itself reuses existing semantic standards. For example, it provides
modelers with a variety of mapping terms from the Simple Knowledge Organization System
Namespace (SKOS) [4] (e.g., the broad_mapping relation, https://linkml.github.io/linkml-
model/docs/broad_mappings.html,             implements          the        broadMatch         predicate,
https://www.w3.org/2009/08/skos-reference/skos.html#broadMat). These formalisms allow the
flexibility to extend or reuse existing object definitions while at the same time easily mapping data to
existing standards where appropriate (e.g., a ‘gene’ object in one LinkML schema can be mapped
directly to another LinkML schema’s representation of a ‘gene’ via ‘skos:exact_match’ predicates.).
   LinkML tooling is another important piece of this framework. LinkML generators provide automatic
translations from the schema YAML to a growing number of other formats, including:
    ● JSON-schema[5]
    ● JSON-LD/RDF[6]
    ● SQL DDL
    ● ShEx[7]
    ● GraphQL[8]
    ● Python data classes
    ● Markdown[9]
    ● UML diagrams[10]

    This automated translation allows tooling from these frameworks to be easily reused and combined.
For example, JSON-Schema provides robust validators, and these can be used for any LinkML schema.
    The LinkML runtime provides loaders (https://github.com/linkml/linkml-runtime) and dumpers
(https://github.com/linkml/linkml-runtime) to convert instances of the schema between these formats.
And, because LinkML also generates (Python) class instances it provides a clear path to distributing
data (via API or one of the formats native to LinkML like JSON, TSV, etc.) in the same well-defined
format. LinkML tooling even auto-generates markdown documentation and UML diagrams from the
schema YAML. The growing collection of LinkML schemas can be found at the LinkML schema
registry (https://github.com/linkml/linkml-registry).


3. Use Cases

LinkML is already being used in a range of projects, including:

    ●   National Microbiome Data Collaborative (https://microbiomedata.org/,
        https://github.com/microbiomedata/nmdc-schema), for storing environmental microbiome
        studies, associated samples, biogeochemical and environmental parameters, and associated
        omics datasets and function predictions
    ●   Center for Cancer Data Harmonization (https://datascience.cancer.gov/data-commons/center-
        cancer-data-harmonization-ccdh, https://github.com/cancerDHC/ccdhmodel), for human
        patient and cancer sample data plus associated omics and imaging data
    ●   The NCATS Biomedical Data Translator (https://ncats.nih.gov/translator,
        https://github.com/biolink/biolink-model), for integrating multiple knowledge graphs through
        the LinkML-authored Biolink schema
    ●   The Alliance of Genome Resources (https://alliancegenome.org, https://github.com/alliance-
        genome/agr_curation_schema) for modeling complex model organism data for a persistent
        curation store
    ●   The https://github.com/biodatamodels project, collecting schemas for core bioinformatics data
        formats, including GFF3


    In summary, LinkML is a modeling framework that allows computers and people to work
cooperatively: it is platform agnostic, compilable down to RDF, easy to use by both domain and
technical experts, self-documenting and allows modelers to map common concepts to other well-
defined resources and models. Most importantly, LinkML is a modeling framework that makes it easy
to store, validate, and distribute data that is reusable and interoperable.
4. Acknowledgments

   This work is supported in part by the Genomic Science Program in the U.S. Department of Energy,
Office of Science, Office of Biological and Environmental Research (BER) under contract number DE-
AC02-05CH11231 (LBNL). Additional support was provided by NIH OD R24 OD011883, NHGRI
Center of Excellence in Genome Sciences RM1 HG010860, NHGRI 5U01HG009453-03, and NCI IAA
#ACO21007-001-00000.


5. References
[1] URL: https://github.com/linkml/linkml
[2] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for
scientific data management and stewardship. Sci Data 3, 160018 (2016).
https://doi.org/10.1038/sdata.2016.18
[3] https://www.w3.org/TR/owl2-manchester-syntax/
[4] https://www.w3.org/2009/08/skos-reference/skos.html
[5] https://json-schema.org/
[6] https://shex.io/
[7] https://json-ld.org/
[8] https://graphql.org/
[9] https://www.markdownguide.org/
[10] https://www.uml-diagrams.org/