Bringing Schemas to the Schemaless

Mark A. Miller 1, Rebecca C. Jackson 2 and Christopher J. Mungall 1
1 Lawrence Berkeley National Laboratory, Berkeley, CA, USA
2 Bend Informatics, Bend, OR, USA

Abstract
Ad-hoc spreadsheets can be appealing for managing some kinds of scientific data, but they allow bad practices like inconsistent types (e.g., string, integer, decimal) within individual columns, as well as open-ended categorical values (e.g., cell shapes). We have developed tools that infer LinkML schemas from spreadsheets, asserting types and enumerated lists that can be used to validate the dataset as it grows. Our tools can also associate OBO Foundry term identifiers with an enumeration’s permissible values in the spirit of interoperability.

Keywords
LinkML, spreadsheets, schemas, enumerations

1. Introduction
As with many other industries, small to medium groups of scientists are likely to use a spreadsheet application for data collection in the early phases of a project. Spreadsheets with a single tab or worksheet offer a ubiquitous, familiar interface. They also enable convenient column-wise and row-wise copy and paste, plus formula-based data transformations.

There are many well-characterized disadvantages to using spreadsheets for scientific data management [1]. The initial flexibility they offer can be rapidly eclipsed by the lack of formal typing of columns. Distinguishing between numeric, date and text columns is generally not difficult. However, determining whether a text column is intended to represent categorical (or enumerable) entities or open-ended narrative is outside the scope of routine spreadsheet usage. We have developed a suite of tools as part of the LinkML framework to bootstrap the discovery of a semantic schema for an ad-hoc spreadsheet. When our methods infer that all values in one column come from a small set of permissible values, that column is modeled as an enum, and the strings can be mapped to terms from the OBO Foundry.
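The enum inference described above (treating a column as enumerable when all of its values come from a small set) can be sketched with a simple heuristic. The function name, thresholds, and toy table below are illustrative assumptions, not the logic of the actual infer_model.py script.

```python
# Hypothetical heuristic for flagging enumerable columns: if a column's
# distinct values are few, both absolutely and relative to the row count,
# treat the column as an enum candidate.
import csv
from io import StringIO

def enumerable_columns(rows, max_distinct=25, max_ratio=0.1):
    """Return column names whose distinct values look like a closed set."""
    if not rows:
        return []
    distinct = {key: set() for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            distinct[key].add(value)
    n = len(rows)
    return [key for key, values in distinct.items()
            if len(values) <= max_distinct and len(values) / n <= max_ratio]

# Tiny demonstration table: 'cell_shape' repeats a small vocabulary,
# while 'species' is unique per row.
text = "species,cell_shape\n" + "\n".join(
    f"sp{i},{shape}" for i, shape in enumerate(["coccus", "bacillus"] * 10))
rows = list(csv.DictReader(StringIO(text)))
print(enumerable_columns(rows))  # ['cell_shape']
```

A production tool would additionally distinguish numeric, date, and free-text columns before applying such a test.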
We utilize tools and standards built by and for the OBO Foundry developers’ community, like the Ontology Lookup Service (OLS) API [2], SSSOM [3], and rdftab [4]. Since our methods can read and write LinkML [5] files in addition to spreadsheets and delimited text files, they open the door to LinkML’s support for numerous semantically oriented serialization and validation formats. Outcomes of processing spreadsheets with our methods can include:
• Discussions that lead to handshake agreements about how new data will be added to the spreadsheet, along with sanitization of the existing data.
• The decision to adopt LinkML as the team’s primary tool for data management.
• The design of a database and web frontend for managing the data. LinkML can generate SQL DDL for defining the necessary tables. Alternatively, LinkML provides validation frameworks for data stored in NoSQL formats like JSON and RDF.

International Conference on Biomedical Ontologies 2021, September 16–18, 2021, Bozen-Bolzano, Italy
EMAIL: MAM@lbl.gov (A. 1); rbca.jackson@gmail.com (A. 2); cjmungall@lbl.gov (A. 3)
ORCID: 0000-0001-9076-6066 (A. 1); 0000-0002-6601-2165 (A. 2); 0000-0002-4232-6880 (A. 3)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Background
We have developed a method for inferring a semantic model from the most common data serializations, including delimited text files. In the case of CSV or TSV files, it is assumed that each row represents one instance of a single class, each column header represents one predicate P, and the contents of each cell represent the object of a P relationship with that instance.
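The row-as-instance reading described above can be sketched in a few lines. The function and the toy table are illustrative, not part of the actual tooling.

```python
# Each row becomes one instance of a class, each column header a predicate,
# and each cell the object of that predicate for the row's instance.
import csv
from io import StringIO

def rows_to_triples(text, class_name, id_column):
    """Read delimited text and emit (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(StringIO(text)):
        subject = f"{class_name}:{row[id_column]}"
        for predicate, obj in row.items():
            if predicate != id_column:
                triples.append((subject, predicate, obj))
    return triples

text = "tax_id,cell_shape,motility\n562,bacillus,yes\n"
for triple in rows_to_triples(text, "Taxon", "tax_id"):
    print(triple)
# ('Taxon:562', 'cell_shape', 'bacillus')
# ('Taxon:562', 'motility', 'yes')
```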
Patterns observed in the columns can be used to specify the range of the predicates: all-numerical columns would specify a numerical range, columns with high entropy would specify a very open type, like string, and columns with extensive re-use of a limited set of textual values would be interpreted as categorical objects, perhaps the labels of classes that are defined elsewhere. Methods are provided for proposing term identifiers with the same semantics as the observed labels. Specifically, the one term that best accounts for the longest span of cell contents is injected back into the inferred model. We provide support for running term mappings against locally downloaded files or over REST APIs, like the one provided by OLS. We use SQLite for storing ontology content locally, so installing a database engine is not required. While the results are almost guaranteed to include some false positives and false negatives, the inclusion of lexical similarity metrics enables an expert reviewer to rapidly curate the mappings. Inferred models are serialized in YAML, following the LinkML framework. This serialization can be converted to a wide variety of formats familiar to ontologists, like RDF, OWL and JSON-LD; it can also be used to generate Python dataclasses, and it can drive validation when converted to ShEx or JSON Schema. The mappings from categorical cell contents to term identifiers can also be saved as SSSOM tables.

3. Methods
Our methods can be found in the turbomam/schemas-for-schemaless GitHub repository. There, we illustrate the usefulness of this schema inference and mapping workflow with a table of microbial traits from bacteria-archaea-traits/bacteria-archaea-traits. This repository aggregates over twenty sources of knowledge about bacterial and archaeal traits. One use case for the tabular outputs from this repository, such as condensed_species_NCBI.csv, is the construction of a knowledge graph such as Knowledge-Graph-Hub/kg-microbe.
If the knowledge graph is to succeed in sustainably linking even more data sources, string values like “flask” in the “cell_shape” column should be normalized to identifiers like MICRO:0000406, which bears the label “pear-shaped cell” and the exact synonym “flask”. We start with a Makefile that:
• clones the linkml/linkml-model-enrichment GitHub repository
• clones bacteria-archaea-traits/bacteria-archaea-traits
• applies the infer_model.py script from linkml-model-enrichment/linkml_model_enrichment to the condensed_traits_NCBI.csv file from bacteria-archaea-traits/output/, producing an inferred schema in the LinkML format, serialized as a YAML file
• downloads the rdftab executable
• downloads ontologies that we expected to be applicable to the bacterial traits file: pato.owl and uberon.owl
• creates an SQLite database and populates it with statements from the downloaded OWL files, using rdftab
The remainder of the workflow is implemented in Python:
• Term ID and label pairs are quickly retrieved from the rdftab SQLite database.
• The inferred LinkML schema for condensed_traits_NCBI.csv is parsed, and the permissible values for the cell_shape column are extracted.
• The permissible values are sanitized with lowercasing, whitespace normalization and replacement of several punctuation characters.
• A first pass of mapping the cell_shape_enum permissible values to OBO Foundry terms is performed by merging the sanitized values against the labels obtained from the rdftab SQLite database.
• Sanitized cell_shape_enum permissible values that don’t match any labels from the rdftab SQLite database are submitted to the OLS search API.
• Additional metadata for each matching class is retrieved with a separate API call in order to determine what kind of relationship justified the match (exact synonym from GenBank, close synonym, etc.).
• One best OLS hit for each sanitized cell_shape_enum value is retained, based on relevance and string similarity metrics.
• The term mappings from the rdftab SQLite and OLS paths are merged, and any sanitized cell_shape_enum permissible values that failed to map are reported.
• The mappings are injected back into the LinkML model.

4. Results
condensed_traits_NCBI.csv consists of 172,325 taxon rows × 35 columns of identifiers, metadata and traits. infer_model.py automatically infers that the following columns take enumerable values:
• cell_shape
• data_source
• gram_stain
• metabolism
• motility
• rRNA16S_genes
• range_salinity
• range_tmp
• sporulation
• superkingdom
infer_model.py can be forced to consider other columns as enumerable with the --enum-columns flag. The cell_shape values in condensed_traits_NCBI.csv are already well normalized, and the sanitization step does not shrink the number of strings that require mappings:
• bacillus
• branced
• coccobacillus
• coccus
• disc
• filament
• flask
• fusiform
• irregular
• NA
• pleomorphic
• ring
• spindle
• spiral
Four terms obtain mappings via merging with the rdftab database. (See the Jupyter notebook for an accounting [6].) All of the remaining values obtain hits from the OLS search. However, seven values do not retrieve any acceptable mappings. For example, the cosine string distance between the sanitized query and the best-matching annotation for the best-matching term may be greater than our default cutoff of 0.05. Seven of the OLS matches come from MICRO, and one each comes from OMP and NCBITAXON.

5. Discussion
Our method rapidly infers a schema from the bacterial and archaeal traits table and discovers columns that appear to take enumerable values. Two methods are provided and demonstrated for mapping the string representations of the enumerable values to terms from OBO Foundry ontologies. The selection between those two methods is informed by one’s tolerance for downloading ontologies to local storage versus depending upon a REST API. Visual inspection of the mappings suggests that they are all sound.
Mapping failures can be attributed to misspellings in the input (e.g., branced instead of branched) or to limitations placed on the search space. For example, our OLS search was limited to NCBITAXON, MICRO and OMP. Had FMA been included, the cell morphology term FMA:70989 would have been discovered for the sanitized input ‘spindle’. Prioritizing quality over quantity, we were conservative in our selection of ontologies.

We have found that these methods work well for a variety of sources, including the INSDC BioSample metadata collection [7]. BioSample contains ~20 million records and ~500 entity types and attributes, although we have only mapped a few attributes from tens of thousands of records at any one time. Qualitatively speaking, the sensitivity and specificity of mapping gene names and symbols to OBO Foundry terms have not been as good as they are for microbial traits, sequence features, and environmental features. We are considering the addition of searches against gene- or protein-specific databases outside of the OBO Foundry, like UniProt or Entrez Gene. Whether we would take a local download approach or a REST API approach remains to be determined.

6. Acknowledgements
This work is supported in part by the Genomic Science Program in the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER) under contract number DE-AC02-05CH11231 (LBNL).

7. References
[1] Ghada AlTarawneh, Simon Thorne, A Pilot Study Exploring Spreadsheet Risk in Scientific Research, in: Proceedings of the EuSpRIG 2016 Conference “Spreadsheet Risk Management”, pp. 49-69. ISBN: 978-1-905404-53-7.
URL: https://arxiv.org/abs/1703.09785
[2] Ontology Lookup Service API. URL: https://www.ebi.ac.uk/ols/docs/api
[3] SSSOM. URL: https://github.com/mapping-commons/SSSOM
[4] rdftab. URL: https://github.com/ontodev/rdftab.rs
[5] LinkML. URL: https://github.com/linkml/linkml
[6] Jupyter notebook. URL: https://github.com/turbomam/schemas-for-schemaless/blob/main/notebooks/schemas-for-schemaless.ipynb
[7] NCBI BioSample. URL: https://www.ncbi.nlm.nih.gov/biosample/