The Application of Semantic Resources and Technologies for the Discovery and Integration of Geo- and Biosciences Data
Michael Diepenbroek 1
1 MARUM, University of Bremen, Bremen, Germany
Abstract
Large-scale and complex questions in science, such as global warming, invasive species spread, and resource depletion, increasingly require the collection of disparate data sets from various data sources that build on different knowledge domains in science and society. Structured data and metadata with consistent semantics are prerequisites for data usability, in particular for the findability of data and efficient data integration. Ontologies, thesauri, and vocabularies for various science domains have evolved tremendously over the last decade and play a key role in the harmonization of data. Nevertheless, the application of terminologies in the context of data production, archiving, and publication is still in its early stages. In addition, the features and usability of terminology services vary greatly. The situation is aggravated by the complexity and dynamic growth of measurement and observation types (parameters), including the methods used, which are essential for integrating data from distributed sources. The ISC World Data Center PANGAEA (www.pangaea.de), with ~200,000 parameters and methods linked to more than 400,000 data sets, covers a large part of the scientific fields in the earth and environmental sciences. For the harmonization of parameters and methods, PANGAEA has (1) embedded a term catalogue (TC) comprising various relevant terminologies, including taxonomies, into its editorial system, (2) conceptualized parameters and methods by setting up a basic syntax and rule set, and (3) implemented routines based on full-text search for matching parameter and method names with terms from the TC. Although these measures have been quite successful, the approach is limited to PANGAEA as a single data provider, and the effort required is high.
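The matching of parameter names against a term catalogue, as in step (3) above, can be illustrated with a minimal sketch. It is not PANGAEA's implementation, which is not described here: the catalogue entries, the `tc:` identifiers, and the function names are all hypothetical, and a simple token-sequence comparison stands in for a real full-text search engine.

```python
import re

# Hypothetical miniature term catalogue (label -> identifier).
# Real terminologies would be loaded from a terminology service.
TERM_CATALOGUE = {
    "temperature": "tc:0001",
    "water": "tc:0002",
    "salinity": "tc:0003",
    "emiliania huxleyi": "tc:0004",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_terms(parameter_name, catalogue=TERM_CATALOGUE):
    """Return (label, identifier) pairs for every catalogue term whose
    tokens occur as a contiguous sequence in the parameter name."""
    tokens = tokenize(parameter_name)
    matches = []
    for label, term_id in catalogue.items():
        label_tokens = tokenize(label)
        n = len(label_tokens)
        # Slide a window of n tokens over the parameter name.
        if any(tokens[i:i + n] == label_tokens
               for i in range(len(tokens) - n + 1)):
            matches.append((label, term_id))
    return matches


if __name__ == "__main__":
    for name in ("Temperature, water", "Emiliania huxleyi, abundance", "Depth"):
        print(name, "->", match_terms(name))
```

A parameter name such as "Temperature, water" is thereby linked to two catalogue terms, while a name with no catalogue counterpart returns an empty list and would need manual curation.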
More recently, Germany launched the National Research Data Infrastructure (NFDI - https://www.nfdi.de/) initiative with a number of consortia covering various science domains. The NFDI4BioDiversity (www.nfdi4biodiversity.org) consortium, which started in 2020, leads the development of a multi-cloud-based infrastructure supported by almost all existing consortia. The so-called NFDI Research Data Commons (RDC) will enable uniform access to data, software, and compute resources as well as sovereign data exchange and collaborative work. Harmonization of the semantics of data during the ETL (extract, transform, load) process will be supported by terminology services enhanced by AI technologies. With the RDC, a paradigm shift from data shipping to function shipping is initiated. The approach aligns well with initiatives like the European Open Science Cloud, EOSC (https://eosc-portal.eu/), the NCI RDC (https://datascience.cancer.gov/data-commons), or the Australian RDC (https://ardc.edu.au/). Nevertheless, integrating and harmonizing data into the conceived cloud-based systems remains a major challenge: (1) terminology services need to improve in quality and functionality; (2) AI technologies are not yet part of the integration process; (3) more convergence towards metadata standards usable across domains, such as schema.org, which allows community-specific extensions like BioSchemas (https://bioschemas.org/), is needed to keep the ETL process manageable; and finally (4) a comprehensible, generally applicable model for the definition of parameters and methods would make the task considerably easier. The latter should build in part on the UCUM system (https://ucum.org/) for scientific units.
1 S4BioDiv 2021: 3rd International Workshop on Semantics for Biodiversity, held at JOWO 2021: Episode VII The Bolzano Summer of Knowledge, September 11-18, 2021, Bolzano, Italy
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
Biography
Dr. Michael Diepenbroek, Geologist and IT Specialist. 1992 PhD in Geology at the Free University of Berlin; 1992-94 at the computer center of the AWI, Bremerhaven; 1994-97 conception and implementation of the scientific information system PANGAEA®; 1998-2021 at MARUM, University of Bremen, where he was responsible for the management of PANGAEA®. During the last 10 years he took a leading role in establishing PANGAEA as a global service provider for scientific data, in particular through mandates from the ISC (ISC World Data System - Vice-Chair of the Scientific Committee 2009-2016) and the WMO (WMO Information System), collaborations with major science publishers, and as Chair/Co-chair in various RDA groups. Coordinator of the German Federation for Biological Data (GFBio) (2013-2020). Since 2017 engaged in the National Research Data Infrastructure Initiative (NFDI), in particular in the conception and preparation of NFDI4BioDiversity (https://www.nfdi4biodiversity.org). Since 2021 working for GFBio e.V. as part of NFDI4BioDiversity.