Meaningful Data Interoperability and Reuse among Heterogeneous Scientific Communities

© Nikolay Skvortsov
Institute of Informatics Problems, Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow, Russia
nskv@mail.ru

Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018

Abstract. FAIR data principles declare data interoperability and reuse through the use of machine- and human-readable specifications. Adherence to these principles has some consequences for data infrastructures and research communities. Meaningful data exchange and reuse by humans and machines require formal specifications of subject domains that accompany the data and allow automatic inference. The development of formal conceptual specifications in research communities might be stimulated by the need to reach semantic interoperability of data collections and components and reuse of data resources. The data lifecycle hence includes collecting domain knowledge specifications, classifying all data, methods and services by these specifications, and collecting and sharing them for reuse. Formal inference allows meaningful search and verified reuse of data, methods and services from collections.

Keywords: FAIR data principles, conceptual modeling, research community

1 Introduction

Curating and sharing research data so that it is reusable by both humans and machines has been a topical issue in recent years. For example, the WF4Ever project [1] is aimed at preserving data, workflows and research results for sharing and reuse. Research objects are declared as containers that encapsulate data, metadata, workflows, documentation and links to external resources, and that share all resources related to a piece of research with a community.

Collaborative data infrastructures support sharing of various resources such as collections, archives, databases, storage and computing capacities, and provide services to search, access and manage them. For example, EUDAT [2] is a network of numerous community-specific data repositories and some of Europe’s largest data centers, using common data services for data and service providers and research communities. The EUDAT Collaborative Data Infrastructure (CDI) is a European infrastructure of integrated data services and resources to support research.

Heterogeneous research data infrastructures interact to share research data globally and make science open. The EOSC [3] initiative integrates services and data from research data infrastructures, and provides curation and preservation of scientific data repositories as well as computing capacity for research data analysis.

FAIR data principles [4] have gathered basic features used in data curation and preservation and are now being propagated in research data infrastructures and open science. These principles are aimed at providing data interoperability and reuse by machines and humans. For this purpose data should be well identified, specified with ontologies, accompanied by provenance information, and should comply with known data models or have a known mapping to them.

FAIR data principles have been defined informally, so they give rise to a number of different interpretations, including the application of Linked Data principles to provide FAIR ones [5], lists of more detailed informal requirements based on FAIR ones [6], or just a simplified numerical rating of conformity with FAIR principles [7]. At the same time, it seems that FAIR data principles should have some definite consequences for the requirements placed on research data infrastructures. Those relevant to data semantics problems with respect to research communities are discussed in this talk.

2 Subject domain specifications

FAIR data principles declare data interoperability and reuse through the use of machine- and human-readable specifications. This means that data are FAIR only if there is an approach to define and clarify the semantics of data in some domains of knowledge. Meaningful data exchange and reuse by machines (helpful for humans too) require sufficiently formal specifications of subject domains that allow automatic inference.

Similarity and machine learning approaches could be applied to help humans search and operate with data, but they do not provide formal specifications of the resources used or evidence-based inference over metadata. Domain knowledge should define restrictions and permissible states of data from the point of view of a specific domain. Advanced ontological and rule models should be used for metadata development.

Conceptualization and conceptual specifications are necessary not only in general domains, but also in the domains of interest of narrower and more specialized communities, as well as in overlapping domains, in which cooperation of research teams and reuse of specifications often occur.

Most research is conducted at the intersection of several domains, so it uses the constraints of several domains simultaneously as points of view for specifying research objects. Inference in multidomain specifications should establish relations and semantic interoperability between data belonging to different domains.

3 Collections of methods and experiment specifications

For comprehensive investigations of specific real-world entities, it is important to share data, tools, research results, methods, and specifications defining the semantics of entities and phenomena in the domain as well as the semantics of the methods applied to them. Thus, no matter which kind of information object is used for research, it should be supplied with metadata in terms of ontologies. Such objects include data, metadata, publications, implementations of research methods, and workflows describing the research processes. Inference over ontologies makes it possible to select them from collections and access them by the selected identifiers.

Semantics-based approaches to research objects should be supported by inseparable linking of data with well-defined methods related to the objects of research. This means that method collections are considered a specific kind of data. Methods used in any research domain should be defined, conceptually specified and collected in addition to general-purpose methods such as multidimensional data analysis or machine learning. Meaningful access to known implementations of methods should be provided to humans and machines and be understandable to both.

Experiments over data in research infrastructures are constructed using shared and interoperable data, services and workflows. Research experiments can include data analysis, modelling in accordance with hypotheses, and testing models against observational data. Besides providing access to collections of data and method implementations, research infrastructures should include instruments for experiment support, in particular for the formulation and testing of hypotheses [8].

4 The role of communities

Since shared semantics of research objects are becoming increasingly important for data reuse in each specific discipline or subject domain, heterogeneous communities working in a domain should have conceptual specifications related to their research and approaches and maintain a strong commitment to them.

Communities of researchers, vendors of analytical tools and research instruments, and data owners are interested in long-term shared access to heterogeneous data and method collections. So the only way to conceptualize and formally specify a domain is development within communities, stimulated by the need to reach semantic interoperability of interacting components, integration of data collections, reuse of data resources, and method reproducibility through binding to the semantics of the subject domains.

Community members (humans or machines) operate within the ontological commitment defined by shared ontologies, i.e. they use the concepts of the subject domain in a way that is consistent with the theories specified by the ontologies. Ontologies are important for automating consistency control over any manipulation of the domain concepts. Interaction of communities in solving interdisciplinary problems requires simultaneous querying using different domain vocabularies. In that case, the researchers should commit to the specifications of several domains.

The activities of communities are defined by the data lifecycle so as to provide their interoperability and reuse in related domains. Maintenance of shared domain specifications becomes a basis for arranging collections of data and sources and collections of specific methods, and for embedding research results into such collections for further research.

Acknowledgments

The work was supported by the Russian Foundation for Basic Research (grant 18-07-01434).

References

[1] Belhajjame K., et al.: Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In: ESWC2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica2012), pp. 1-12. Heraklion (2012)
[2] Schentz H., le Franc Y.: Building a semantic repository using B2SHARE. In: EUDAT 3rd Conference (2014)
[3] EOSC Declaration. https://ec.europa.eu/research/openscience/pdf/eosc_declaration.pdf
[4] Wilkinson M., et al.: The FAIR Guiding Principles for scientific data management and stewardship. In: Scientific data, vol. 3 (2016)
[5] Wilkinson M.D., et al.: Interoperability and FAIRness through a novel combination of Web technologies. In: PeerJ Preprints 5:e2522v2 (2017). https://doi.org/10.7287/peerj.preprints.2522v2
[6] Guidelines on FAIR Data Management in Horizon 2020. Directorate-General for Research and Innovation, European Commission (2016). http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
[7] Doorn P., Dillo I.: FAIR Data in Trustworthy Data Repositories. DANS / EUDAT / OpenAIRE Webinar (2016). https://eudat.eu/events/webinar/fair-data-in-trustworthy-data-repositories-webinar
[8] Skvortsov N., Kalinichenko L., Kovalev D.: Conceptualization of Methods and Experiments in Data Intensive Research Domains. In: Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2016). CCIS, vol. 706, pp. 3-17. Springer (2017)
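
Illustrative sketch. Section 3 states that inference over ontologies makes it possible to select data and methods from collections and access them by identifiers. The Python fragment below is a minimal sketch of that idea and is not taken from the paper. The concept names, the resource identifiers and the helper functions `ancestors` and `select` are all invented for illustration, and the only inference performed is transitive subclass subsumption. A real infrastructure would presumably rely on an ontology language and a reasoner rather than plain dictionaries.

```python
# Hypothetical toy ontology: each concept maps to its direct superconcepts.
# All concept names here are assumptions made for the example.
SUBCLASS_OF = {
    "VariableStar": {"Star"},
    "Star": {"AstronomicalObject"},
    "Galaxy": {"AstronomicalObject"},
}

# Hypothetical collection: identifiers annotated with ontology concepts.
# Data and methods live in the same collection, distinguished by "kind".
RESOURCES = {
    "res:variable-star-catalog": {"kind": "data", "concepts": {"VariableStar"}},
    "res:galaxy-survey": {"kind": "data", "concepts": {"Galaxy"}},
    "res:light-curve-classifier": {"kind": "method", "concepts": {"VariableStar"}},
}


def ancestors(concept):
    """Return the concept together with all of its transitive superconcepts."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def select(concept, kind=None):
    """Return sorted identifiers of resources annotated with the concept
    or with any of its subconcepts, optionally filtered by resource kind."""
    return sorted(
        rid
        for rid, meta in RESOURCES.items()
        if (kind is None or meta["kind"] == kind)
        and any(concept in ancestors(c) for c in meta["concepts"])
    )


if __name__ == "__main__":
    # A query for "Star" also finds resources annotated with "VariableStar".
    print(select("Star"))
    print(select("AstronomicalObject", kind="data"))
```

In this sketch a query in terms of a general concept retrieves resources annotated with more specific ones, which a purely keyword-based search over the same annotations would miss. Treating methods as one more kind of annotated resource mirrors the paper's statement that method collections are considered a specific kind of data.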