=Paper=
{{Paper
|id=Vol-3006/09_short_paper
|storemode=property
|title=A multi-platform ecosystem for computing in Earth sciences
|pdfUrl=https://ceur-ws.org/Vol-3006/09_short_paper.pdf
|volume=Vol-3006
|authors=Vitaliy S. Eremenko,Vera V. Naumova
}}
==A multi-platform ecosystem for computing in Earth sciences==
Vitaliy S. Eremenko, Vera V. Naumova

Vernadsky State Geological Museum RAS, Moscow, Russia

Contact: vitaer@gmail.com (V. S. Eremenko)

SDM-2021: All-Russian conference, August 24–27, 2021, Novosibirsk, Russia. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pp. 67–73.

===Abstract===
Analysis of diverse data in the geosciences requires access to a variety of processing tools, including specialized software packages, proprietary algorithms, GIS and web services. Such tools demand a certain level of user skill, a configured software environment, sufficient computing power, and sometimes significant time and money. As information technology develops, more and more software products, including professional data processing packages, are offered to users as cloud services and platforms. In such systems, computation takes place on the service provider's side and is accessed through a web browser. The emergence of open-access services and platforms of this kind makes it possible to organize a single workspace in which a researcher can analyze their own data using various processing methods, including tools traditionally used in the Earth sciences. This report presents an approach to integrating open-access tools for processing heterogeneous data within a single multi-platform ecosystem and demonstrates the software system built on that approach. It describes the software modules that provide access to processing and analysis tools, as well as the service modules for system administration, component monitoring and event logging. The services and processing platforms integrated into the ecosystem are reviewed, along with scenarios for solving some geological problems.

Keywords: web services, cloud services, processing of geological data.

===1. Introduction===
Since 2017, the Vernadsky State Geological Museum RAS has been developing an information and analytical environment to support scientific research in geology (http://geologyscience.ru) [1]. The environment provides a single point of access to various types of geological information on the territory of Russia, together with a set of tools for processing and analyzing these data.

Processing and analyzing geological data requires a large number of different algorithms, processing procedures and corresponding software solutions. As information technology has developed, approaches to organizing such processing have changed. Analyzing large volumes of data, or applying resource-intensive methods, required faster computation. Two approaches were used to achieve this: parallel computing on supercomputers, and distributed computing on a large number of devices combined into a single computing system, such as GRID systems. However, access to such systems is limited. The advent of cloud services has given users easy access to the computing resources they need. Using external services instead of custom applications allows data to be processed on the equipment best suited to the task.
Thus, data processing becomes more efficient, and the user can process data with the most current versions of the algorithms through a web interface, without having to install, configure and maintain processing software on a personal computer.

Because the volume of geological data and the number of tools for analyzing them grow continuously, it becomes necessary to organize a single workspace in which a research geologist can use the open-access tools for analyzing geological information that are available worldwide. The purpose of the computing block of the information and analytical environment is to provide researchers with access to geographically distributed services for processing and analyzing geological information (https://service.geologyscience.ru) [2, 3].

Basic functional requirements:
* the ability to use an unlimited number of external open information processing services;
* support for data provided by users and by open information systems;
* cataloging of external processing services.

===2. Implementation===
To provide a single mechanism for interacting with web services, the ecosystem uses the OGC Web Processing Service interface (WPS, http://www.opengeospatial.org/standards/wps), developed by the Open Geospatial Consortium, an international standards organization. The interface is built on the HTTP(S) protocol, is widely used in scientific data processing systems [4], and is supported by a variety of software products that work with geoinformation services.

Basic requirements for a service of this type:
* a permanent IP address (or domain name) and a port accessible from the Internet;
* a client programming interface (API) with libraries implementing it in Java, or access via the HTTP/HTTPS protocol (REST or SOAP);
* the ability to start a processing procedure with specified parameters;
* the ability to obtain the processing result in text or binary formats, including as URL links;
* the ability to work with user data in one of the following ways:
** reading remotely hosted data by URL;
** temporarily uploading data to the computing node;
** passing data in binary format as a processing parameter.

The open-source software package GeoServer (http://geoserver.org) was selected to create and host the project's own WPS processes. A separate WPS process with the corresponding startup parameters is created for each external service. Using WPS as the access interface to remote services makes it possible to execute several computational processes in sequence: the outputs of one or several processes can serve as input parameters for another, so that the results of processing different types of data can be combined.

Cloud services that give users interactive access through a web interface are harder to integrate through a common access protocol, because each platform uses its own access protocols. However, the platforms share some properties, such as centralized data storage and single authentication systems, that make interaction possible: data for analysis are transferred, on behalf of a specific user, to the data storage of the corresponding platform. Such a mechanism became feasible with web-application technology that lets an application ask the user for permission to access certain capabilities of the user's account with a cloud service provider. This technology is supported by Microsoft, Google, Yandex, ESRI and others.
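The permission request described above can be illustrated with a short sketch. The paper does not name the underlying protocol; the sketch assumes an OAuth 2.0 authorization-code flow, which the listed providers support. The endpoint URL, client identifier, redirect URI and scope string are hypothetical placeholders, not values used by the ecosystem.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scope, state=None):
    """Build the URL to which the user's browser is sent to grant access.

    Parameter names follow the OAuth 2.0 authorization-code flow
    (RFC 6749, section 4.1.1). Using OAuth 2.0 is an assumption of this sketch.
    """
    if state is None:
        # Random value echoed back by the provider, so the application can
        # match the response to the request it actually sent.
        state = secrets.token_urlsafe(16)
    query = urlencode({
        "response_type": "code",    # ask for an authorization code
        "client_id": client_id,     # identifies the registered application
        "redirect_uri": redirect_uri,
        "scope": scope,             # capabilities the user is asked to grant
        "state": state,
    })
    return f"{authorize_endpoint}?{query}", state


# Hypothetical values, for illustration only.
url, state = build_authorization_url(
    authorize_endpoint="https://provider.example/oauth2/authorize",
    client_id="geology-ecosystem-demo",
    redirect_uri="https://ecosystem.example/callback",
    scope="files.readwrite",
)
params = parse_qs(urlparse(url).query)
print(params["response_type"][0], params["scope"][0])
```

Under this assumption, once the user approves the request the provider redirects the browser to the redirect URI with a short-lived code, which the application exchanges for an access token. That token is what would allow data to be placed in the user's storage on the user's behalf.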
Before using a cloud service, the user must upload the data for analysis to their personal storage on the corresponding platform. The user signs in on the service provider's website, after which the application asks the user for permission to upload and publish data in that storage. When switching between cloud services of different providers, the user can move data for processing from one storage to another. We have thus proposed an approach and a technological solution for organizing a single data space across cloud service providers.

A monitoring module has been developed to track the state of the geographically distributed components of the ecosystem [5]. It supports the following general types of checks:
* A. Checking the availability of a remote site.
* B. Checking that a service at the remote site operates over the required communication protocol.
* C. Checking for changes in the behavior of a service by means of test requests.

The catalog of external services for processing and analyzing geological information serves as the data source for monitoring.

===3. Computing capabilities===
The ecosystem provides access to the following services and processing platforms.
* Computing node "Multidimensional methods of data analysis", developed at the State Geological Museum of the Russian Academy of Sciences. It processes tabular data with a range of data analysis methods, lets the user set their parameters, and visualizes the results. The node includes groups of methods such as statistical analysis, regression analysis, factor analysis, clustering, machine learning and visualization. Calculations are performed in Python with well-known data processing packages: scikit-learn, pandas, Matplotlib and others. Incoming processing requests are handled by the Flask framework through a REST API and placed in a task queue implemented on top of the Redis NoSQL database. This architecture separates request handling from heavy computations on large amounts of data, which provides fault tolerance and scalability of the node. In the future, the set of data processing methods is to be extended with specialized packages for solving geological problems.
* Petrological and geochemical data processing. An interactive database of methods for processing petrological and geochemical data has been developed at the Schmidt Institute of Physics of the Earth of RAS [6]. The system provides services for constructing spidergrams, histograms and classification diagrams, a service for identifying minerals by their chemical composition, a service for interpreting the composition of a mineral and decomposing it into end-members, and others. The services are accessed through a REST interface.
* Structural analysis of publications. The Interdisciplinary Center for Mathematical and Computational Modeling (University of Warsaw, Poland) has developed a service for extracting metadata from scientific publications [7]. The metadata include authors, affiliations, abstract, keywords, journal title, volume, year of issue, parsed bibliographic references, the section structure of the document, section headings and paragraphs. The service is accessed through a REST interface.
* Natural language processing. The GATE (General Architecture for Text Engineering) project at the University of Sheffield has developed a number of services for processing text in various languages [8]. For Russian text, services are available for part-of-speech tagging and for recognizing named entities such as personal names, names of organizations, geographical names, dates and monetary units. The services are accessed through a REST interface.
* Microsoft Office Online is a cloud platform that provides web versions of software products such as Word, Excel, PowerPoint and OneNote. Microsoft Excel (https://www.office.com/launch/excel) is one of the key tools for working with tabular data in geology. It contains tools for viewing and editing tabular data, as well as a set of analytical functions and tools for building various types of charts.
* ArcGIS Online (https://www.arcgis.com) is a cloud-based mapping and analysis solution. The ArcGIS platform lets users explore and visualize 2D and 3D data. One of its key features is that multiple users can collaborate on the same data. The platform provides tools for creating web maps, 3D scenes and notebooks. ArcGIS Notebooks give access to Python resources for performing analysis, automating workflows and visualizing data.
* Google Earth Engine (https://earthengine.google.com) is a satellite data analysis platform. It allows the user to upload their own data or use data from the Earth Engine catalog for further interactive processing. The catalog contains data products from the MODIS radiometer (Aqua and Terra satellites), Sentinel-1A, Sentinel-1B, Sentinel-2A, Sentinel-2B, Landsat 8 and others. Processing scripts can be created, edited and launched in the JavaScript and Python programming languages. To work with the platform, the user needs a Google account. Data from Google cloud storage can be used for analysis and processing.

The general functional diagram of the multi-platform ecosystem is shown in Figure 1.

Figure 1: General functional diagram of a multi-platform ecosystem.

===4. Testing===
The system functions of the processing platform were tested with the automatic testing tools provided by the framework used in development.

List of tested system functions:
* interaction with the platform data store;
* interaction with the service catalog;
* interaction with WPS services;
* interaction with third-party data stores.

Figure 2: Service catalog interface for geological data analysis.

The services presented on the platform were tested manually, using initial data, input parameters and expected results taken from sources relevant to the subject of each service. For example, the services of the computing node "Multidimensional methods of data analysis" were tested with materials from the book by J. Davis, "Statistics and analysis of geological data".

===5. Conclusion===
The ecosystem is being developed to provide geological research with modern methods and tools for analyzing geological information, which implies a further increase in the number and variety of processing tools offered by the platform. The principles behind the ecosystem can be used in the future to create digital computing platforms that support and accompany scientific research.

===Acknowledgments===
The study is supported by Government contract No. 0140-2019-0005 with SGM RAS, "Development of an information environment for integrating data from natural science museums and their processing services for Earth sciences".

===References===
* [1] Naumova V.V., Platonov K.A., Eremenko V.S., Patuk M.I., Dyakov S.E. Information and analytical environment for supporting scientific research in geology: current state and development prospects // Proceedings of the XVII International Conference "Distributed Information and Computing Resources (DICR-2019)". 2019. P. 139–147. (In Russ.)
* [2] Eremenko V.S., Naumova V.V., Platonov K.A., Dyakov S.E., Eremenko A.S. The main components of a distributed computational and analytical environment for the scientific study of geological systems // Russian Journal of Earth Sciences. 2018. Vol. 18. Is. 6.
* [3] Eremenko V.S., Naumova V.V., Zagumennov A.A., Bulov S.V. Cloud technologies for development of geographically distributed computational and analytical geological environment // Computational Technologies. 2021. Vol. 26. No. 1. P. 86–98.
* [4] Bychkov I.V., Ruzhnikov G.M., Fjodorov R.K., Shumilov A.S. Components of WPS-services for geodata processing environment // Vestnik NSU. Series: Information Technologies. 2014. Vol. 12. No. 3. P. 16–24. (In Russ.)
* [5] Eremenko V.S., Naumova V.V. A system for cataloging and monitoring geographically distributed computing nodes in an environment of WPS services for solving geological problems // Vestnik NSU. Series: Information Technologies. 2019. Vol. 17. No. 2. P. 39–48. (In Russ.)
* [6] Ivanov S.D. Interactive web application based geosensors registry // Computer Research and Modeling. 2016. Vol. 8. No. 4. P. 621–632. (In Russ.)
* [7] Tkaczyk D., Szostek P., Fedoryszak M., Dendek P., Bolikowski L. CERMINE: Automatic extraction of structured metadata from scientific literature // International Journal on Document Analysis and Recognition. 2015. Vol. 18. No. 4. P. 317–335.
* [8] Maynard D., Bontcheva K., Augenstein I. Synthesis lectures on the semantic web: Theory and technology // December 2016. Vol. 6. No. 2. P. 1–194.