Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

ANALYTICAL PLATFORM FOR SOCIO-ECONOMIC STUDIES

S.D. Belov1,2,a, A.V. Ilina1, J.N. Javadzade1,3, I.S. Kadochnikov1,2, V.V. Korenkov1,2, I.S. Pelevanyuk1,2, V.A. Tarabrin2, P.V. Zrelov1,2 and R.N. Semenov1,2

1 Joint Institute for Nuclear Research, 6 Joliot-Curie st., Dubna, 141980, Russia
2 Plekhanov Russian University of Economics, Stremyanny lane 36, Moscow, 117997, Russia

E-mail: a belov@jinr.ru

Having originated in the natural sciences, the high demand for analyzing vast amounts of complex data has reached research areas such as economics and the social sciences. Big Data methods and technologies provide new efficient tools for research. In this paper, we discuss the main principles and architecture of a digital analytical platform aimed at supporting socio-economic applications. Integrating specific open-source solutions, the platform is intended to cover the full cycle of data analysis and machine learning experiments, from data gathering to visualization. One of the system's primary goals is to combine the advantages of cloud and distributed computing and GPU accelerators with Big Data analysis techniques. The authors present the approach to building the platform from low-level services, such as storage, virtual infrastructure, and pass-through authentication, up to data flow processing, analysis experiments, and the representation of results.

Keywords: Big Data platform, socio-economic studies, machine learning

Sergey Belov, Anna Ilina, Javad Javadzade, Ivan Kadochnikov, Vladimir Korenkov, Igor Pelevanyuk, Roman Semenov, Vitaliy Tarabrin, Petr Zrelov

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction

Big Data analysis processes in different areas, despite some peculiarities, are quite similar. Analytics solutions and methods are widely used and can be applied successfully in various fields of science. A platform-based approach to creating a software and hardware environment seems promising: it combines basic infrastructure components common to the information flows of all classes of tasks with specialized services that improve the characteristics (for example, speed or quality) of the scientific and practical results obtained in a particular area.

A generalized architecture of an automated analytical system was proposed to solve problems that require both streaming and batch processing of large amounts of data or that have great internal complexity, including implicit connections. To build each functional level of the platform, open-source software products were selected, primarily from the Big Data technology stack.

2. Labour market monitoring project

The prospects for the "digitalization" of economic processes have recently been actively discussed. This is a challenging task that cannot be solved within the framework of classical methods. This paper illustrates the prospects for the qualitative development of these methods with the example of using big data analytics and text mining to assess labor demand in regional labor markets. Another critical issue is the study of the interaction between the labor market and the vocational education system [1]. The problem was addressed using the automated information system developed by the authors for monitoring how well the training level of specialists matches the personnel needs of employers. The system collects its information from open sources.
The presented system for monitoring and forecasting the situation in the labor market and analyzing staffing needs creates additional opportunities for identifying qualitative and quantitative relationships between the education sector and the labor market. It is designed for a wide range of users: authorities and administrations of regions and municipalities; the management of universities, companies, and recruiting agencies; and university graduates. It is intended primarily for the heads of regions, universities, companies, and recruitment agencies.

It is expected that the project will provide a closer connection between the country's education system and the labor market, make it possible to adjust curricula and to open new educational programs or revise existing ones in accordance with the country's economic goals, and allow regions to implement effective recruitment and training. Later, the system is expected to become a useful tool for young professionals who are starting to look for a job in their chosen profession, as well as for those choosing a profession.

The following Internet resources are used as sources of data on vacancies: the portal "Work in Russia" (the information site of the Russian Labor Agency) and the portals of the staffing companies HeadHunter and SuperJob. In addition, the register of approved professional standards and the Federal state educational standards of higher professional education are used as guiding documents [2]. How well job advertisements reflect the real needs of the market is the subject of a separate study.

The implemented prototype of the automated information system is a web application with an intuitive user interface and reliable data storage.
The system is built on a modular basis. First, there is a text data collection module that works automatically on open sources (Internet portals and recruiting agencies). Second, there is a data loading and storage module built on a distributed data store that provides replication and archiving. Third, there is an automatic processing module that prepares information for analysis, automatic linking of requirements and competencies, and machine learning. Fourth, there are user interfaces for generating and displaying reports based on business intelligence technologies.

Basic information about the state of the labor market is obtained by analyzing the database of collected vacancies. To obtain correct statistics, the following tasks must be solved first:
• Search for duplicate vacancies. Job advertisements can be duplicated even within a single source, and such checks become essential when multiple sources are used.
• Classification of vacancies by industry.
• Analysis of the content of the job offer, including individual requirements for skills and competencies.

The data processing scheme is shown in [fig. 1].

Figure 1. Data processing for labour market analysis project

3. Analysis of links between companies

The project [3] aims to create a database of companies and company-related data, together with an automated analytical system based on these data. The system will allow credit institutions to obtain information on relationships between companies and to pursue the "Know Your Client" policy: to identify the ultimate beneficiaries, assess risks, and identify relationships between clients.
This may be required for banks to comply with the requirements of national authorities, laws on offshore tax evasion and FATCA, and the recommendations of the Financial Action Task Force on Money Laundering (FATF) and the Basel Committee on Banking Supervision.

Some existing projects, such as OpenCorporates [4], maintain global databases of companies collected from many jurisdictions. However, they do not cover all national registries or other helpful data sources (courts, customs, press, etc.). In addition, existing services have a rather limited ability to find relationships between companies, which are not always straightforward. The project presented here aims to overcome the main shortcomings of this kind.

There are over 150 million companies worldwide. With information about each company coming from many sources, there is no reasonable way to process it other than with big data technologies. We use such technologies in our research along with machine learning and graph databases.

To identify affiliated companies, an analysis of indirect indicators is used in addition to the direct comparison of relationships through founders and owners. We consider companies that match in several positions: fragments of the name, officers, founders, registration address, contact information, owners, subsidiaries, historical ties, similarities in company names and profiles, etc. Previously found relationships are also used. Discovered connections between companies are stored in a graph database whose records describe both companies and other types of objects (officials, founders, registration addresses, contact information). This approach allows for more flexible link analysis and complex search queries.
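The matching idea described above can be illustrated with a minimal sketch. It assumes hypothetical record fields (`name`, `address`, `phone`, `officers`, `founders`) and made-up sample data, and it is not the project's implementation. A plain dictionary stands in for the graph database: every attribute value becomes an object node linked to the companies that share it, and a pair of companies is reported as a candidate link when it matches in at least `min_shared` kinds of attributes (an assumed threshold).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; field names and values are illustrative only.
COMPANIES = [
    {"id": "c1", "name": "Alfa Trade LLC", "address": "1 Main St, Springfield",
     "phone": "+1-555-0100", "officers": {"A. Smith"}, "founders": {"B. Jones"}},
    {"id": "c2", "name": "Alfa Logistics LLC", "address": "1 Main St, Springfield",
     "phone": "+1-555-0199", "officers": {"C. Brown"}, "founders": {"B. Jones"}},
    {"id": "c3", "name": "Gamma Foods Ltd", "address": "9 Oak Ave, Shelbyville",
     "phone": "+1-555-0100", "officers": {"D. White"}, "founders": {"E. Green"}},
]

# Legal-form tokens carry no identifying information, so they are ignored (assumed list).
LEGAL_FORMS = {"llc", "ltd", "inc"}


def name_fragments(name):
    """Split a company name into lowercase tokens, dropping legal forms."""
    return {token for token in name.lower().split() if token not in LEGAL_FORMS}


def build_graph(companies):
    """Map each (kind, value) object node to the ids of companies attached to it."""
    graph = defaultdict(set)
    for company in companies:
        cid = company["id"]
        graph[("address", company["address"])].add(cid)
        graph[("phone", company["phone"])].add(cid)
        for person in company["officers"]:
            graph[("officer", person)].add(cid)
        for person in company["founders"]:
            graph[("founder", person)].add(cid)
        for token in name_fragments(company["name"]):
            graph[("name_fragment", token)].add(cid)
    return graph


def candidate_links(graph, min_shared=2):
    """Return company pairs that match in at least `min_shared` kinds of attributes."""
    shared = defaultdict(set)
    for (kind, _value), ids in graph.items():
        # Every pair of companies attached to the same object node shares this kind.
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)].add(kind)
    return {pair: kinds for pair, kinds in shared.items() if len(kinds) >= min_shared}


if __name__ == "__main__":
    links = candidate_links(build_graph(COMPANIES))
    for (a, b), kinds in sorted(links.items()):
        print(a, b, sorted(kinds))
```

With this sample data only the pair `c1`, `c2` is reported, because the two records share an address, a founder, and a name fragment, while the single phone number shared by `c1` and `c3` stays below the threshold. Keeping persons, addresses, and contacts as nodes of their own is what lets new kinds of indicators be added without changing the pair search.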
The Neo4j graph database [5] is used to store and analyze the identified links. It also allows one to visualize the relationships as graphs using built-in tools.

4. Analytical platform

A generalized architecture of an automated analytical system was proposed to solve problems that require both streaming and batch processing of large amounts of data or that have great internal complexity, including implicit connections. To build each functional level of the platform, a set of open-source software products was selected, primarily from the Big Data technology stack. The architecture of the proposed solution is shown in [fig. 2].

[Figure 2: layered diagram. Labels recoverable from the figure: Services and interfaces (intelligent analysis and reporting, business intelligence, visualization, API, task-specific services: problem-oriented and utility); Data taking and processing (data management, processing management, Big Data processing system with machine learning, stream processing, and batch processing); Distributed storage (data lake, intermediate storage, in-memory databases, main storage); External data sources; Infrastructure (cloud resources, grid, hardware accelerators, system services).]

Figure 2. General scheme of the analytical platform

The platform is based on open-source software solutions. Its modular structure allows particular components to be replaced if needed. The chosen packages are shown in Table 1 below.

Table 1. Software stack of the platform.
| Layer | Software packages |
|---|---|
| Visualization and system interfaces | Zeppelin, Jupyter (user interface); Grafana (reporting and graphical presentation of results); KrakenD (organization of software gateways for various components) |
| Distributed Big Data analytics | Apache Kylin |
| Computational experiments in ML | MLflow |
| In-memory computations | Apache Spark, Dask, Hadoop |
| Organization of data flow management and data collection | Apache Kafka, Apache Flume, Apache Airflow, Celery, Scrapy |
| Data vaults and specialized databases | Ceph, NFS (file storage and access); Elasticsearch (structured data indexing and analysis); Apache Ignite (in-memory database for fast access and caching) |
| Data lake | Apache Calcite (dynamic data management and integration) |
| Authentication and pass-through authorization, security | FreeIPA, Vault |
| Computing infrastructure, resource management | OpenNebula, Kubernetes, Docker, Puppet, Git |

5. Conclusion

Based on experience with big data technologies, a scheme of an analytical platform for socio-economic research has been proposed. In addition, open-source software has been selected for building a modular analytical platform that allows Big Data to be analyzed using machine learning and hardware accelerators.

6. Acknowledgement

The study was supported by the Russian Science Foundation grant (project No. 19-71-30008).

References

[1] A. Wolf, Review of Vocational Education: The Wolf Report, 2011.
[2] Professional standards in Russia. Available at: http://profstandart.rosmintrud.ru
[3] L.A. Badalov et al., Checking foreign counterparty companies using Big Data, CEUR Workshop Proceedings, 2018, vol. 2267, pp. 523–527.
[4] OpenCorporates: The Open Database Of The Corporate World. Available at: https://opencorporates.com/
[5] Neo4j graph database. Available at: https://neo4j.com/