<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ANALYTICAL PLATFORM FOR SOCIO-ECONOMIC STUDIES</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>S.D. Belov</string-name>
          <email>belov@jinr.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.V. Ilina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J.N. Javadzade</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>I.S. Kadochnikov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.V. Korenkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>I.S. Pelevanyuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.A. Tarabrin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>P.V. Zrelov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R.N. Semenov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Joint Institute for Nuclear Research</institution>
          ,
          <addr-line>6 Joliot-Curie st., Dubna, 141980</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Plekhanov Russian University of Economics</institution>
          ,
          <addr-line>Stremyanny lane 36, Moscow, 117997</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sergey Belov</institution>
          ,
          <addr-line>Anna Ilina, Javad Javadzade, Ivan Kadochnikov, Vladimir Korenkov, Igor Pelevanyuk, Roman Semenov, Vitaliy Tarabrin, Petr Zrelov</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>Having originated in the natural sciences, the high demand for analyzing vast amounts of complex data has reached research areas such as economics and the social sciences. Big Data methods and technologies provide new, efficient tools for research. In this paper, we discuss the main principles and architecture of a digital analytical platform aimed at supporting socio-economic applications. Integrating specific open-source solutions, the platform is intended to cover the full cycle of data analysis and machine learning experiments, from data gathering to visualization. One of the system's primary goals is to combine the advantages of cloud and distributed computing and GPU accelerators with Big Data analysis techniques. The authors present an approach to building the platform from low-level services, such as storage, virtual infrastructure, and pass-through authentication, up to data flow processing, analysis experiments, and presentation of results.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data platform</kwd>
        <kwd>socio-economic studies</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The processes of Big Data analysis in different areas are, despite some peculiarities, quite
similar. Analytics solutions and methods developed in one field can be applied successfully in various
other fields of science. A platform-based approach to creating a software and hardware environment
therefore seems promising: such an environment contains both basic infrastructure components common
to the information flows of all classes of tasks and specialized services that improve the
characteristics (for example, the speed or quality) of the scientific and practical results obtained
in a particular area. A generalized architecture of an automated analytical system was proposed to
solve problems that require both streaming and batch processing of large amounts of data or that have
great internal complexity, including implicit connections. To build each functional level of the
platform, open-source software products were selected, primarily from the Big Data technology stack.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Labour market monitoring project</title>
      <p>Recently, the prospects for the "digitalization" of economic processes have been actively
discussed. This is a challenging task that cannot be solved within the framework of classical methods.
This article illustrates the prospects for the qualitative development of these methods with the
example of using big data analytics and text mining to assess labor demand in regional labor markets.
Another critical issue is the study of the interaction between the labor market and the vocational
education system [1]. The problem was solved using the automated information system developed by the
authors for monitoring how well the level of training of specialists matches the personnel needs of
employers. The information is collected from open sources. The presented system creates additional
opportunities for identifying qualitative and quantitative relationships between the education sector
and the labor market. It is aimed at a wide range of users: authorities and administrations of regions
and municipalities; management of universities, companies, and recruiting agencies; and university
graduates.</p>
      <p>The purpose of introducing information systems for monitoring and forecasting the situation in
the labor market and analyzing staffing needs is to provide additional opportunities for identifying
qualitative and quantitative relationships between education and the labor market. The system is
designed for a wide range of users and is intended primarily for the heads of regions, universities,
companies, and recruitment agencies. It is expected that the project will provide a closer connection
between the country's education system and the labor market, make it possible to adjust curricula and
to open new educational programs or adjust existing ones in accordance with the country's economic
goals, and allow regions to implement effective recruitment and training. It is further expected that
the system will become a useful tool for young professionals who are starting to look for a job in
their chosen profession, as well as for those choosing a profession.</p>
      <p>The following Internet resources are used as sources of data on vacancies: the portal "Work in
Russia" (the information site of the Russian Labor Agency) and the portals of the staffing companies
HeadHunter and SuperJob. In addition, the register of approved professional standards and the Federal
state educational standards of higher professional education are used as guiding documents [2]. How
well job advertisements reflect the real needs of the market is the subject of a separate study.</p>
      <p>The implemented prototype of an automated information system is a web application with an
intuitive user interface that provides reliable data storage.</p>
      <p>The system is built on a modular basis and consists of four modules. The first is a text data
collection module that works automatically with open sources (Internet portals and recruiting
agencies). The second is the data loading and storage module, built on a distributed data store that
provides replication and archiving. The third is an automatic processing module that prepares
information for analysis, for the automatic linking of requirements and competencies, and for machine
learning. The fourth comprises the user interfaces for generating and displaying reports based on
business intelligence technologies.</p>
      <p>Basic information about the state of the labor market is obtained by analyzing the database of
collected vacancies. To obtain correct statistics, it is necessary, first of all, to solve the
following tasks:</p>
      <list list-type="bullet">
        <list-item>
          <p>Search for duplicate vacancies. Job advertisements can be duplicated even within a single
source, and such checks become essential when multiple sources are used.</p>
        </list-item>
        <list-item>
          <p>Classification of vacancies by industry.</p>
        </list-item>
        <list-item>
          <p>Analysis of the content of the job offer, including analysis of individual requirements for
skills and competencies.</p>
        </list-item>
      </list>
      <p>The data processing schema is shown in [fig. 1].</p>
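      <p>The duplicate search listed above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that near-duplicates are detected by Jaccard similarity over word 3-grams (shingles), a common text deduplication technique that the paper does not name. The vacancy identifiers, sample texts, function names, and the 0.8 threshold are all hypothetical.</p>
      <preformat>
```python
import re
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a lower-cased, punctuation-free text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) >= k:
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    # Texts shorter than k words are represented by a single shingle.
    return {" ".join(words)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a.intersection(b)) / len(union) if union else 0.0


def find_duplicates(vacancies: dict, threshold: float = 0.8) -> list:
    """Return pairs of vacancy ids whose texts are near-duplicates."""
    sets = {vid: shingles(text) for vid, text in vacancies.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(sets), 2)
        if jaccard(sets[a], sets[b]) >= threshold
    ]


# Hypothetical vacancies collected from different portals.
vacancies = {
    "hh-1": "Data analyst. Requirements: SQL, Python, statistics.",
    "sj-7": "Data Analyst! Requirements - SQL, Python, statistics",
    "tr-3": "Truck driver, category C license, shift work.",
}
duplicates = find_duplicates(vacancies)
print(duplicates)
```
      </preformat>
      <p>Comparing all pairs is quadratic in the number of vacancies, so at the scale of a full vacancy database an approximate scheme such as MinHash with locality-sensitive hashing would likely be needed.</p>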
    </sec>
    <sec id="sec-3">
      <title>3. Analysis of links between companies</title>
      <p>The project [3] aims to create a database of companies and company data, together with an
automated analytical system based on these data. The system will allow credit institutions to obtain
information on relationships between companies and to pursue the "Know Your Client" policy: to
identify the ultimate beneficiaries, assess risks, and identify relationships between clients. Banks
may need this in order to comply with the requirements of national authorities, laws on offshore tax
evasion and FATCA, and the recommendations of the Financial Action Task Force on Money Laundering
(FATF) and the Basel Committee on Banking Supervision. At the moment, there are some projects, such as
OpenCorporates [4], that maintain global databases of companies collected from many jurisdictions.
However, they do not cover all national registries or other helpful data sources (courts, customs,
press, etc.). In addition, existing services have a relatively limited ability to find relationships
between companies, which are not always straightforward. The project we present aims to overcome the
main shortcomings among these. The number of companies worldwide is over 150 million. With information
about each company coming from many sources, there is no reasonable way to process it other than with
big data technologies. We use such technologies in our research along with machine learning and graph
databases.</p>
      <p>To identify the affiliation of companies, an analysis of indirect indicators is used in addition
to the direct comparison of relationships through founders and owners. We consider companies that
match on several attributes: fragments of names, officers, founders, registration address, contact
information, owners, subsidiaries, historical ties, similarities in company names and profiles, etc.
Previously found relationships are also used. Discovered information about connections between
companies is stored in a graph database, which holds records both about companies and about other
types of objects (officers, founders, registration addresses, contact information). This approach
allows for more flexible link analysis and complex search queries. The graph database Neo4j [5] is
used to analyze and store the identified links. This database also allows one to visualize graph
relationships using built-in tools.</p>
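      <p>The idea of storing companies together with other object types and then searching for companies that match on several attributes can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation and not the Neo4j API: the company records, attribute types, function names, and the min_shared threshold are hypothetical, and an in-memory dictionary stands in for the graph database.</p>
      <preformat>
```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: company name mapped to (attribute type, value) pairs.
companies = {
    "A Ltd": [("officer", "J. Smith"), ("address", "1 Main St"), ("phone", "555-0100")],
    "B LLC": [("officer", "J. Smith"), ("address", "1 Main St")],
    "C Inc": [("officer", "P. Jones"), ("address", "9 Oak Ave")],
}


def build_graph(records: dict) -> dict:
    """Map each attribute node to the set of companies linked to it."""
    graph = defaultdict(set)
    for company, attributes in records.items():
        for attribute in attributes:
            graph[attribute].add(company)
    return graph


def candidate_links(graph: dict, min_shared: int = 2) -> dict:
    """Return company pairs that share at least min_shared attribute nodes."""
    shared = defaultdict(list)
    for attribute, linked in graph.items():
        # Every pair of companies attached to the same node shares that attribute.
        for pair in combinations(sorted(linked), 2):
            shared[pair].append(attribute)
    return {pair: attrs for pair, attrs in shared.items() if len(attrs) >= min_shared}


links = candidate_links(build_graph(companies))
for pair, attrs in links.items():
    print(pair, attrs)
```
      </preformat>
      <p>Treating officers, addresses, and contacts as nodes of their own, rather than as fields of a company record, is what lets a single traversal find companies connected through any shared object. A graph database expresses the same search as a pattern query over such nodes.</p>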
    </sec>
    <sec id="sec-4">
      <title>4. Analytical platform</title>
      <p>A generalized architecture of an automated analytical system was proposed to solve problems
requiring both streaming and batch processing of large amounts of data or having great internal
complexity, including implicit connections. To build each functional level of the platform, a set of
open-source software products was selected, primarily from the Big Data technology stack. The
architecture of the proposed solution is shown in [fig. 2].</p>
      <p>[Figure 2. Architecture of the proposed solution. Labels appearing in the figure: intelligent analysis and reporting, visualization, business intelligence, API, machine learning, Big Data processing, stream processing, batch processing, in-memory databases, data management, processing management, distributed storage, main storage, intermediate storage, data lake, external data sources, infrastructure, cloud resources, grid, hardware accelerators, services and interfaces, system services, task-specific services (problem-oriented, utility/system).]</p>
      <p>The platform is based on open-source software solutions. Its modular structure allows
particular components to be replaced if needed. The chosen packages are shown in Table 1 below.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Open-source packages chosen for the platform</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Functional level</th>
              <th>Software</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Visualization and system interfaces</td>
              <td>Zeppelin, Jupyter (user interface); Grafana (reporting and graphical presentation of results); KrakenD (organization of software gateways for various components)</td>
            </tr>
            <tr>
              <td>Distributed Big Data analytics; in-memory computations</td>
              <td>Apache Kylin; Apache Spark, Dask, Hadoop</td>
            </tr>
            <tr>
              <td>Computational experiments in ML</td>
              <td>MLflow</td>
            </tr>
            <tr>
              <td>Organization of the process of data flow management and data collection</td>
              <td>Apache Kafka, Apache Flume, Apache Airflow, Celery, Scrapy</td>
            </tr>
            <tr>
              <td>Data vaults and specialized databases</td>
              <td>CEPH, NFS (file storage and access); Elasticsearch (structured data indexing and analysis); Apache Ignite (in-memory database for fast access and caching); Russian Data Lake; Apache Calcite (dynamic data management and integration)</td>
            </tr>
            <tr>
              <td>Authentication and pass-through authorization, security</td>
              <td>FreeIPA, Vault</td>
            </tr>
            <tr>
              <td>Computing infrastructure, resource management</td>
              <td>OpenNebula, Kubernetes, Docker, Puppet, Git</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Based on the experience of using big data technologies, a schematic of an analytical platform
for socio-economic research has been proposed. In addition, open-source software has been selected
for building a modular analytical platform that allows Big Data to be analyzed using machine
learning and hardware accelerators.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgement</title>
      <p>[...] 19-71-30008).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>[1]</label>
        <mixed-citation>
          <string-name>
            <surname>Wolf</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <source>Review of Vocational Education: The Wolf Report</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <label>[2]</label>
        <mixed-citation>Professional standards in Russia [Web resource]. Available at: http://profstandart.rosmintrud.ru</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>[3]</label>
        <mixed-citation>
          <string-name>
            <surname>Badalov</surname>
            <given-names>L.A.</given-names>
          </string-name>
          et al.,
          <article-title>Checking foreign counterparty companies using Big Data</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2018</year>
          , vol.
          <volume>2267</volume>
          , pp.
          <fpage>523</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[4]</label>
        <mixed-citation>OpenCorporates. Available at: https://opencorporates.com/</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[5]</label>
        <mixed-citation>
          <article-title>Neo4j graph database</article-title>
          . Available at: https://neo4j.com/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>