Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019) Budva, Becici, Montenegro, September 30 – October 4, 2019

BIG DATA TECHNOLOGIES FOR LABOUR MARKET ANALYSIS

S.D. Belov1,2 a, J.N. Javadzade1,2 b, I.S. Kadochnikov1,2 c, V.V. Korenkov1,2 d, P.V. Zrelov1,2 e

1 Joint Institute for Nuclear Research, 6 Joliot-Curie St, Dubna, Moscow Region, 141980, Russia
2 Plekhanov Russian University of Economics, 36 Stremyanny per, Moscow, 117997, Russia

E-mail: a belov@jinr.ru, b jjavadzade@yandex.ru, c kadivas@jinr.ru, d korenkov@jinr.ru, e zrelov@jinr.ru

This paper discusses approaches to intelligent text analysis applied to automated monitoring of the labour market. A scheme for building an analytical system for the labour market based on Big Data technologies is proposed. Combinations of methods for extracting semantic information about objects and the connections between them (for example, from job advertisements) from specialized texts were compared. A system for monitoring the Russian labour market has been created, and work is underway to include other countries in the analysis. The approaches and methods considered can be widely applied to extract knowledge from large volumes of text.

Keywords: text analysis, Big Data, labour market monitoring

Sergey Belov, Javad Javadzade, Ivan Kadochnikov, Vladimir Korenkov, Petr Zrelov

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Previous work

Recently, the prospects of the "digitalization" of economic processes have been actively discussed. This is an extremely difficult task that has no solution within the framework of traditional methods.
In the article, the prospects for their qualitative development are illustrated by the example of using Big Data analytics and text mining to assess the labour force needs of regional labour markets. Another important question is the study of the interaction between the labour market and the professional education system [1]. The problem was addressed using an automated information system developed by the authors for monitoring how well employers' personnel needs match the level of specialist training. Open sources served as the information base for data collection. The presented system creates additional opportunities for identifying qualitative and quantitative relations between the education sector and the labour market. It is aimed at a wide range of users: authorities and administrations of regions and municipalities; management of universities, companies and recruitment agencies; and university graduates.

In previous work [2] we described the approaches and the prototype of the labour market monitoring system. Now, having enough real data from the market, it is possible to perform a more elaborate analysis, allowing a more detailed understanding of market requirements.

2. Data processing infrastructure

Every day, about 1.5 million active vacancies are updated and must be analysed and stored. In order to track the dynamics of indicators and lay the basis for forecasting the state and needs of the labour market, we need effective storage, intelligent analysis and visualization of vacancy data over the maximum available time span (we consider data starting from 2015, i.e. about 5 years at the moment). The system was therefore based on Big Data technologies. First of all, the following freely distributed software products were used: Spark, Hadoop, Kafka, Flume, Marathon, Chronos and Docker. The data processing scheme is presented in Fig. 1.

Figure 1.
Infrastructure and data flow scheme

There are three main data sources (job-seeking portals): trudvsem.ru, hh.ru and sj.ru. Information on job offers is publicly available there, and custom collectors were developed to fetch the data. A lambda architecture was implemented to support the processing. The analysis should be as fast as possible, so in-memory processing is the main technology used. The key part of the infrastructure is an Apache Spark [3] cluster.

3. Job offers analysis and classification

Basic information about the state of the labour market is obtained by analysing the database of collected vacancies. To obtain correct statistics, the following tasks have to be solved first:
1) Detection of duplicate vacancies. Even with a single source, job ads can be duplicated; with multiple sources, such checks are indispensable.
2) Classification of vacancies by branches of professional activity.
3) Analysis of the job offer content, including the analysis of individual requirements for skills and competencies.

The need to remove identical vacancies arises because our sample is compiled from several sources, and on each source the same vacancy can be republished repeatedly at some time interval. So that the data of the same vacancy are not processed several times, it was decided to implement a search for identical and similar vacancies with subsequent removal of duplicates. Besides a direct comparison by employer name, job title and address, it is necessary to take into account that the job title and the content of the ad may change on republication, or the information may simply be written in a slightly different way.
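As an illustration, duplicate detection of this kind can be sketched in a few lines of Python. The Vacancy fields, the normalization, the use of a fuzzy string ratio for titles and the 0.85 threshold below are illustrative assumptions for a minimal sketch, not the system's actual implementation:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Vacancy:
    # Hypothetical minimal record; the real system stores many more fields
    employer: str
    title: str
    address: str


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial reposting edits do not matter."""
    return " ".join(text.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Fuzzy similarity of job titles (0..1), tolerant to small rewordings
    that occur when an ad is republished."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(v1: Vacancy, v2: Vacancy, threshold: float = 0.85) -> bool:
    """Treat a pair as the same vacancy if employer and address match exactly
    after normalization and the titles are sufficiently similar."""
    return (
        normalize(v1.employer) == normalize(v2.employer)
        and normalize(v1.address) == normalize(v2.address)
        and title_similarity(v1.title, v2.title) >= threshold
    )
```

For example, "Python Developer" and "Python  developer" posted by the same employer at the same address would be collapsed into one record, while a different job title from the same employer would not.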
Previously, to compare the meaning of text fields, we compared vector representations of the texts in a semantic space (using the word2vec approach). Further, to make the analysis more specific, it was necessary to distinguish words and expressions characteristic of certain professions and fields of activity. For this purpose, the statistical measure TF-IDF [4] (term frequency - inverse document frequency) was used, which estimates the importance (weight) of a particular word (term) in a document in the context of the whole collection (corpus).

Since the data from hh.ru and superjob.ru are already structured, they can be used as training data for a kind of multi-label classification [5]. That is, there is an initial sample of about one million labelled records to work with. The next step is to extract only the data necessary for classification: the duties and requirements, as they contain the basic information about the job, and the list of professional areas and specializations to which the job belongs. After preprocessing (stop-word removal, tokenization and lemmatization of the text), everything needed for the classification of a vacancy is available. Job offers were then classified by professional areas and required competencies. For the classification, a neural network implementation from the scikit-learn library was trained and used. Once jobs are classified, it also becomes easier to find identical records in the database.

4. Conclusions

A system for monitoring the Russian labour market has been created, and work is underway to include other countries in the analysis. The approaches and methods considered can be widely applied to extract knowledge from large volumes of text (they work well on text data of terabyte-scale volume).
Using Big Data technologies together with statistical methods and machine learning techniques allowed us to significantly accelerate the analysis and complete it in a time reasonable for researchers.

5. Acknowledgment

The work was supported by the Russian Foundation for Basic Research (RFBR), grant 18-07-01359 "Development of an information-analytical system of monitoring and analysis of labour market's needs for graduates of Universities on the basis of Big Data analytics".

References

[1] Azmuk N. The interaction of labour markets and higher education in the context of digital technology // Economic Annals-XXI, 2015, Vol. 7-8(1), pp. 98-101.
[2] Belov S., Filozova I., Kadochnikov I., Korenkov V., Semenov R., Smelov P., Zrelov P. Labour market monitoring system // CEUR Workshop Proceedings, ISSN 1613-0073, Vol. 2267, pp. 528-532.
[3] Armbrust M., Xin R.S., Lian C., Huai Y., Liu D., Bradley J.K., Meng X., Kaftan T., Franklin M.J., Ghodsi A., Zaharia M. Spark SQL: Relational Data Processing in Spark // Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15), ACM, New York, NY, USA, 2015, pp. 1383-1394. DOI: https://doi.org/10.1145/2723372.2742797
[4] Needham M. scikit-learn: TF/IDF and cosine similarity for computer science papers // Available at: https://markhneedham.com/blog/2016/07/27/scitkit-learn-tfidf-and-cosine-similarity-for-computer-science-papers/ (accessed 01.12.2019)
[5] Schulz R. Performing Multi-label Text Classification with Keras // Available at: https://blog.mimacom.com/text-classification/ (accessed 01.12.2019)