Applying Natural Language Processing Models to Create
Recommendations for Professional Skills Development

Katsiaryna Kosarava 1
1
    Yanka Kupala State University of Grodno, 22 Ozheshko str., Grodno, 230023, Belarus


                 Abstract
                 The article examines the problem of applying intellectual methods and models for the
                 automatic extraction of professional skills and competencies from the descriptions of academic
                 disciplines in Russian. The accuracy of pre-trained multilingual models of distributed vector
                 representation LASER and SBERT is investigated. The process of collecting and preprocessing
                 a dataset for creating and training skills classification models is described. A neural network
                 model of skills classification based on a bidirectional recurrent neural network with long-short-
                 term memory is described and trained. It was trained on data prepared two different ways: 1)
                 skills were separated on “soft skills” and “hard skills”, the last one was not translated into
                 Russian; 2) “soft skills” and “hard skills” were treated as one class and were translated into
                 Russian. Learning curves on validation data are presented. They show high accuracy of trained
                 models up to 99%. The test accuracy of the models was analyzed employing confusion matrices
                 and the results show a fairly high performance of the model. an example of skills extraction
                 from an academic discipline using trained models is given.

                 Keywords 1
                 Professional skills, academic disciplines, natural language processing, word2vec, LASER,
                 SBERT, neural network

1. Introduction

    The sphere of education and training is one of the most conservative, and often training programs
have not changed for decades. However, rapidly developing technologies and the development of new
sectors of the modern economy require a specialist to master new skills that are not provided for in
classical higher education. Thus, the main problem of education is the actualization of educational
specialties and disciplines, by the needs of the modern labor market. Understanding what skills will be
in demand shortly and what disciplines will ensure the development of these skills will allow students
to form a plan for their professional growth. However, educational standards and curricula do not
contain articulated skills. As in other industries, artificial intelligence has the potential to optimize the
process of finding basic information in unstructured text and performing labor-intensive tasks for people
[1, 2]. There are several ways in which artificial intelligence helps to match skills and their
corresponding directions in work and learning. The article demonstrates the use of neural network
models to search for professional skills in the descriptions of academic disciplines. This will help to
optimize the choice of suitable disciplines by students, and also thanks to this approach, the picture of
the relevance and demand for certain disciplines will be as plausible as possible.
    The purpose of this article is to describe the application of distributed vector representation models
to compare educational disciplines and skills that a student can master while studying them, as well as
a comparative analysis of the accuracy of these models.

      Proceedings of VI International Scientific and Practical Conference Distance Learning Technologies (DLT–2021), September 20-22,
2021, Yalta, Crimea
EMAIL: koluzaeva@gmail.com
ORCID: 0000-0001-7326-5307
              ©️ 2021 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                 69
2. Distributed vector representation models
    In text mining, the first step is to convert the text to a form that is understandable for the machine
learning model. The most famous approaches are "one-hot encoding", encoding each word with a
unique number, word embeddings. All of these approaches are widely known and are described in detail
in the literature [3].

2.1 Word Embeddings

    The “Word embeddings” approach not only allows you to “encode text”, but also makes it possible
to use an efficient dense representation in which similar words have a “similar” encoding. The result of
inlining is a dense vector of floating-point values (the length of the vector is a parameter you specify,
which does not depend on the size of the dictionary). Instead of specifying the values to be embedded
manually, they are trainable parameters (weights obtained by the model during training, just like the
model learns weights for a dense layer) [4]. When working with large datasets, it is common to see
word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensional. Higher-
dimensional embedding can capture fine-grained relationships between words but requires more data
to explore.
    This method allows you to capture the context of a word in a document, semantic and syntactic
similarity, relationships with other words, etc. Word embedding helps in achieving such goals as
identifying similar words, semantic grouping, which groups things with the same characteristics
together and dissimilarly far away; text classification: the text is matched with arrays of vectors, which
are passed to the model for training and predicting; clustering of documents, etc. Some examples in
which vectorized words can be used directly include applications for synonym creation, autocorrect,
and predictive text input. Similar types of methods are used to perform fuzzy searches on Google and
similar search tools with an almost infinite number of internal search capabilities that can be applied to
directories and databases of organizations. In addition, since embeddings usually have good behavior,
you can also perform arithmetic operations on vectors. This allows for unique operations where
embeddings not only capture similarities between words but also encode higher-level concepts. The
embeddings also allow logical analogies. For example, Rome is for Italy, like Beijing is for China -
word embeddings can use such analogies and directly give believable answers.
    Finally, almost all other modern architectures now use some form of embedding layer and language
model as the first step in performing subsequent NLP tasks. These downstream tasks include document
classification, named entity recognition, question and answer systems, language generation, machine
translation, and more.

2.2 Word2Vec

    An example of models implementing the word embedding approach is the word2vec model.
Word2vec is a well-known concept used to generate vector representations from words [5]. Word2vec
is a three-layer neural network that processes text by “vectorizing” words. Its input is a text corpus, and
its output is a set of vectors that represent words in that corpus, Figure 1. The purpose and usefulness
of word2vec are to group vectors of identical words in a vector space, similar words are represented by
vectors that are close in space. However, the word2vec model is not multilingual. For this reason, the
results can be significantly distorted depending on the language used, as well as in the presence of
untranslatable terminology (names of manufacturers, technologies, etc.).
    Word2Vec uses 2 algorithms: Continuous Bag of Words (CBoW) and Skip-gram model. CBoW
creates a sliding window around the current word to predict it from the used "context" of the surrounding
words. Each word is presented as a feature vector containing information describing the most important
characteristics of the object. After training, these vectors become word vectors. Vectors that represent
the same words are close in different distance metrics and additionally include numerical ratios.


                                                    70
   Skip-gram is the opposite of CBoW. Instead of predicting one word at a time, it uses one word to
predict all the surrounding words ("context"). Skip-gram is much slower than CBoW, but is considered
more accurate for rare words.


Figure 1: Word2Vec model


2.3 Large Pre-Trained Models for NLG

    With the development of deep learning, various neural network architectures have come to be widely
used to solve NLP problems. One of the main advantages of deep neural models is their ability to design
features [6].
    Pre-trained models (PTMs) can learn generic language representations in a large corpus, which is
useful for subsequent NLP tasks and avoids learning a new model from scratch. First-generation PTMs
are aimed at learning good embeddings for Skip-Gram [5] and GloVe [Jeffrey Pennington, Richard
Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP,
2014]. Second-generation PTMs are focused on learning contextual word embeddings such as ELMo
[Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and
Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018], OpenAI GPT
[Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training. 2018. URL: https://s3-us-west-2.amazonaws. com/openai-
assets/researchcovers/languageunsupervised/ languageunderstandingpaper.pdf], and BERT [Jacob
Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding. In NAACL-HLT, 2019]. These models can be
used to represent words in the context of subsequent tasks.
    LASER and Sentence BERT (SBERT) models are alternatives to word2vec. The main advantages
of both models are that they are initially pre-trained on large amounts of data, and also support many
different languages besides English.
    LASER (Language-Agnostic Sentence Representations) [7] is a library developed by Facebook for
calculating and using multilingual word embeddings. This model is trained in 93 languages, written in
23 different alphabets. This includes all European languages, many Asian and Indian languages, Arabic,
Persian, Hebrew, and various minority languages and dialects. The sentence encoder of this model also
supports language switching, that is, the same sentences can contain words in several different
languages.
    BERT[8] and RoBERTa [9] have established new leading-edge performance in sentence pair
regression problems such as semantic textual similarity. However, this requires both proposals to be
submitted to the network, which is computationally expensive: it takes about 50 million computations
(~ 65 hours) with BERT to find the most similar pair in a collection of 10,000 sentences. The BERT
construct makes it unsuitable for semantic similarity searches as well as unsupervised tasks such as
clustering. The Sentence-BERT (SBERT)[10] model replaced BERT in the listed tasks, a modification

                                                  71
of the pre-trained BERT network that uses Siamese and triplet networks to obtain semantically
meaningful sentence attachments that can be compared using cosine similarity. This reduces the effort
to find the closest match from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT while
maintaining BERT accuracy. In addition, SBERT can be used for certain new tasks that, until now, did
not apply to BERT. These tasks include large-scale semantic similarity comparison, clustering, and
information retrieval using semantic search. SBERT has set new leading performance for a variety of
sentence classification and sentence pair regression problems.

3. Data Collection and Preprocessing
    The first step in solving this problem is data collection. Since the goal is to search for academic
disciplines for the development of professional skills and competencies of students, then the data set is
information about all possible skills that students or employees in the field of information technology
may have. To obtain this information, we used the capabilities of the application interface of the EMSI
(Economic Modeling, LLC) project, an analytical company that uses data to identify patterns and trends
in the labor market [11]. EMSI provides a full range of IT skills in English, including "Hard Skill"
(28094 items) and "Soft Skill" (327 items) on the date of the experiment. "Hard Skill" contains the
names of technologies (for example, the names of programming languages or technical terms: Java,
web development, system administration), "Soft Skill" are analogous to personal qualities
(coordination, time management, attention to detail, etc.), Table 1.

Table 1
Example of IT skills from EMSI
                    ID                                      Name                      Type name
 KS126XS6CQCFGC3NG79X                     .NET Assemblies                       Hard Skill
 KS1200B62W5ZF38RJ7TD                     .NET Framework                        Hard Skill
 KS1238D6SC8NVDVCSMRB                     Amazon DynamoDB                       Hard Skill
 KS120NM6L2KDLC2Z9DND                     Artificial Intelligence Systems       Hard Skill
 ES7A07CF87C516875B23                     Accountability                        Soft Skill
 ESCBF9FD5DB49E82C7C9                     Basic Reading                         Soft Skill
 KS95IACBUORYQ0U7M90W                     Calmness Under Pressure               Soft Skill
 KS79O5AFPWQ4IEY9HCH5                     Challenge Driven                      Soft Skill

    Before training the models, it is necessary to clean and preprocess the collected data: convert all
letters to lowercase, remove numbers, extra spaces, special characters, punctuation marks, tokenize
sentences and bring words to their initial form.

4. Application of Pre-Trained Models in the Problem of Finding Skills in the
   Descriptions of Educational Courses

4.1      Methodology

  Let us describe the methodology for extracting professional skills from academic disciplines in
Russian using multilingual pre-trained models. The technique includes the following steps:
  1. Extracting all noun phrases (as the basis for describing skills) of a given length from the
  description of the academic discipline. To implement this, the praise machine library [12] was used,
  which allows extracting a list of noun phrases from a sentence. In addition, the pymorphy2 library
  [13] was used to implement support for the Russian language, and the gensim library was used for
  text preprocessing (tokenization, removal of stop words, etc.) [14].
  2. Vectorization of phrases extracted from the academic discipline using a pre-trained model. The
  result of this step is a set of vectors of a given dimension (1024 for LASER and 512 for SBERT)
  corresponding to the phrases extracted in step 1.
                                                   72
  3. Vectorization of the list of professional skills (EMSI) using a pre-trained model. The result of this
  step is a set of vectors of a given dimension corresponding to hard and soft skills.
  4. Comparison of the vectors obtained in items 2 and 3 by the cosine measure of proximity -
  a measure of similarity between two non-zero vectors calculated as the cosine of the angle between
  them. The cosine measure of proximity the measure changes from 0 to 1. In NLP the closer this
  measure is to 1, the more similar the words (phrases, documents) are.
  In this paper, we tested two pre-trained models LASER and SBERT. Since these models support
many different languages, no translation of the list of skills into Russian is required.

4.2      Experiment
    Let us give an example of searching for skills in the text using the LASER model. As input, the
following text was passed:
    “Features of the profession; setting up a computer and programs; work with data and lists; functions;
Git; algorithms; Linux; frameworks; libraries; work with models and files; testing; teamwork; 3
projects. Teachers: Lev Lunev is a FullStack developer at the German company Innoscripta. Ivan
Kuznetsov is an active Backend developer. After graduation, you will be able to: program in Python;
create web programs and sites, applications; use frameworks and libraries; work with HTML, CSS,
JavaScript; create complex web interfaces; work with databases; optimize unsuccessful applications;
be part of a real project team; work with the Git version control system”. This text was obtained from
the description of the course “Fullstack developer in Python” (from SF Education and translated in
English from Russian).
    The search result is a list of skills, Table~2.

Table 2
Skills extracted from “Fullstack developer in Python” course description
                  Hard Skills                                       Soft Skills
 Linux, Javascript, Python, Microsoft Team Leadership, Team Management, Team Building,
 Windows XP                                  Program Management, Mobile Devices, Web Browsers,
                                             Computer Literacy, Computer Terminals, Evaluating
                                             Staff,  Calendaring Software, Teamwork, Contract
                                             Reviews, Application Development, Checklists, Web
                                             Conferencing, Personal Computers

    We see that the most found skills are “soft” and what is much worse, the model could not find some
of the “Hard skills” such as HTML and CSS.

4.3      Models Testing

    To assess the accuracy of skill search using pre-trained models, test data were prepared. To do this,
a list of noun phrases was extracted from the descriptions of the curriculum of the three disciplines, and
a set of skills was manually selected as an “expected” result, a total of 259 skills. Such a large number
is explained by the fact that several extracted phrases can correspond to the same skill; manual cleaning
and filtering of such phrases was not carried out. Further, using LASER and SBERT models, the same
test disciplines studying programs were analyzed and skills were extracted. The ratio of the number of
“found” skills to the number of “expected” skills was taken as the test accuracy of the models, Table~3.
The LASER model showed an accuracy of 42.54%, and the SBERT model showed an accuracy of
35.07%. To improve the search accuracy, the list of skills was also translated into Russian, the search
accuracy was 51.53% for LASER and 36.27% for SBERT. From the results we can conclude that both
models find less than half of the "expected" skills, this is since the phrases extracted from the curriculum
can contain the names of skills in combination with other words, which reduces the degree of similarity


                                                    73
when comparing vectors in step 4 of the described algorithm. It is also obvious that the accuracy of the
models increases when the list of skills is coerced into the same language as the curriculum.

Table 3
Comparative analysis of search accuracy
      Models         LASER, accuracy     SBERT, accuracy          # of found skills,    # of found skills,
                             %                  %                       LASER                 SBERT
   Skills without   42.54               35.07                    110                   91
    translation
     Skills with    51.53               36.27                    134                   94
    translation


5. Building and Training a Model for Finding Skills in Course Descriptions
    To search for skills in the descriptions of educational courses, a binary classification model was
trained. A train data was containing 28423 “skill phrases” and 6976 “non-skill phrases”. To form a list
of phrases that are not related to professional skills, the educational standard of the specialty "Computer
Security" was taken, since the curricula are drawn up based on the educational standard and contain
similar vocabulary. Noun phrases were extracted from this document using the phrase machine library
[12]. After that, the extracted phrases were preprocessed, as described above, and vectorized using the
LASER model, which showed the best result in the previous experiment, Table 3. Next, the phrases
with which vectors had high cosine similarity (more than 0.9) with vectors of EMSI skills were
removed; the remaining phrases were also “cleaned up” manually. After preparing the data we received
a dataset of 28,423 skill phrases and 6976 non-skill phrases.
    The constructed model is a neural network with the first Embedding layer with dimension 512, which
transforms positive integers (indices) into dense vectors of fixed size as described in section 2.1.
    The second layer of the model is a Bidirectional Long-Short Memory (LSTM) with 150 neurons.
LSTM is a type of recurrent network with a gated structure to learn long-term dependencies of
sequence-based tasks [15]. A common LSTM unit is composed of a cell, an input gate, an output
gate and a forget gate. The cell remembers values over arbitrary time intervals and the
three gates regulate the flow of information into and out of the cell [16]. The Bidirectional LSTM
processes sequence data in both forward and backward directions with two separate hidden layers and
its architecture based on bidirectional RNN [17].
    The last layer of the described model is a fully connected (Dense) layer for classification with one
neuron and sigmoidal activation function [18]. The dense layer is the regular deeply connected neural
network layer. It is the most common and frequently used layer [19]. Neurons of this layer are connected
with all neurons of neighboring layers. Inside each neuron, a weighted sum of signals from the neurons
of the previous layer is calculated. Then the activation function is applied and this value is the output
of the neuron.
    Figure 2 shows the structure of the network and the number of trained parameters. To train the neural
network, the Adam optimizer [20] was used, which minimizes the binary error of cross-entropy with a
learning rate of 0.01.


                                                    74
Figure 2: Model summary

  Two models of the same architecture, described above, were trained on the data prepared in two
ways:
  1) the skills were divided into 2 subclasses – soft skills and hard skills. Hard skills have not been
  translated into Russian, since this list contains mainly the names of technologies, soft skills have
  been translated into Russian;
  2) all skills were combined into one class, and both “soft” and “hard” skills were translated into
  Russian.

   Models were trained on 70% of the data and validate on 30% of data. The accuracy of the
classification model based on validation data for the first model reached 99.25% in five epochs, for the
second model - 96.16% already in the second epoch, then the model started to overfit, the learning
curves are shown in Figures 3-4. When testing models on new data and comparing “found” skills with
a previously extracted list of “expected” skills, the accuracy of the models was 97.51% and 97.38%,
respectively.


Figure 3: Learning curve for the first model


                                                  75
Figure 4: Learning curve for the srcond model

    For the test data, confusion matrices A1 (1) and A2 (2) were built, which show how the models classify
objects of different classes. A confusion matrix summarizes the classification performance of a
classifier concerning some test data. It is a two-dimensional matrix, indexed in one dimension by the
true class of an object and in the other by the class that the classifier assigns [21]. The matrix element
aij shows the number of objects of class i that the model has classified as class j. For the first model, it
can be seen that the third class, which corresponds to “soft” skills, is classified randomly: since this
class is the smallest, the model is aimed at minimizing errors in more numerous classes. The second
model is more likely to make mistakes in the “not skill” class, which is understandable, since the data
were not balanced and the model seeks to minimize the error in a more numerous class, but in general,
the error in the “not skill” class is not that big - a little more than 4%.

                                             2321,20,4                                                (1)
                                      𝐴1 = (175,9036,22).
                                              28,41,34

                                                 2245,100                                             (2)
                                        𝐴2 = (            ).
                                                 206,9130

   As an example, Figure 5 shows a “word cloud” built on the skills extracted from the description of
the discipline “Machine Learning and Data Analysis” (in Russian) using a trained model and then
translated into English. Skills extracted from the description of the academic discipline are presented
on the "word cloud". The larger the word size, the more often this skill is mentioned in the description.


                                                     76
Figure 5: Extracted skills of the discipline “Machine Learning and Data Analysis”


6. Conclusions
   The article describes the application of distributed vector representation models for the automatic
extraction of professional skills and competencies from the descriptions of academic disciplines. The
process of collecting and processing a dataset for training such models is described. A comparative
analysis of the accuracy of the investigated models is carried out to identify the most accurate model.
The process of building and training of neural network model for skills classification based on a
bidirectional LSTM is described. The results are analyzed using confusion matrices. Further
improvement of the accuracy of the developed neural network model is associated with replacing the
embedding layer with pre-built vectors, for example, using the pre-trained LASER model, both with
increasing the dataset for training and balancing it.

7. References

[1] E.V. Kosareva, Investigation of natural language processing methods and their application in job
    online-search, in: Scientific Collection «InterConf», (36): with the Proceedings of the 7th
    International Scientific and Practical Conference «Challenges in Science of Nowadays»,
    November 26-28, 2020, Washington USA, pp. 21-27.
[2] K.V. Kosarava, N. V. Davydik, Topic modeling application for intellectual analysis of reviews in
    Russian, in: Selected Papers of the V International Scientific and Practical Conference "Distance
    Learning Technologies" (DLT 2020), Yalta, Crimea, September 22-25, 2020, CEUR-WS.org,
    online http://ceur-ws.org/Vol-2914/paper13.pdf.
[3] J. Camacho-Collados, M.T. Pilehvar, From word to sense embeddings: a survey on vector
    representations of meaning, Journal of Artificial Intelligence Research 63 (2018) 743 788.
    doi:10.1613/jair.1.11259.
[4] TensorFlow.org,               Word              embeddings,                2012.            URL:
    https://www.tensorflow.org/text/guide/word_embeddings.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, Distributed representations of words
    and phrases and their compositionality, in: Proceedings of the 26th International Conference on
    Neural Information Processing Systems (NIPS'13), 2013 (2), pp. 3111 3119.
[6] X. Qiu, T. Sun, Y. Xu et al. Pre-trained models for natural language processing: A survey, Sci.
    China Technol. Sci. 63 (2020) 1872–1897. doi:10.1007/s11431-020-1647-3.
[7] Open-sourcing enhanced LASER library, Zero-shot transfer across 93 languages. URL:
    https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/.
[8] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, BERT: pre-training of deep bidirectional
    transformers for language understanding, in: Proceedings of the 2019 Conference of the North
    American Chapter of the Association for Computational Linguistics: Human Language
                                                  77
     Technologies, Volume 1, Stroudsburg, PA, USA, 2019, pp. 4171–4186. doi:10.18653/v1/N19-
     1423.
[9] Y.      Liu,     et    al.,  RoBERTa:        A    Robustly   Optimized     BERT      Pretraining
     Approach, ArXiv abs/1907.11692 (2019).
[10] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese
     BERT – Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural
     Language Processing and the 9th International Joint Conference on Natural Language Processing,
     2019. - p. 3982 3992.
[11] Emsi skills. URL: https://skills.emsidata.com/.
[12] Quickly         extract     multi-word        phrases     from      a       corpus.       URL:
     https://github.com/slanglab/phrasemachine.
[13] Morphological                      analyzer                 pymorphy2                     URL:
     https://pymorphy2.readthedocs.io/en/0.2/user/index.html.
[14] Gensim – Beginner's Guide. URL: https://webdevblog.ru/gensim-rukovodstvo-dlya-
     nachinajushhih/.
[15] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, volume 9
     8(1997) 1735–1780. DOI: 10.1162/neco.1997.9.8.1735.
[16] G. Van Houdt, C. Mosquera and G. Nápoles, A review on the long short-term memory
     model, Artif Intell Rev 53(2020) 5929–5955. DOI: 10.1007/s10462-020-09838-1.
[17] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on
     Signal Processing, volume 45 (1997) 2673–2681. DOI: 10.1109/78.650093.
[18] Siddharth Sharma, Simone Sharma, and A. Athaiya, Activation functions in neural networks,
     International Journal of Engineering Applied Sciences and Technology, 4 (2020) 310-316.
     DOI: 10.33564/IJEAST.2020.v04i12.054.
[19] Keras – Dense Layer. URL: https://www.tutorialspoint.com/keras/keras_dense_layer.htm.
[20] D.P. Kingma and J. Ba. Adam, A Method for Stochastic Optimization, In: Proceedings of the 3rd
     International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 2015.
     URL: http://arxiv.org/abs/1412.6980.
[21] K.M. Ting, Confusion Matrix, In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning.
     Springer, Boston, MA, 2011. DOI: 10.1007/978-0-387-30164-8_157.


                                                78