Identification of Semantic Patterns in Full-text Documents Using Neural Network Methods

O. Zolotarev¹, Y. Solomentsev², A. Khakimova³, M. Charnine⁴
ol-zolot@yandex.ru, solomencev-yaroslav@mail.ru, aida_khatif@mail.ru, 1@keywen.com
¹ Russian New University, Moscow, Russia
² Moscow Institute of Physics and Technology, Moscow, Russia
³ Research Center for Physical and Technical Informatics, Nizhny Novgorod, Russia
⁴ Institute of Informatics Problems FRS CSC of the Russian Academy of Sciences, Moscow, Russia

Abstract. Text processing and text mining are becoming increasingly feasible thanks to advances in computer
technology and in artificial intelligence (machine learning). This article describes approaches to the analysis of
natural language texts using morphological, syntactic and semantic analysis. Morphological and syntactic analysis
of the text is carried out using the Pullenti system, which not only normalizes words but also identifies named
entities, their characteristics, and the relationships between them. As a result, a semantic network of related named
entities is built, covering people, positions, geographical names, business associations, documents, education,
dates, etc. The word2vec technology is used to identify semantic patterns in the text based on the co-occurrence of
terms. The possibility of using the described technologies together is considered.

Keywords: intelligent text analysis, natural language, neural networks

Abbreviations
Pullenti = an SDK for extracting named entities from unstructured texts (Puller of Entities).
Word2vec = a technology (set of models, method) for the analysis of the semantics of natural languages.

1. Introduction
      This article is devoted to the development of new approaches to the analysis of natural language texts
based on neural networks. The article also discusses issues of machine learning, the goal of which is to analyze
big data, identify patterns and build data processing algorithms based on the patterns found. Initially, the text is
marked up: sentences and tokens are identified, and morphological analysis of parts of speech is performed. Text
processing is carried out using the Pullenti program [5]. Neural networks are then used to find hidden patterns in
the text. Words are represented as normalized vectors and analyzed with the word2vec method.

2. Features of the Pullenti program
      Pullenti is a program for processing unstructured natural language texts. Its functions include: breaking
text down into words, performing morphological analysis, determining all possible parts of speech of a word
(regardless of context), normalizing words, converting words to the desired case / gender / number, extracting
named entities, numerous functions for working with numeric, noun and verb groups, brackets and quotes, and other
useful features [6].
      Pullenti identifies objects such as persons, organizations, dates, geographic objects, sums of money, etc.
There are specialized analyzers that cover a particular subject area, for example, identifying the structure of a
regulatory act or of a contract with its details, analyzing title pages, literary characters, incidents, etc. [7].
Here is an incomplete list of the named entities that the program extracts: dates, date ranges, phone numbers,
websites, sums of money, bank details, keywords and phrases, definitions, measured values and their ranges,
countries, regions, seas, lakes, planets, addresses, streets, organizations, persons, passport data, electronic
addresses, business facts, links, promotions, product attributes, weapons, relations, etc. The extracted entities
can be represented as a connected graph, see Fig. 1.

      Fig. 1. Graph of selected named entities.


Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
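The representation of extracted entities as a connected graph (Fig. 1) can be illustrated with a minimal sketch. The entity types, names and relations below are hypothetical and are not actual Pullenti output; the sketch only shows one possible way to store such a graph and to check that it is connected.

```python
from collections import defaultdict, deque

# Hypothetical (entity, relation, entity) triples. They are illustrative only
# and were NOT produced by Pullenti.
relations = [
    (("PERSON", "Donald Trump"), "holds position", ("POSITION", "President")),
    (("POSITION", "President"), "of", ("GEO", "USA")),
    (("PERSON", "Donald Trump"), "wrote on", ("ORGANIZATION", "Twitter")),
]


def build_graph(triples):
    """Build an undirected adjacency map from (source, relation, target) triples."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def is_connected(graph):
    """Breadth-first search: True if every entity is reachable from the first one."""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)


entity_graph = build_graph(relations)
print(len(entity_graph), "entities, connected:", is_connected(entity_graph))
```

An adjacency map is used here only for brevity; relation labels could be kept on the edges if the graph is later used for semantic analysis.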
      Pullenti ChatBot technology is designed for developing the intellectual part of chat bots. The technology
is based on the Pullenti SDK (www.pullenti.ru), which contains various linguistic processing procedures, including
morphological analysis. In addition, the technology offers a number of specific handlers for typical situations
arising during a dialogue: for example, extracting a phone number from a sequence in which the digits are given in
words, as happens at the output of automatic speech recognition (ASR) systems; assessing the emotional state;
recognizing a typical situation (agreement, refusal, greeting, ...); etc. The technology is aimed at developing the
part of the chat bot that mimics its “brain”, that is, the part responsible for analyzing text fragments from the
user (for voice input, after it has been recognized), understanding them, extracting data from the text and
generating a text answer.
      Here is an example of using Pullenti from Python (Jupyter Notebook). The following program extracts noun
groups from arbitrary text: «American President Donald Trump wrote on Twitter on Thursday that it was time for
the US to recognize the Golan Heights as Israel in the interests of the security of Israel and the region as a whole.
A number of countries in the Middle East and Europe have already expressed regret in connection with this decision,
and the Russian Foreign Ministry called it irresponsible and leading to the destabilization of the region».
      The result of the program: «['AMERICAN PRESIDENT', 'PRESIDENT', 'TRUMP', 'THURSDAY', 'TWITTER', 'PORA',
'GOLANA HEIGHT', 'HEIGHT', 'INTEREST', 'SECURITY', 'REGION', 'WHOLE', 'SERIES', 'STRANA', 'NEAR EAST', 'EAST',
'EUROPE', 'Uzh', 'COMPLAINT', 'CONNECTION', 'DECISION', 'DECISION', 'MFA', 'MASTER', 'DESTABILIZATION', 'REGION']».
      Pullenti does not include functions for determining context, so the meaning of a word in context must be
determined by other means. One such tool is word2vec, a program from Google.

3. The principle of the word2vec technology based on the Skip-Gram algorithm
      Word2vec is a set of models for the analysis of the semantics of natural languages. The technology is based
on distributional semantics and the vector representation of words [1]. The word2vec model provides two main
algorithms: CBOW and Skip-Gram [8,9]. CBOW predicts the most appropriate word for a given set of words (the
context). Skip-Gram, on the contrary, predicts the most appropriate set of words for a given word. This article
considers the Skip-Gram algorithm.
      Before the appearance of neural networks, the proximity of words was analyzed by compiling a frequency table.
That is, a matrix was built in which the words were laid out both horizontally and vertically, and each cell
contained the frequency with which the word of the given row occurred together with the word of the given column.
      This paper discusses the skip-gram algorithm for predicting a neighboring word. The approach is then
extended to several words [2].
      Pairs of words are fed to the input of the neural network. A window size is chosen, and the window then
slides through the text, producing all pairs of the target word with the other words in the window. If the window
size is one, the window contains one word to the left of the target word and one word to the right of it. If the
window size is two, there are two words on each side of the target word. Below is an example for a window size of
two (articles removed):

      Old abandoned house stands on the edge of the forest (target word: old)
      Training pairs: (old, abandoned), (old, house)

      Old abandoned house stands on the edge of the forest (target word: abandoned)
      Training pairs: (abandoned, old), (abandoned, house), (abandoned, stands)

      Old abandoned house stands on the edge of the forest (target word: house)
      Training pairs: (house, old), (house, abandoned), (house, stands), (house, on)

      Old abandoned house stands on the edge of the forest (target word: stands)
      Training pairs: (stands, abandoned), (stands, house), (stands, on), (stands, edge)

      The neural network learns statistics on the frequency of occurrence of each pair of words. Every word
needs to be converted to numerical form. One common way is to represent it as a column vector (one-hot encoding),
for example:

      $\vec{x} = (0, 1, 0, \ldots, 0)^T$

      Here the word represented by the vector occupies the second position in the dictionary. The transformations
performed by the neural network can be represented as follows (Fig. 2).
      Here x is the input word (or several words) from which we predict, and y is the predicted word (or several
words).
      h (the hidden layer of the neural network) is a vector obtained by multiplying the word vector x by the
matrix of weight coefficients w:

      $\vec{h} = w^T \cdot \vec{x}$

      w is the weight matrix; its dimension is (dictionary length) × (number of features). The number of features
is set once before the neural network is launched and is chosen to obtain the best result. Example: Google used 300
features to train a neural network on Google News data. The weight coefficients initially take random
values and are then adjusted by the error backpropagation method.

      Fig. 2. Schematic diagram of the neural network.

      After the hidden layer h, using another matrix of weight coefficients w′, the vector u is formed:

      $\vec{u} = w'^T \cdot \vec{h} = w'^T \cdot w^T \cdot \vec{x}$

      The dimension of the vector u coincides with the dimension of the vector x.
      To normalize the output vector y to the range [0; 1], we use the softmax function (it serves as the
activation function, see $\sigma(u_i)$ in Fig. 2):

      $y_i = \sigma(u_i) = \frac{e^{u_i}}{\sum_{k=1}^{N} e^{u_k}}$

      where N is the dictionary length (the dimension of the vector u).
      As a result, $y_i$ is the probability of observing (predicting) the i-th word (or phrase) of the dictionary
given the incoming word (context) x.
      The purpose of the neural network shown in Fig. 2 is to determine the weights w and w′. The training
criterion is the maximization of the probability y over all possible output words (phrases). After mathematical
transformations (taking the logarithm of the probability y, then differentiating the logarithm with respect to w′)
we obtain an equation whose optimum cannot be found analytically. Therefore, it is necessary to use numerical
methods. A common choice is the gradient descent method. This leads to an iterative update:

      $w'^{\,new} \leftarrow w'^{\,old} - G(w \cdot (1 - y))$

      Here G is a gradient descent function.
      Thus, if the probability of the sought output word is maximal, the expression in parentheses is close to
zero and $w'^{\,new} \approx w'^{\,old}$. Otherwise, when the probability of the output word is very small, a
proportion of the values of w is subtracted from w′, so the matrix w′ approaches the matrix w.
      Similarly, w can be brought closer to w′:

      $w^{new} \leftarrow w^{old} - G(w' \cdot (1 - y))$

      The method described above is not applied in practice, since computing the softmax function is
time-consuming. Therefore, the authors of word2vec proposed a modification of the algorithm called “Negative
Sampling”. This modification is not covered in this article [3,4].

4. Word2vec example on multiple articles
      In the example below, we process several articles and collections of articles related to virtual reality
and modeling.
      The processed documents are: conference materials on programming and computer science, collections of
articles, presentations, and dissertations. The processing program is written in Python.
      Before processing, the corpus contained 473,337 words.
      After processing with the Pullenti module, 320,564 words remained.
      After training the word2vec module on these documents for 20 cycles (epochs), the 8 words closest to the
word 'virtual' were identified:

      Term       Closest word    Score
      virtual    scholar         0.3870989680290222
                 reality         0.2791272401809692
                 system          0.2551341354846954
                 google          0.2481471002101898
                 pubmed          0.2445603758096695
                 research        0.1964012682437896
                 analysis        0.1937002688646316
                 environment     0.0998007102039157

      As a result, a set of the closest words in the vicinity of the term 'virtual' was obtained. For each
extracted word, its own neighborhood of the most similar words was obtained in turn.
      For each significant term one can thus construct its neighborhood portrait, characteristic of a given
subject area.
      Taken together, the neighborhoods of many significant domain terms form a characteristic portrait of the
domain. The research results can be used as an original method for the semantic comparison and classification of
documents.
      There are many methods of text classification, such as Word Mover's Distance, Smooth Inverse Frequency,
pre-trained encoders and so on, but these methods are not based on building a deep multi-level neighborhood of many
significant terms [10].
      The approach presented in this paper, which constructs a multi-level neighborhood portrait of a document
based on the selection of significant terms using the word2vec algorithm, is original.
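The multi-level neighborhood portrait described in Section 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the three-dimensional vectors are invented toy values, and the function names `nearest` and `neighborhood_portrait` are hypothetical. In practice the neighbors would presumably come from the trained word2vec model (for example via gensim's `model.wv.most_similar`, which is an assumption about the tooling).

```python
import math

# Toy word vectors. These numbers are invented for illustration and are
# NOT taken from the model trained in the paper.
vectors = {
    "virtual":     [0.9, 0.1, 0.0],
    "reality":     [0.8, 0.3, 0.1],
    "environment": [0.6, 0.5, 0.2],
    "system":      [0.2, 0.9, 0.1],
    "analysis":    [0.1, 0.8, 0.4],
    "research":    [0.0, 0.4, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(term, topn):
    """Return the topn (word, similarity) pairs closest to term."""
    scored = [
        (other, cosine(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def neighborhood_portrait(term, topn=3, depth=2):
    """Collect new neighbors level by level, starting from term.

    Level 1 holds the closest words of the term itself; level 2 holds
    words first reached through the level-1 words, and so on.
    """
    portrait = {}
    seen = {term}
    frontier = [term]
    for level in range(1, depth + 1):
        next_frontier = []
        for current in frontier:
            for neighbor, _score in nearest(current, topn):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        portrait[level] = next_frontier
        frontier = next_frontier
    return portrait


portrait = neighborhood_portrait("virtual", topn=3, depth=2)
for level, words in portrait.items():
    print(level, words)
```

Two documents could then be compared by the overlap of the portraits of their significant terms; the choice of `topn` and `depth` is a free parameter of this sketch.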
      In this work, we use only some of Pullenti's general-purpose functions for extracting entities. Pullenti
can use different libraries for different situations. The quality of the resulting model depends on the class to
which the text belongs. Classifying texts with neural networks will allow us to choose specialized text processing
methods and improve the quality of the resulting model.
      For complex mining tasks in Pullenti, a higher-level representation of data may be required.
      Pullenti identifies named entities by constructing a chain of adjacent words. The use of neural networks
and, in particular, the gensim library for additional analysis of the text allows us to identify significant
multi-word terms whose parts are quite far from each other in the sentence. In this case, it will be possible to form
semantic named entities and identify them throughout the text based on the analysis of the word environment.

Conclusion
      An example of the work of the Pullenti program has been analyzed, and a drawback has been revealed: the
program does not determine the context of words.
      An example of the work of the word2vec technology has been analyzed, and the problem of training on a small
amount of data has been revealed.
      During the training of the word2vec model, satisfactory results were obtained with the number of cycles
equal to 20.
      The use of methods based on neural networks for the analysis of texts will allow us to move from text
parsing to partially semantic modeling.
      The approach outlined in this paper can be used to analyze texts and to compare and classify documents.

Acknowledgments
      This work is supported by the Russian Foundation for Basic Research, grants 18-07-01111, 18-07-00909,
19-07-00857 and 16-29-09527.
      We are grateful to the Russian Foundation for Basic Research for financial support of our projects.

References
[1] Word2Vec: how to work with vector representations of words [Electronic resource]. //
    https://neurohive.io/ru/osnovy-data-science/word2vec-vektornye-predstavlenija-slov-dlja-mashinnogo-obuchenija/
    (accessed 08/04/2019).
[2] Word2Vec Tutorial - The Skip-Gram Model [Electronic resource]. //
    http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/ (accessed 08/04/2019).
[3] Ali Ghodsi, Lec 13: Word2Vec Skip-Gram [Electronic resource]. //
    https://www.youtube.com/watch?v=GMCwS7tS5ZM/ (accessed 08/04/2019).
[4] models.word2vec - Word2vec embeddings [Electronic resource]. //
    https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec/ (accessed 08/04/2019).
[5] Zolotarev O.V., Sharnin M.M., Klimenko S.V., Kuznetsov K.I. System PullEnti - extracting information from
    natural language texts and automated building of information systems // Proceedings of the International
    Conference. Situation centers and class 4i information and analytical systems for monitoring and security
    tasks. SCVRT2015-16, Pushchino, TsarGrad, November 21-24, 2015-2016, Pushchino, pp. 28-35.
[6] Peters M., Neumann M., Iyyer M. et al. Deep Contextualized Word Representations // Proceedings of the 2018
    Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
    Technologies. Association for Computational Linguistics, 2018. Pp. 2227–2237.
[7] Zolotarev O.V., Sharnin M.M., Klimenko S.V., Matskevich A.G. Research of methods of automatic formation of
    associative-hierarchical portrait of the subject area // Bulletin of the Russian New University. Series
    "Complex systems: models, analysis and management." 2018. № 1. Pp. 91–96.
[8] Mikolov T., Sutskever I., Chen K. et al. Distributed Representations of Words and Phrases and their
    Compositionality // NIPS / Ed. by Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, Kilian Q.
    Weinberger. 2013. Pp. 3111–3119.
[9] Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching Word Vectors with Subword Information //
    Transactions of the Association for Computational Linguistics. 2017. Vol. 5. Pp. 135–146.
[10] Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching Word Vectors with Subword Information //
    Transactions of the Association for Computational Linguistics. 2017. Vol. 5. Pp. 135–146.