Identification of Semantic Patterns in Full-text Documents Using Neural Network Methods

O. Zolotarev1, Y. Solomentsev2, A. Khakimova3, M. Charnine4
ol-zolot@yandex.ru, solomencev-yaroslav@mail.ru, aida_khatif@mail.ru, 1@keywen.com
1 Russian New University, Moscow, Russia
2 Moscow Institute of Physics and Technology, Moscow, Russia
3 Research Center for Physical and Technical Informatics, Nizhny Novgorod, Russia
4 Institute of Informatics Problems FRS CSC of the Russian Academy of Sciences, Moscow, Russia

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Text processing and text mining are becoming increasingly feasible thanks to advances in computer technology and in artificial intelligence (machine learning). This article describes approaches to the analysis of natural language texts using methods of morphological, syntactic and semantic analysis. Morphological and syntactic analysis of the text is carried out using the Pullenti system, which not only normalizes words but also identifies named entities, their characteristics, and the relationships between them. As a result, a semantic network of related named entities is built, such as people, positions, geographical names, business associations, documents, education, dates, etc. The word2vec technology is used to identify semantic patterns in the text based on the co-occurrence of terms. The possibility of using the described technologies together is considered.

Keywords: intelligent text analysis, natural language, neural networks

Abbreviations
Pullenti = SDK for extracting named entities from unstructured texts (Puller of Entities).
Word2vec = a technology (set of models, method) for the analysis of the semantics of natural languages.

1. Introduction
This article is devoted to the development of new approaches to the analysis of natural language texts based on neural networks. The article also discusses issues of machine learning, the goal of which is to analyze big data, identify patterns and build data processing algorithms based on the patterns found.

Initially, the text is marked up: sentences and tokens are identified, and morphological analysis of parts of speech takes place. Text processing is carried out using the program Pullenti [5]. Neural networks are then used to reveal hidden patterns in the text. Words for analysis are represented as normalized vectors. The word2vec method is used for the analysis.

2. Features of the Pullenti program
Pullenti is a program for processing unstructured natural language texts. Its functions include breaking text down into words, performing morphological analysis, determining all possible parts of speech of a word (regardless of context), normalizing words, bringing words to the desired case / gender / number, identifying named entities, a number of functions for working with numeric, nominal and verbal groups, brackets and quotes, and other useful features [6].

Pullenti distinguishes such objects as persons, organizations, dates, geographic objects, sums of money, etc. There are specialized analyzers that cover a particular subject area, for example, identifying the structure of a regulatory act or of a contract with its details, analyzing title pages, literary characters, incidents, etc. [7]. Here is an incomplete list of the named entities that the program identifies: dates, date ranges, phone numbers, websites, sums of money, bank details, keywords and phrases, definitions, measured values and their ranges, countries, regions, seas, lakes, planets, addresses, streets, organizations, persons, passport data, electronic addresses, business facts, links, promotions, product attributes, weapons, relations, etc. The selected entities can be represented as a connected graph, see Fig. 1.

Fig. 1. Graph of selected named entities.
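The idea from Section 2 of representing selected named entities as a connected graph can be sketched in plain Python. This is only an illustration under stated assumptions: the entity types, entity values and relations below are hypothetical and are not Pullenti output, and the helper names `build_graph` and `is_connected` are invented for this sketch rather than taken from the Pullenti SDK.

```python
from collections import defaultdict, deque

# Hypothetical (entity_type, value) pairs linked by a relation.
# These are illustrative placeholders, not entities extracted by Pullenti.
relations = [
    (("PERSON", "Person A"), ("POSITION", "Director")),
    (("PERSON", "Person A"), ("ORGANIZATION", "Organization B")),
    (("ORGANIZATION", "Organization B"), ("GEO", "Moscow")),
    (("PERSON", "Person A"), ("DATE", "2019")),
]


def build_graph(pairs):
    """Build an undirected adjacency map from pairs of related entities."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def is_connected(graph):
    """Return True if every entity is reachable from any other (breadth-first search)."""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(graph)


graph = build_graph(relations)
```

An adjacency map keeps the sketch free of dependencies; a graph library could be substituted if drawing or richer graph algorithms are needed.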
Pullenti ChatBot technology is designed for developing the intellectual part of chat bots. The technology is based on the SDK Pullenti (www.pullenti.ru), which contains various linguistic processing procedures, including morphological analysis. In addition, the technology offers a number of specific handlers for typical situations arising in the course of a dialogue. Examples include extracting a phone number from a sequence in which the digits are given as words, as happens at the output of automatic speech recognition (ASR) systems; assessing emotional state; and recognizing typical situations (agreement, refusal, greetings ...). The technology is aimed at developing the part of the chat bot that mimics its “brain”, that is, the part responsible for analyzing text fragments from the user (for voice input, after recognition), understanding them, extracting data from the text and generating a text answer.

Here is an example of using Pullenti through Python (Jupyter Notebook). The program selects noun groups from arbitrary text: «American President Donald Trump wrote on Twitter on Thursday that it was time for the US to recognize the Golan Heights as Israeli in the interests of the security of Israel and the region as a whole. A number of countries in the Middle East and Europe have already expressed regret in connection with this decision, and the Russian Foreign Ministry called it irresponsible and leading to the destabilization of the region».

The result of the program: «['AMERICAN PRESIDENT', 'PRESIDENT', 'TRUMP', 'THURSDAY', 'TWITTER', 'PORA', 'GOLANA HEIGHT', 'HEIGHT', 'INTEREST', 'SECURITY', 'REGION', 'WHOLE', 'SERIES', 'STRANA', 'NEAR EAST', 'EAST', 'EUROPE', 'Uzh', 'COMPLAINT', 'CONNECTION', 'DECISION', 'DECISION', 'MFA', 'MASTER', 'DESTABILIZATION', 'REGION']».

Pullenti has no functions for determining context, so the meaning of a word must be determined by other means. One such tool is word2vec, a program from Google.

3. The principle of the word2vec technology based on the Skip-Gram algorithm
Word2vec is a set of models for the analysis of the semantics of natural languages, based on distributional semantics and the vector representation of words [1]. The word2vec model provides two main algorithms: CBOW and Skip-Gram [8,9]. CBOW determines the most appropriate word for a given set of words (the context). Skip-Gram, on the contrary, determines the most appropriate set of words for a given word. This article considers the Skip-Gram algorithm.

Before the appearance of neural networks, the proximity of words was analyzed by compiling a frequency table. That is, a matrix was built in which the words were laid out both horizontally and vertically, and each cell contained the frequency with which the word of that row occurred together with the word of that column.

This paper discusses the skip-gram algorithm for predicting a neighboring word. The approach is then extended to several words [2].

Pairs of words are fed to the input of the neural network. A window size is chosen, and the window then slides through the text, producing all pairs of the target word with the other words in the window. If the window size is one, the window contains one word to the left of the target word and one word to the right of it. If the window size is two, there are two words on each side of the target word. Below is an example for a window size of two (articles removed):

Old abandoned house stands on the edge of the forest

Target word "old". Training phrases: (old, abandoned), (old, house)
Target word "abandoned". Training phrases: (abandoned, old), (abandoned, house), (abandoned, stands)
Target word "house". Training phrases: (house, old), (house, abandoned), (house, stands), (house, on)
Target word "stands". Training phrases: (stands, abandoned), (stands, house), (stands, on), (stands, edge)

The neural network learns statistics on the frequency of occurrence of each pair of words. Every word needs to be converted to numerical form. One common way is to represent it as a column vector (one-hot encoding), for example:

$$\vec{x} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

Here the word represented by the vector occupies the second place in the dictionary.

The transformations performed by the neural network can be represented as follows (Fig. 2). Here x is the input word (or several words) from which we want to predict, and y is the predicted word (or several words).

h (the hidden layer of the neural network) is a vector obtained by multiplying the word vector x by the matrix of weight coefficients w:

$$\vec{h} = w^{T} \cdot \vec{x}$$

w is a matrix of weights with dimension (dictionary length) × (number of features). The number of features is set once before the neural network is launched and is chosen to obtain the best result. Example: Google used 300 features to train a neural network on a Google News dataset. The weight coefficients initially take random values and are then adjusted by the backpropagation method.

Fig. 2. Schematic diagram of the neural network.

After the hidden layer h, the vector u is formed using another matrix of weight coefficients w':

$$\vec{u} = {w'}^{T} \cdot \vec{h} = {w'}^{T} \cdot w^{T} \cdot \vec{x}$$

The dimension of the vector u coincides with the dimension of the vector x.

To normalize the output vector y to the range [0; 1], we use the softmax function (it is used as the activation function, see $\sigma(u_i)$ in Fig. 2):

$$y_i = \sigma(u_i) = \frac{e^{u_i}}{\sum_{k=1}^{N} e^{u_k}}$$

where N is the length of the dictionary (the dimension of the vector u).

As a result, $y_i$ is the probability of observing (predicting) the i-th word (or phrase) of the dictionary given the incoming word (context) x.

The purpose of the neural network shown in Fig. 2 is to determine the weights w and w'. The criterion for convergence of the calculations is the maximization of the probability y over all possible output words (phrases). As a result of mathematical transformations (taking the logarithm of the probability y, then calculating the derivative of this logarithm with respect to the variable w') we get an equation whose optimum cannot be found analytically. Therefore, numerical methods are needed. A widely used numerical method is gradient descent. The result is the following iterative update:

$${w'}^{\text{new}} \leftarrow {w'}^{\text{old}} - G\bigl(w \cdot (1 - y)\bigr)$$

Here G is a gradient descent function.

Thus, if the probability of the sought output word is maximal, the expression in parentheses is close to zero, and ${w'}^{\text{new}} \approx {w'}^{\text{old}}$. Otherwise, when the probability of the output word is very small, a proportion of the values of w is subtracted from w', so the matrix w' approaches the matrix w. Similarly, w can be brought closer to w':

$$w^{\text{new}} \leftarrow w^{\text{old}} - G\bigl(w' \cdot (1 - y)\bigr)$$

The method described above is not applied in practice, since calculating the softmax function is computationally expensive. Therefore, the authors of word2vec proposed a modification of the algorithm called “Negative Sampling”. This modification is not covered in this article [3,4].

4. Word2vec example on multiple articles
In the example below, we process several articles and collections of articles related to virtual reality and modeling. The processed documents are materials of conferences on programming and computer science, collections of articles, presentations and dissertations. The processing program is written in Python.

Before processing, the corpus contained 473 337 words. After processing by the Pullenti module, 320 564 words remained.

After processing the above documents with the word2vec module for 20 training cycles, the 8 words closest to the word 'virtual' were identified, together with their similarity scores:

Virtual
scholar      0.3870989680290222
reality      0.2791272401809692
system       0.2551341354846954
google       0.2481471002101898
pubmed       0.2445603758096695
research     0.1964012682437896
analysis     0.1937002688646316
environment  0.0998007102039157

As a result, a set of the closest words in the neighborhood of the term 'virtual' was identified. For each extracted word, its own neighborhood of the most similar words was identified in turn.

For each significant term one can construct its neighborhood portrait, characteristic of a given subject area. Taken together, the neighborhoods of many significant domain terms form a characteristic portrait of the domain. The research results can be used as an original method for the semantic comparison and classification of documents.

There are many methods of text classification, such as Word Mover's Distance, Smooth Inverse Frequency, pre-trained encoders and so on, but these methods are not based on building a deep multi-level neighborhood of many significant terms [10].

The approach presented in this paper, constructing a multi-level neighborhood portrait of a document based on the selection of significant terms using the word2vec algorithm, is original.

In this work, we use only certain general-purpose functions of Pullenti for identifying some entities. Pullenti can use different libraries for different situations. The quality of the resulting model depends on the class to which the text belongs. Classifying texts with neural networks will allow us to choose specialized text processing methods and improve the quality of the resulting model.

For complex mining tasks in Pullenti, a higher-level representation of data may be required.

Pullenti identifies named entities by constructing a chain of adjacent words. The use of neural networks, and in particular the gensim library, for additional analysis of the text allows us to identify significant multi-word terms whose words are quite far from each other in the sentence. In this case, it will be possible to form semantic named entities and identify them throughout the text based on the analysis of the word environment.

Conclusion
An example of the work of the program Pullenti has been analyzed, and a drawback has been revealed: the program does not determine the context of words.

An example of the work of the word2vec technology has been analyzed, and the problem of training on a small amount of data has been revealed.

During the training of the word2vec model, satisfactory results were obtained with the number of training cycles equal to 20.

The use of methods based on neural networks for the analysis of texts will allow us to move from text parsing to partially semantic modeling.

The approach outlined in this paper can be used to analyze texts and to compare and classify documents.

Acknowledgments
This work is supported by the Russian Foundation for Basic Research, grants 18-07-01111, 18-07-00909, 19-07-00857 and 16-29-09527.

We are grateful to the Russian Foundation for Basic Research for financial support of our projects.

References
[1] Word2Vec: how to work with vector representations of words [Electronic resource]. // https://neurohive.io/ru/osnovy-data-science/word2vec-vektornye-predstavlenija-slov-dlja-mashinnogo-obuchenija/ (accessed 08/04/2019).
[2] Word2Vec Tutorial - The Skip-Gram Model [Electronic resource]. // http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/ (accessed 08/04/2019).
[3] Ali Ghodsi, Lec 13: Word2Vec Skip-Gram [Electronic resource]. // https://www.youtube.com/watch?v=GMCwS7tS5ZM/ (accessed 08/04/2019).
[4] models.word2vec - Word2vec embeddings [Electronic resource]. // https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec/ (accessed 08/04/2019).
[5] Zolotarev O.V., Sharnin M.M., Klimenko S.V., Kuznetsov K.I. System PullEnti - extracting information from natural language texts and automated building of information systems // Proceedings of the International Conference. Situation centers and class 4i information and analytical systems for monitoring and security tasks. SCVRT2015-16, Pushchino, TsarGrad, November 21-24, 2015-2016, Pushchino, pp. 28-35.
[6] Deep Contextualized Word Representations / Matthew Peters, Mark Neumann, Mohit Iyyer et al. // Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2018. Pp. 2227–2237.
[7] Zolotarev O.V., Sharnin M.M., Klimenko S.V., Matskevich A.G. Research of methods of automatic formation of associative-hierarchical portrait of the subject area // Bulletin of the Russian New University. Series "Complex systems: models, analysis and management." 2018. № 1. Pp. 91–96.
[8] Distributed Representations of Words and Phrases and their Compositionality / Tomas Mikolov, Ilya Sutskever, Kai Chen et al. // NIPS / Ed. by Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, Kilian Q. Weinberger. 2013. Pp. 3111–3119.
[9] Enriching Word Vectors with Subword Information / Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov // Transactions of the Association for Computational Linguistics. 2017. Vol. 5. Pp. 135–146.
[10] Enriching Word Vectors with Subword Information / Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov // Transactions of the Association for Computational Linguistics. 2017. Vol. 5. Pp. 135–146.