Application of the Paragraph Vectors Model for Semantic Text Analysis

Irina Gruzdo 1[0000-0002-4399-2367], Iryna Kyrychenko 1[0000-0002-7686-6439], Glib Tereshchenko 1[0000-0001-8731-2135], Olga Cherednichenko 2[0000-0002-9391-5220]

1 Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
{irina.gruzdo, iryna.kyrychenko, hlib.tereshchenko}@nure.ua
2 National Technical University "KhPI", Kharkiv, Ukraine
olha.cherednichenko@gmail.com

Abstract. The paper examines the paragraph vectors model and its two training methods, distributed memory and distributed bag of words. The peculiarity of this model lies in defining objective functions for individual sentences and representing them as local vectors, from which a global vector is constructed that determines the semantic component of the text as a whole. Various aspects of applying the distributed memory and distributed bag of words methods are considered, together with the underlying sets of algorithms that produce distributed vectors of text parts for the problem of finding similar articles, where the search is carried out by keywords, by annotations, and by articles of various sizes. It was experimentally established that Doc2Vec with its bag-of-words method most fully allows borrowings and analogues to be identified with respect to the structural elements of the text, in accordance with the review and the task at hand. The bag-of-words method also allows the user to build an accurate picture of the lexical meaning of a word and its semantic relations in language and texts.

Keywords: Text Meaning Definition, Semantic Analysis, Latent-Semantic Analysis, Experiment, Textual Information, Model, Semantic Analysis Library, Text Analysis, Text Fragment.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

At the present stage of development of information technologies, both worldwide and in Ukraine, increasing attention is paid to tasks related to the processing of textual information, such as plagiarism detection, text recognition, extraction of the structural blocks of a text, analysis and issuing of recommendations, etc. [1, 2, 3]. Among these tasks, one of the essential problems, which has been worked on for more than 60 years and remains a "cornerstone", is semantic analysis of text [1, 4, 5]. Approaches to checking semantic correctness are shown in [9-15]. An analysis of the primary sources, starting from the first works devoted to semantic analysis, reveals a tendency to divide the work into studies that address abstract theoretical problems and studies aimed at facilitating work with a computer and at software implementation of solutions. It should be noted that all the approaches and models described in them are aimed at solving a specific problem and can therefore be applied only to a narrow circle of subject areas, and the mathematical apparatus is suitable only for the formalization of some linguistic mechanisms [14, 15]. Determining the meaning of texts would, to a considerable degree, allow a number of text processing tasks to be solved more correctly, since it improves the analysis procedure.
In [4], a review and analysis of existing solutions for determining the meaning of text documents was carried out, the most widely used models and methods of semantic text processing were considered, and the classical text processing pipeline for semantic analysis was described. It should also be noted that when the models considered in [4] are applied in practice, in most cases part of the meaning of the text is lost. This is not always justified from the point of view of the problem being solved, although it still allows some procedures of semantic analysis to be performed. Among the considered models, the paragraph vectors model makes it possible to solve text processing problems such as plagiarism detection and text recognition with greater accuracy. For this reason, this article considers practical aspects of determining the meaning of textual information using the paragraph vectors model. The behaviour of the paragraph vectors model on various text blocks is examined in more detail for the problem of finding similar documents. Two methods are considered: distributed memory and distributed bag of words. The paper also tests the paragraph vectors model on text fragments of different lengths using the Wikipedia and AP News models, namely for the task of finding similar articles, where the search is conducted by keywords, by annotation, and by articles of different sizes.

The purpose of this work is to test the applicability of the paragraph vectors model for semantic text analysis as a subtask of text processing problems such as plagiarism detection and text recognition, and to justify the choice of the considered method for practical use. As a result, a reasoned conclusion is drawn about the applicability of the paragraph vectors model for semantic text analysis as such a subtask.

2 The Model Under Study

The paragraph vectors model is described in detail in [6, 7, 8]. When determining the meaning of text documents, this model adds to the standard language model a memory vector aimed at capturing the topic of the document. The model can work with documents of different lengths: fragments of texts, sentences, paragraphs, and whole documents. This is why the model was chosen for analysis: it makes it possible to identify borrowings and analogues with respect to the structural elements of the text, and it also satisfies the search task by keywords, by annotation, and by articles of different sizes.

In its classical form, the paragraph vectors model solves the problem of predicting a word given the other words in the context of the analysed text. Each word is mapped to a unique vector stored as a column of a matrix; the column is indexed by the position of the word in the dictionary. The concatenation or sum of these vectors is then used as features to predict the next word in the sentence. In general, the paragraph vectors model can be represented schematically (see Fig. 1).

Fig. 1. Paragraph vector associative memory model for an input sentence [2]
The figure shows that the paragraph vector is concatenated or averaged with the local contextual word vectors to predict the next word. The prediction task updates both the word vectors and the paragraph vector.

It should be noted that the classical description of the model is not sufficient, on its own, for implementing the calculations in software, testing them in practice, and drawing well-founded conclusions about its applicability to the document similarity task. Moreover, in its classical form, as the amount of data grows, the size of the corpus and therefore the dimensionality of the vectors also grow, which leads to high computational complexity. It is therefore necessary to consider the model in more detail and describe its operation in the form of an algorithm, and for this one needs to understand which methods underlie it.

With the above in mind, it should be noted that the paragraph vectors model consists of two methods: distributed memory (DM) and distributed bag of words (DBOW). The DM method predicts a word from the known preceding words and the paragraph vector. DBOW predicts random groups of words in a paragraph only on the basis of the paragraph vector. These methods are fully implemented in Python by the Word2Vec and Doc2Vec algorithms in the gensim library. Let us consider their work in more detail.

Word2Vec is a set of algorithms for computing vector representations of words under the assumption that words used in similar contexts are semantically close. First a dictionary is created, and then the vector representations of the words are computed. A vector representation is based on contextual proximity: words that occur in the text next to the same words will have close vector coordinates. The Word2Vec algorithm works as follows [9]:

- the corpus is read and the number of occurrences of each word in the corpus is counted;
- the array of words is sorted by frequency and rare words are removed;
- a Huffman tree is built for encoding the dictionary, which greatly reduces the computational and time complexity of the algorithm;
- a so-called sentence, the basic unit of the corpus (a sentence, paragraph, or article), is read from the corpus, after which subsampling is performed to exclude the most frequent words from the analysis;
- the sentence is traversed, taking into account the maximum distance (window) between the current word and the predicted word;
- a feed-forward neural network is then applied with a hierarchical softmax and/or negative sampling activation function.

Word2Vec is based on two algorithms: Continuous Bag-of-Words (CBOW) and Skip-gram. CBOW and Skip-gram are neural network architectures that describe exactly how the network "learns" from data and "remembers" the representations of words. Their working principles differ: CBOW predicts a word for a given context, while Skip-gram predicts a context for a given word.

The operation of the Continuous Bag-of-Words (CBOW) algorithm is described in detail in the article by David Meyer [6]. The basis of CBOW is the log-likelihood

ℒ = Σ_{w∈D} log p(w | c, θ),   (1)

where θ are the model parameters, w is the current word, c is the context of the current word, and D is the set of words in the training collection.
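To make formula (1) concrete, the following toy sketch (not the gensim implementation; the vocabulary and the randomly initialized embedding matrices are invented for illustration) averages the context word vectors and applies a softmax over the vocabulary to obtain log p(w | c, θ) for one context-word pair.

import numpy as np

# Toy vocabulary and made-up embedding matrices (illustration only, not the gensim code).
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 4                       # vocabulary size, embedding dimensionality
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))             # input (context) embeddings
W_out = rng.normal(size=(V, D))            # output (prediction) embeddings

def cbow_log_prob(context_words, target_word):
    """log p(target | context, theta) with an averaged context vector and a softmax."""
    idx = [vocab.index(w) for w in context_words]
    h = W_in[idx].mean(axis=0)             # averaged context representation
    scores = W_out @ h                     # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                   # softmax over the whole vocabulary
    return np.log(probs[vocab.index(target_word)])

# Contribution of one (context, word) pair to the objective in (1).
print(cbow_log_prob(["the", "sat", "on"], "cat"))

Summing such terms over all words of the collection gives the objective that CBOW maximizes during training.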
Schematically, the operation of CBOW can be represented as shown in Fig. 2. When considering CBOW training, it should be taken into account that the same input weight matrix is shared by the several input vectors representing the different context words. Training is performed by a simple neural network in a pass over the entire collection. Input: a one-hot representation of a word (a vector of length |V|). Output: a distribution over the words of the collection (a vector of length |V|). The probability p(w | c, θ) is modelled by a softmax function. The training complexity of CBOW is

Q1 = O(N × D + D × log2 |V|),   (2)

where N is the number of context words, D is the dimensionality of the vectors, and |V| is the size of the vocabulary. The output layer remains the same, and training is performed in the manner described above.

Fig. 2. Continuous Bag-of-Words [7]

Skip-gram is based on maximizing the objective function

L = Σ_{t,k} Σ_{j∈Context_k(t)} log P(w_j | w_t).   (3)

The Skip-gram neural network is a two-layer network whose second layer implements a hierarchical softmax. The cardinal difference from the CBOW model is that the word w_t is predicted as many times as there are words in its context, and each time the prediction is based on only one of the context words. The operation of Skip-gram can be represented graphically (see Fig. 3).

Fig. 3. Neural network for Skip-gram [8]

The training complexity of Skip-gram is

Q2 = O(N × D + N × D × log2 |V|).   (4)

According to the authors, training the Skip-gram model is more expensive; however, according to [12], this model usually gives better results. Given the above, we can conclude that CBOW trains faster while Skip-gram works better, especially for relatively rare words. Since the words of a Word2Vec corpus may appear anywhere in it, each associated vector receives several adjustments early, in the middle, and late in the process as the model improves, even within a single pass.
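To illustrate how these two variants are selected in practice, the following sketch uses the gensim library mentioned above; the toy corpus is invented, the parameter names follow gensim 4.x (vector_size, epochs), and the sg flag switches between CBOW and Skip-gram, while hs and negative choose hierarchical softmax or negative sampling.

from gensim.models import Word2Vec

# Invented toy corpus: one tokenized sentence per list.
sentences = [
    ["semantic", "analysis", "of", "text", "documents"],
    ["paragraph", "vectors", "for", "semantic", "text", "analysis"],
    ["plagiarism", "detection", "in", "text", "documents"],
]

# CBOW (sg=0): predict a word from its averaged context; hs=1 enables hierarchical softmax.
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                sg=0, hs=1, epochs=50)

# Skip-gram (sg=1): predict context words from the current word; negative sampling instead of hs.
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                    sg=1, hs=0, negative=5, epochs=50)

# Words used in similar contexts should obtain close vectors.
print(cbow.wv.most_similar("text", topn=3))
print(skipgram.wv.most_similar("text", topn=3))

On such a tiny corpus the nearest neighbours are not meaningful; the sketch only shows how the two architectures are chosen and queried.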
Doc2Vec (originally called Paragraph Vector), like Word2Vec, is a set of algorithms that produce distributed vectors for parts of texts [12]. The texts can be of variable length, from a sentence to a large document. Doc2Vec allows working simultaneously with the words and with the labels (tags) of a document, which is very useful for the problem of finding similar documents as a subtask of plagiarism detection.

In Doc2Vec, the vector representations of documents are trained to predict the words in a document; more precisely, a document vector is combined with several word vectors from that document in an attempt to predict the next word from the context. Word and document vectors are trained with stochastic gradient descent and backpropagation. Document vectors are unique, while the vectors of the same word in different documents are the same. There are two architectures for building vector representations of documents: Distributed Memory (DM, D2V-DM) and Distributed Bag-of-Words (DBOW, D2V-DBOW).

In Distributed Memory, each document is represented by a unique vector stored as a column of one matrix, and each word is represented by a unique vector stored as a column of another matrix. The document vector and the word vectors of its context are concatenated or averaged to predict the next word from the context. Fig. 4 shows a graphical interpretation of this architecture.

Fig. 4. Distributed Memory (DM, D2V-DM)

In Distributed Memory, the document token can be thought of as an additional word. It acts as a memory that remembers what is missing from the current context, that is, the topic of the document. For this reason the model is called Distributed Memory.

Distributed Bag-of-Words is simpler than DM: it ignores word order and, as a result, the training phase is faster. DBOW ignores the context words at the input and instead predicts words randomly sampled from the output document. At each iteration of stochastic gradient descent, a text window is sampled, a random word is then sampled from that window, and a classification task is formed given the document vector. Doc2Vec is a logical development of the Word2Vec model, which implements the popular bag-of-words approach; the difference is that when building sentence vectors, word order is taken into account. Fig. 5 shows a graphical interpretation of this architecture [8].

Fig. 5. Doc2Vec - DBOW (distributed bag of words)

In Doc2Vec, document vectors and word vectors can be placed in "the same space", which makes the document vectors more interpretable through their proximity to words. It should also be remembered that the learning rate decreases across iterations over the labelled sentences during training, and this can sometimes lead to non-optimal results; it is therefore necessary to double-check the results by passing over the data several times. Because Doc2Vec usually uses a unique identifier tag for each document, a larger number of iterations may be important so that each document vector is updated several times as training progresses and the model gradually improves.

According to its authors, Doc2Vec is faster than Word2Vec and consumes less memory, since there is no need to store the word vectors. Although Word2Vec is good at representing individual words as vectors, it was not intended to produce a single vector from the several words found in a sentence, paragraph, or document. Doc2Vec also makes it possible to compute the cosine similarity between two documents. Word2Vec shows high results on very short texts, such as messages on the social network Twitter [8, 10], whereas Doc2Vec works more efficiently with longer messages. In view of the above, Doc2Vec was chosen for the experiment on finding similar documents as a subtask of plagiarism detection.

3 Experiment

Independent software assessments at the design stage are in most cases performed by people who do not always take into account the relationship between software quality, the development team, and the resources available to the project, which in turn leads to erroneous results and is impractical. Word2Vec and Doc2Vec are implemented in Python in the gensim library.
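As a hedged illustration of that gensim interface, the sketch below trains a DBOW model (dm=0; dm=1 would select the Distributed Memory variant), infers a vector for an unseen fragment, and retrieves the most similar training documents by cosine similarity; the documents and tags are invented, and the attribute names (TaggedDocument, infer_vector, dv) follow gensim 4.x.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Invented mini-corpus: each document receives a unique tag.
raw_docs = {
    "doc_keywords": "semantic analysis paragraph vectors",
    "doc_abstract": "the paper examines paragraph vectors for finding similar articles",
    "doc_article":  "doc2vec builds document vectors that can be compared by cosine similarity",
}
corpus = [TaggedDocument(words=text.split(), tags=[tag])
          for tag, text in raw_docs.items()]

# dm=0 selects Distributed Bag-of-Words (DBOW); dm=1 would select Distributed Memory (DM).
# dbow_words=1 additionally trains word vectors alongside the document vectors.
model = Doc2Vec(corpus, vector_size=50, window=3, min_count=1,
                dm=0, dbow_words=1, epochs=100)

# Infer a vector for a new fragment and retrieve the closest training documents.
query = "finding similar articles with paragraph vectors".split()
query_vec = model.infer_vector(query)
print(model.dv.most_similar([query_vec], topn=3))

The same model object can compare two arbitrary texts by inferring a vector for each and computing the cosine similarity between them.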
For the experiment, Doc2Vec with the bag-of-words (DBOW) method was chosen, as it makes it possible to identify borrowings and analogues with respect to the structural elements of the text and also satisfies the search task by keywords, by annotation, and by articles of different sizes. The speed of execution and text analysis of the considered models was also studied. Two pre-trained models were compared:

- Wikipedia, a model trained on articles of the English-language Wikipedia;
- AP News, a model trained on Associated Press News articles.

For the problem of finding similar articles, the analysis was conducted depending on the number of words in the text: by keywords, where the size of the textual information is from 1-5 up to 75 words; by annotation, from 75-150 up to 400 words; and by articles of various sizes, from 400-500 up to 2000 words (small) and 2000-3000 words (average). Table 1 summarizes the average processing time of text fragments of different lengths for both models.

Table 1. Average processing time of text fragments of different lengths for the considered models.

  Text fragment                 Wikipedia       AP News
  Text of 1-5 words             2.5 seconds     1.9 seconds
  Text of 75-150 words          75 seconds      51 seconds
  Text of 400-500 words         100 seconds     93 seconds
  Text of 2000-3000 words       5.7 seconds     4.6 seconds

Looking at the results, it can be seen that processing speeds up for the largest inputs: fragments of 2000-3000 words are processed in a few seconds, whereas fragments of 75-500 words take tens of seconds. To verify this, the measurements were repeated several times with the same outcome, so it is safe to say that processing accelerates once the number of words grows beyond several hundred. It is not clear why such a decrease in processing time occurs as the number of words grows, since no mathematical description of the Doc2Vec family of algorithms could be found. The following assumption can be made: the semantic core itself occupies 2.2 GB, the Python server that processes client requests also takes a certain amount of memory, and the browser takes about 1.5 GB; testing was conducted on a laptop with 4 GB of RAM. When the project is launched for testing, it either fits completely into RAM (which does not correspond to what is observed in the task manager) or uses the memory completely and then starts using a paging file located on a slow HDD, which is no longer part of the RAM. The algorithmic cost of passing a text through the semantic core is linear, that is, O(n), where n is the number of words in the text. We can therefore conclude that, provided a sufficient amount of RAM is available, the processing speed of a client request should be approximately the same over fairly large ranges of input text length, that is, the comparison table could contain approximately the same numbers for each text size. However, given the lack of RAM, the runtime acquires errors that affect the results. That is, for a more complete picture, an external server should be rented to additionally verify the obtained results.

For the time being, it can be concluded that using the paragraph vectors method for semantic analysis of text, as a subtask of text processing problems such as plagiarism detection and text recognition, is appropriate. With this in mind, it can be assumed that the additional time allocated to semantic text processing will not greatly influence the overall text processing pipeline.
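A measurement of the kind reported in Table 1 could be scripted as follows; this is only a sketch under assumptions, since the pre-trained Wikipedia and AP News Doc2Vec models are loaded from hypothetical file names and the placeholder token lists merely approximate fragments of the stated lengths.

import time
from gensim.models.doc2vec import Doc2Vec

# Hypothetical file names for the pre-trained models (not distributed with this paper).
models = {
    "Wikipedia": Doc2Vec.load("enwiki_dbow.model"),
    "AP News": Doc2Vec.load("apnews_dbow.model"),
}

def measure(model, tokens, repeats=3):
    """Average wall-clock time to infer a vector and query the most similar documents."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        vec = model.infer_vector(tokens)
        model.dv.most_similar([vec], topn=10)
        total += time.perf_counter() - start
    return total / repeats

# Placeholder token lists that only approximate fragments of the stated lengths.
fragments = {
    "1-5 words": "semantic text analysis".split(),
    "75-150 words": ["word"] * 100,
    "400-500 words": ["word"] * 450,
    "2000-3000 words": ["word"] * 2500,
}

for model_name, model in models.items():
    for label, tokens in fragments.items():
        print(model_name, label, f"{measure(model, tokens):.1f} s")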
By virtue of the above, we can conclude that when analysing works of different sizes, the time spent on semantic text processing will be approximately the same. This, in turn, makes it possible, without significant time expenditure, to add extra semantic steps to the determination of the meaning of text documents and thereby increase the accuracy of the obtained results.

4 Conclusions

In the course of this work:

- the paragraph vectors model and its distributed memory and distributed bag-of-words methods were considered; the peculiarity of this approach is the definition of objective functions for individual sentences and their representation as local vectors, from which a global vector defining the semantic component of the text as a whole is constructed;
- various aspects of applying the distributed memory and distributed bag-of-words methods were examined;
- a set of algorithms for obtaining distributed vectors for parts of texts was studied for the problem of finding similar articles, where the search is carried out by keywords, by annotation, and by articles of different sizes;
- an experiment was conducted with the bag-of-words (DBOW) method of Doc2Vec, since it is this method that, in accordance with the review and the task, most fully allows borrowings and analogues to be identified with respect to the structural elements of the text;
- it was confirmed in practice that the Doc2Vec bag-of-words method allows the user to obtain an accurate picture of the lexical meaning of a word and of its semantic relations in language and texts.

As an overall conclusion, semantic analysis has high practical value for determining the meaning of text documents. A number of publications devoted to the paragraph vectors model and its distributed memory and distributed bag-of-words methods note that these methods work in a similar way but solve a different spectrum of problems. The study conducted here found that Word2Vec shows high results on very short texts, while Doc2Vec works more efficiently with longer messages; consequently, their working algorithms must differ. In general, the considered Word2Vec and Doc2Vec methods act in a similar fashion for semantic analysis and produce quite similar results, but they make it possible to solve different problems associated with semantic analysis. It should also be noted that Doc2Vec is rather poorly described in the literature and its mathematical apparatus cannot be judged in full, although the gensim library for Python provides the necessary set of algorithms.

It was also found that when analysing works of different sizes with Doc2Vec and the bag-of-words method, the time spent on semantic text processing is approximately the same. This makes it possible, without significant time expenditure, to add extra semantic steps to the determination of the meaning of text documents and thereby increase the accuracy of the obtained results. The obtained results make it possible to continue the work on analysing texts for textual borrowings and borrowed ideas, as well as on determining the authorship of a text with its paraphrasing taken into account.

References

1. Sitikhu, P., Pahi, K., Thapa, P., Shakya, S.: A Comparison of Semantic Similarity Methods for Maximum Human Interpretability.
   arXiv preprint arXiv:1910.09129 (2019).
2. Panigrahi, A., Simhadri, H.V., Bhattacharyya, C.: Word2Sense: Sparse Interpretable Word Embeddings. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5692-5705 (2019).
3. Kanishcheva, O., Cherednichenko, O., Sharonova, N.: Image Tag Core Generation. In: 1st International Workshop on Digital Content & Smart Multimedia (DCSMart 2019), CEUR Workshop Proceedings, vol. 1, pp. 35-44. Lviv, Ukraine (2019). http://ceur-ws.org/Vol-2533/preface.pdf
4. Gruzdo, I.: Overview and Analysis of Existing Decisions of Gothic. In: International Scientific and Practical Conference Problems of Infocommunications. Science and Technology (PIC S&T), pp. 645-653. Kharkiv, Ukraine (2018).
5. Vysotska, V., Lytvyn, V., Kovalchuk, V., Kubinska, S., Dilai, M., Rusyn, B., Pohreliuk, L., Chyrun, L., Chyrun, S., Brodyak, O.: Method of Similar Textual Content Selection Based on Thematic Information Retrieval. In: Proceedings of the International Conference on Computer Sciences and Information Technologies (CSIT), pp. 1-6 (2019).
6. Dai, A., Olah, Ch., Le, Q.: Document Embedding with Paragraph Vectors (2015). https://arxiv.org/abs/1507.07998
7. Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Chi, E.H.: Latent Cross: Making Use of Context in Recurrent Recommender Systems. In: WSDM'18, Marina Del Rey, CA, USA (2018).
8. Le, Q., Mikolov, T.: Distributed Representations of Sentences and Documents. In: 31st International Conference on Machine Learning, JMLR: W&CP, vol. 32(2), pp. 1188-1196. Beijing, China (2014).
9. Gladkiy, A.V.: The ideas of M.M. Bakhtin on utterance and dialogue and their significance for the formal semantics of natural language. In: Interactive Systems: Reports and Abstracts of the Third School-Seminar, vol. I, pp. 33-43. Metsniereba, Tbilisi (1981).
10. Golovina, E.A., Kolmychek, K.N., Terzian, V.N.: The principles of verifying the semantic correctness of natural language utterances. In: Problems of Bionics, no. 32, pp. 64-72. KSU, Kharkov (1984).
11. Golovina, E.A., Terzian, V.Ya.: Express analysis of natural language utterances. In: Interactive Systems: Materials of the Fifth School-Seminar, pp. 385-388. Metsniereba, Tbilisi (1983).
12. Apresyan, D.Yu.: Toward a formal model of semantics: rules for the interaction of meanings. In: Representation of Knowledge and Modeling of Understanding Processes, pp. 47-78. Computing Center of the Academy of Sciences of the USSR, Novosibirsk (1980).
13. Skorokhodko, D.F.: Semantic Networks and Automatic Text Processing. Naukova Dumka, Kiev (1983).
14. Terzian, V.Ya.: Theoretical and Experimental Study of the Problem of Semantic Analysis of Natural Language Utterances: Cand. Tech. Sci. dissertation, 05.13.01 "Technical Cybernetics and Information Theory". Kharkiv Institute of Radio Electronics, Kharkov (1984).
15. Harris, L.R.: Using a Data Base as a Semantic Component to Aid in the Parsing of Natural Language Data Base Queries. Journal of Cybernetics 10(1-3), 77-96 (1980).