Identification of the Author's Idea Based on the Modified TextRank Method Yuliia Hlavcheva, Olga Kanishcheva, Мaryna Vovk and Maksym Glavchev National technical University “KhPI”, 2 Kyrpychova str., Kharkiv, 61002, Ukraine Abstract Taking into account the significant rate of new information volume formation, including scientific, semantic analysis of the text continues to be a relevant area of research. The results of this paper on extraction the author's idea can be applied to identify features of intellectual plagiarism to promote academic integrity. To identify the author's idea as the main content of the text, the authors use semantic and graph-based methods. The paper proposes a method for identifying the author's idea based on the modified TextRank algorithm. This method takes into account the pronominal anaphoric connections between sentences, allows to form a more complete description of the semantic relationships between sentences in the text. An experiment was carried out on scientific texts in the Ukrainian language, which confirmed an increase in the number of semantic links between sentences in comparison with the simple TextRank algorithm, which affects the weight of sentences and their order in the abstract. Keywords 1 Extractive summarization, text summarization, TextRank, academic plagiarism, similarity of ideas, sentence similarity. 1. Introduction Identification of the author's idea is a complex scientific task. The author's idea is the main content of the text. Thus, the task of automatic abstracting is close to the task of identifying the author's idea. For the task of automatic abstracting it is necessary to obtain a brief description of the main content of the document. Natural language processing methods are widely used for text abstracting, which reduce the original amount of text compared to the input and highlight only important information from the original text. Different statistical, graph and deep learning methods are used for abstracting [1, 2]. Among the popular statistical methods are TF-IDF, TextRank, PageRank, Latent Dirichlet Allocation (LDA), and others. Articles [3, 4, 5] describe the use of graph methods. The document under study is represented as a graph in which the vertices are sentences and the arcs are the connections between sentences. The authors [3] used a modified PageRank algorithm. According to this algorithm, each node (sentence) of the graph has an initial score, which is formed based on the number of nouns in this sentence. According to the authors [3], more nouns in a sentence mean that it contains more information, so it is the nouns that are used as the initial rank of the sentence. The summary of the above modified PageRank algorithm includes sentences that contain the most information and are well semantically related. The authors proposed and investigated the use of the modified TextRank method in the article [5]. It is based on the PageRank algorithm. The proposed method forms a graph with vertices-sentences, which takes into account the similarities between the two sentences. Modified inverse sentence COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine EMAIL: yuliia.hlavcheva@khpi.edu.ua (Yu. Hlavcheva); kanichshevaolga@gmail.com (O. Kanishcheva); Maryna.Vovk@khpi.edu.ua (М. Vovk); maksym.glavchev@khpi.edu.ua (M. Glavchev) ORCID: 0000-0001-7991-5411 (Yu. Hlavcheva); 0000-0002-9035-1765 (O. Kanishcheva); 0000-0003-4119-5441 (М. Vovk); 0000-0001- 9670-9118 (M. Glavchev) ©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) frequency-cosine similarity is used to give different weightage to different words in the sentence, whereas traditional cosine similarity treats the words equally. The graph is formed sparsely and is divided into different clusters from the position that the sentences in the cluster are similar to each other, and the sentences from different clusters have differences. The authors [5] demonstrate the effectiveness of the proposed method of abstracting. Comparative analysis of TF-IDF, TextRank, Latent Dirichlet Allocation methods is presented in the publication [4]. To determine the best of them, the study was conducted on three different data sets. To evaluate the performance of the methods, the F-measure indicator and ROUGE and Recall indicators were used as a criterion of accuracy. In general, TextRank showed better results compared to TF-IDF and LDA. A detailed review of abstracting technologies is described in review publications [4, 6, 7, 8]. Although each new study increases the quality of the results, there are many aspects that affect the quality of the abstract and which are not taken into account in modern methods of automatic abstracting. The review [6] indicates that without the use of natural language processing methods in the generated summary, the semantic integrity of the content may be violated, it may be unbalanced. The authors of the review note that one of the important points of the study is to determine the optimal correct weight of individual elements (vertices (sentences), arcs (connections between sentences)). It depends on the quality of the summary. In general, all abstraction methods are divided into two major classes: Extractive Summarization and Abstractive Summarization [6, 9, 10]. To determine the elements of academic plagiarism (similarity of ideas), it makes sens to focus on the study of extractive methods of abstracting. Their main task is to extract the most significant fragments of the input text. Abstract methods also form an abstract based on grammar, semantic rules, etc., which allows the program to generate new text other than the input. Thus, the class of abstraction methods is not suitable for determining the signs of academic plagiarism, because it makes additional "noise" when forming a new text. Text abstracting of high quality is characterized by the following components:  relevance – the abstract contains the most important and relevant information, the selected sentences should be closely related to the main content of the source;  content coverage – the abstract should cover as many important aspects of the source document and should minimize the loss of information in the review process;  variety – the abstract should be short and contain as few minor ("extra") sentences, ie two sentences with the same meaning should not be selected when forming the abstract;  resume length – usually determined by the user. In our case, the optimal length of the abstract depends on the text characteristics (type, size, etc.) and can be determined by experiments. Thus, to create an abstract, it is necessary to select a subset of sentences from each studied academic document so that the created abstract contains the main idea of the document and meets the above requirements. Each researched academic document is divided into a subset of sentences. The weight of a sentence is determined based on the analysis of the words used in it. In this case, the same sets of words are used in several sentences from the document. This feature affects the weight of words, sentences, connections between sentences and must be taken into account when forming an abstract. In this paper, we consider the application of automatic abstracting algorithms for the problem of identifying the author's idea. This is a relevant area, as it can be used to detect elements of academic plagiarism. After all, "academic plagiarism is not reduced to textual coincidences, but may also relate to incorrect borrowing of facts, hypotheses, numerical data, methods, illustrations, formulas, models, program codes, etc.", including ideas [11]. The most difficult to detect is the plagiarism of ideas, or intellectual (hidden) academic plagiarism. 2. Modification of the TextRank algorithm taking into account the pronoun anaphoric connections The authors proposed a method of identifying the author's idea based on a modified method (algorithm) TextRank. This method takes into account the pronoun anaphoric connections between sentences, which allows forming a more complete description of the semantic connections between sentences in the text. 2.1. Modified graph model TextRank As a basic model, the graph model TextRank was used to obtain an abstract of the document proposed in the article [12]. Here is a brief description. We have an undirected graph of the document: 𝐺 (𝑑 ) = 𝐺 (𝑉, 𝐸, 𝑓, 𝑒), where 𝑉 ‒ nodes / vertices of the graph, 𝐸 ‒ the edges of the graph, 𝑓 ‒ knot weight, 𝑒 ‒ edge weight. The elements of the undirected graph have the following content: 𝑉 ‒ sentence, 𝐸 ‒ the relationship between sentences, 𝑒 ‒ the value of semantic similarity between sentences, and 𝑓 ‒ the weight of the node, which is calculated by the principles of the PageRank algorithm [13, 14]. We describe by the formula the value of the weight of the arc between the two vertices i and j of the graph G (d): 𝑒𝑖𝑗 , < 𝑖, 𝑗 >∈ 𝑒 𝑒𝑖𝑗 = { , 0, < 𝑖, 𝑗 >∉ 𝑒 where 𝑒𝑖𝑗 ‒ the value of the arc weight, i, j ‒ vertices of an undirected graph 𝐺(𝑑). If there is a connection, the weight of the arc is used in the calculations, and if the sentences are not connected, then this value is zero. In the original work [12], the relationship between two sentences is defined as the number of common tokens between the lexical representations of two sentences. In this study, the connection is calculated taking into account the greater number of semantic connections based on anaphoric links between sentences. In order to avoid the dominance of long sentences, the normalization factor is used. It determines the ratio of the number of common terms to the length of each sentence [12]. Thus, if we have two sentences 𝑆𝑖 and 𝑆𝑗 , and 𝑁𝑖 is a set of words in a sentence, then the sentence will look like this: 𝑖 𝑆𝑖 = 𝑊1𝑖 , 𝑊2𝑖 , … , 𝑊𝑁𝑖 . According to the studied modified method, the similarity of 𝑆𝑖 and 𝑆𝑗 is defined as: |𝑊𝑘 |𝑊𝐾 ∈ 𝑆𝑖 &𝑊𝑘 ∈ 𝑆𝑗 | 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦(𝑆𝑖 , 𝑆𝑗 ) = + 𝑊𝑎 , log(|𝑆𝑖 |) + log(|𝑆𝑗 |) where 𝑊𝑎 – coefficient that reflects the semantic similarity of sentences based on anaphoric references between sentences. Thus, the result is an undirected graph that has weighted arcs and weighted nodes. Schematically, it is presented in the figure 1. 1 8 2 Si 7 3 eij Sj 6 4 5 Figure 1: Schematic representation of the document graph To weigh the graph nodes in the TextRank algorithm, an iterative approach is used, according to the PageRank algorithm [12, 13]. In the first iteration, the nodes are assigned random numbers (0-10 is recommended). Thus, the graph consists of sentences with random weights connected by edges. The weight of the edges depends on their similarity. The formula for calculating the weight of the node rank is as follows: 𝑊𝑗𝑖 𝑊𝑆(𝑉𝑖 ) = (1 − 𝑑) + 𝑑 ∗ ∑𝑉𝑗∈𝐼𝑛(𝑉𝑖) ∑ 𝑊𝑆(𝑉𝑗 ), 𝑊𝑗𝑘 𝑉𝑘 ∈𝑂𝑢𝑡(𝑉𝑗) where 𝑊𝑆 ‒ sentence weight (𝑉𝑖 ), 𝑑 – damping factor(𝑑) = 0,85. The damping factor is a value calculated by Google engineers in their own PageRank system. It ensures that the weights of their nodes are ultimately reduced to a single value. The damping factor can be any from 0 to 1, but 0.85 is generally recommended [13]. At each subsequent iteration, a new calculated rank of one sentence is used the calculation must be repeated for each sentence. The "true" weight of each sentence is found with each subsequent iteration. The recalculation stops when the "acceptable error rate" reaches the accepted value (TextRank uses the "acceptable error rate" of 0.0001) [13]. This error rate is calculated by subtracting the weight of the sentence before and after the recalculation for each sentence. When the value is 0.0001, the scales are close enough to their "true" weight. The 𝑡𝑓-𝑖d𝑓 metric and the cosine measure were used to implement the TextRank method. We form a term-sentence model and a sentence similarity matrix, namely a content-graph join model. Let the input document be represented by a set of sentences 𝐷 = {𝑠1 , 𝑠2 , … , 𝑠𝑛 }, where 𝑛 denotes the number of sentences, 𝑠𝑖 denotes the 𝑖-th sentence in D. In order to form the matrices "sentence-term" and "sentence similarity", each of the sentences should be represented as a vector. A standard Vector Space Model (SVM) is used using the "Bag of Words" approach, which represents the text units of a document as vectors in a single vector space. To weigh and determine the weight of terms, special metrics are usually used based on any important property. The most common and popular of them is the metric 𝑡𝑓-𝑖d𝑓 [4]. This metric combines local and global term weighting: 𝑛 𝑊𝑖𝑣 = 𝑡𝑓𝑖𝑣 × 𝑖𝑠𝑓𝑖𝑣 = 𝑓𝑟𝑒𝑔𝑖𝑣 × 𝑙𝑜𝑔 ( ) , 𝑛𝑣 where weight 𝑊 of the term 𝑡𝑣 in the sentence s is determined by multiplying the term weight 𝑡𝑣 in the sentence 𝑠𝑣 and the total weight of the term 𝑡𝑣 . According to VSM, the sentence is represented as a weight vector 𝑆𝑖 = [𝑤𝑖1 , 𝑤𝑖2 , … , 𝑤𝑖𝑚 ], where 𝑤𝑖𝑣 – term weight 𝑡 in the sentence 𝑠𝑖 . The calculation of the angle between two vectors 𝑆𝑖 = [𝑤𝑖1 , 𝑤𝑖2 , … , 𝑤𝑖𝑚 ] and 𝑆𝑗 = [𝑤𝑗1 , 𝑤𝑗2 , … , 𝑤𝑗𝑚 ] can be obtained as an Euclidean product: (𝑠𝑖 , 𝑠𝑗 ) = |𝑠𝑖 | × |𝑠𝑗 | × cos 𝛼. Thus, the degree of closeness between two sentences is calculated as: (𝑠𝑖 , 𝑠𝑗 ) ∑𝑚 𝑙=1 𝑤𝑖𝑙 𝑤𝑗𝑙 𝑠𝑖𝑚(𝑠𝑖 , 𝑠𝑗 ) = cos 𝛼 = = , 𝑖, 𝑗 = 1,2, … , 𝑛. |𝑠𝑖 | × |𝑠𝑗 | 𝑚 2 𝑚 2 ∑ ∑ √ 𝑙=1 𝑤𝑖𝑙 × 𝑙=1 𝑤𝑗𝑙 The resulting matrix of "sentence similarity" describes the similarity of the presented sentences as objects in Euclidean space. The columns and rows of the matrix are sentences, and their intersection displays the value of the similarity of sentences. The abstract formation is based on the ranking of sentences. The ranking of sentences in the abstract is based on the nodes weight. The abstract is usually much smaller than the main text. The number of sentences selected for the abstract depends on the user's settings, namely the desired length. 2.2. Defining semantic relations between sentences based on pronoun anaphora Analysis of scientific texts has shown that they contain semantic relations between sentences of different types. Semantic relations can be divided into three varieties [15]:  texts with parallel (remote) relations;  texts with serial (chain) relations;  texts with connecting links or with a combined relation. In parallel relations, the sentences are equal. Parallel relation is the use of sentences in which the same word order, the same grammatical forms of the sentence members. The main means of implementing parallel relations is syntactic parallelism. This is when the same or similar construction of sentences, which is often expressed in the same sequence of words, and the unity of temporal forms of verbs-predicates (predicates) [15]. Example of using parallel communication in scientific texts: «Ступінь розвитку Web-простору буде визначатися технологіями роботи з величезним обсягом інформації, що накопились в Інтернет. Web наступного покоління буде характеризуватися переходом від мережі документів до мережі даних, що при необхідності агрегуються в семантично зв'язані документи за допомогою Web-сервісів». ("The degree of Web-space development will be determined by the technology of working with a huge amount of information accumulated on the Internet. The next-generation Web will be characterized by the transition from a network of documents to a network of data, which, if necessary, are aggregated into semantically related documents using Web-services".) Serial or chain relations exist because the complement of the previous sentence becomes the subject in the next sentence. The structural form of this relation is as follows: "complement-subject". Other models of the sentence structure are also widely used: "subject-complement", "complement- object", "subject-subject". The syntactic essence of the chain relation is expressed in these syntactic models, in the syntactic relations between neighboring members of the sentence. This is the internal, structural side of the chain relation. There are ways to embody syntactic relations in a serial relation [15]: lexical repetition, synonyms, indicative words, personal pronouns, pronoun adverbs, conjunctions, verbal omission, etc. According to mentioned above, we can distinguish: 1) chain relation by lexical repetition, 2) chain synonymous relation, 3) chain pronoun relation. Pronouns are thought to combine sentences more closely than repetition or synonymous vocabulary. Chain pronouns are extremely diverse. Personal pronouns, personal pronouns in the sense of possessive and actually possessive, indicative pronouns take part in their organization [16]. Some styles of chain relations are especially distinctive to scientific texts. In academic works, we observe a clear sequence and close relation of separate parts of the text, separate sentences, where each subsequent one is connected with the previous one. Presenting the material, the author consistently moves from one stage of reasoning to another. And this method is most consistent with the chain relation. Thus, we distinguish the following means of relation: similar words, pronouns, adverbs, numerals, and other means, repetition of words, words that indicate the sequence of content development. Based on this, the following types of anaphora are distinguished: pronoun, noun, adverb, and zero [17]. One of the most commonly used relations is a relation between the anaphoric pronoun and the antecedent. "Anaphoric pronouns are pronouns that refer to some word or phrase (antecedent) of this text, the semantic meaning of which they reflect" [18]. For example, «Найбільш вираженим у плані динаміки є, безперечно, сегмент інформації у вигляді новин. З одного боку, він має найвищий рівень оновлення, а з іншого боку - у ньому генеруються і поширюються насправді великі обсяги даних». ("The most pronounced in terms of dynamics is, of course, the segment of information in the form of news. On the one hand, it has the highest level of updating, and on the other hand - it generates and distributes really large amounts of data.”) We single out personal anaphoric pronouns for research. third-person pronouns are most often present in scientific texts of all personal pronouns. Thus, we investigate the definition of anaphoric relation by the following selected types of constructions. The construction of the first type with the noun in the singular with identical features for identification: «Фінансово-економічна криза має протилежні форми прояву у суспільстві. З одного боку, вона оновлює механізми господарювання, а з іншого призводить до зростання соціальної напруженості у суспільстві». ("The financial and economic crisis has opposite forms of manifestation in society. On the one hand, it renews the mechanisms of management, and on the other hand, it leads to an increase in social tensions in society.") The antecedent "криза" and the anaphora "вона" have identical morphological characteristics: singular, feminine, nominative case. The construction of the second type with a noun in the singular with distinctive features for identification: «Експериментальна робота проводилася протягом 2006-2010 років. Основними завданнями її було визначення структурної моделі управлінської компетентності для інженерів-керівників електромашинобудування». (“Experimental work was carried out during 2006-2010. Its main tasks were to determine the structural model of managerial competence for engineers-managers of electrical engineering.”) The antecedent "робота" and the anaphora "її" have identical morphological characteristics: singular, feminine. The difference for this pair is the grammatical case. The construction of the third type with a noun in the plural with identical features for identification: «Це означає, що на наших ринках господарюють зарубіжні компанії. І саме вони, за наш рахунок, вкладають гроші у власний розвиток науки, техніки, створюють додаткові робочі місця». (“This means that our markets are managed by foreign companies. And they, at our expense, invest money in their own development of science and technology, create additional jobs.") The antecedent "компанії" and the anaphora "вони" have identical morphological characteristics: plural, nominative case. The construction of the fourth type with nouns in the plural with signs for identification that differ: «Процеси докорінної зміни соціально-виробничих відносин, що відбуваються в нашому суспільстві, не обминають і технічні університети. Суспільство вимагає від них підготовки компетентних фахівців у спеціальній і психолого-педагогічній галузях, зокрема сформованості управлінської компетентності». (“The processes of the radical change of social and industrial relations taking place in our society do not bypass technical universities. Society requires them to train competent specialists in special and psychological and pedagogical fields, in particular the formation of managerial competence.”) The antecedent "університети" and the anaphora "них" have the same characteristic, they are plural and differ in cases. When solving an anaphora, it is important to identify, based on syntactic and morphological information, the characteristic features by which the antecedent is identified in relation to the anaphora. Antecedent and anaphora should have similar characteristics. 90% of antecedents are in the same or previous sentence with anaphora [19]. In our case, the anaphora and the antecedent must be in different sentences. To solve the anaphora (identification of the pair anaphora - antecedent) different methods are used: system analysis, construction of the classifier, machine learning algorithms, and others. Approaches to solving this problem are described in the following publications [17, 20, 21, 22]. The authors of this article conducted a study of the solution of anaphoric references for structures of the first type. The article [23] describes the process of identification of the pair anaphora - antecedent. Using the mathematical apparatus of the algebra of finite predicates, a logical network is constructed to identify an anaphoric relation based on seven features. To improve the process of determining the semantic similarity of sentences informative features are optimized. To identify the chain relation for all of the above types of structures four features were used. They are listed in Table 1. Table 1 Signs of a chain relation № Name of attribute Notation Attribute value 1 Part of speech is a noun X1 = {x11 , x12 , x13 } x11 − noun x12 – other part of speech x13 − not specified 2 Grammatical gender X2 = {x12 , x22 , x23 , x24 } x12 − masculine gender x22 − feminine gender x23 – neutral gender x24 − not specified 3 Number X3 = {x13 , x32 , x33 } x13 − singular, x32 – plural, x33 − not specified № Name of attribute Notation Attribute value 4 Grammatical case X4 = {x14 , x42 , x43 , x44 , x45 , x46 , x47 } x14 – Nominative, x42 – Genitive, x43 – Dative, x44 – Accusative, x45 – Ablative, x46 – Prepositional, x47– not specified The text is a set of sentences. In the course of our algorithm, first, every two consecutive sentences (in pairs) are checked for the presence of a semantic relation. It should be remembered that the anaphora is in the second sentence of the pair, and the antecedent is in the first one. To determine the potential anaphora in the second sentence of the pair is a search for personal pronouns «він», «вона», «воно», «вони» ("he", "she", "it", "they"), and their declension forms. If there is a potential anaphora, then to determine the potential antecedent in the first of a couple of sentences search for words related to nouns (𝑋1 ). For potential anaphora and antecedent, morphological features (𝑋2 , 𝑋3 , 𝑋4 ) are determined and checked for compliance. If the morphological features coincide (according to a certain type of construction) then the anaphora-antecedent pair is identified and the semantic relation is confirmed. For example: «<…> 1. Оцінювання ефективності зовнішньої реклами – не пряме і досить складне. (Evaluating the effectiveness of outdoor advertising is not direct and quite complex.) 2. Воно виконується шляхом визначення кількості потенційних рекламних контактів через оцінювання потенційної аудиторії конкретного місця знаходження реклами.( It is performed by determining the number of potential advertising contacts through the evaluation of the potential audience of a particular location of advertising.) <…>». In the second of a couple of sentences, the third-person pronoun for the role of a potential anaphora is identified as «воно» ("it"). The first sentence in a pair identifies nouns for the role of potential antecedents – «реклами», ("advertising"), «оцінювання» ("evaluation"). After determining and comparing morphological features, we can identify a pair of anaphora – antecedent: «воно» – «оцінювання» ("it" is "evaluation"). Therefore, we have a confirmed semantic relation between these sentences. For the description of our model, we used the mathematical apparatus of the algebra of finite predicates and comparative identification. This mathematics was used to describe the process of chain identification based of these features. Thus, semantic relations are determined based on anaphoric references, which could not be detected based on tf-idf. Newly defined semantic relations are taken into account when forming the "sentence similarity" matrix, and when calculating the value of the relation between sentences. 3. Experiments and Results The research is conducted on our own scientific text corpus in the Ukrainian language. The source of data is the repository of the National Technical University "Kharkiv Polytechnic Institute" (http://repository.kpi.kharkov.ua) and the portal of scientific publications of the National Technical University "Lviv Polytechnic" (http://science.lpnu.ua/uk). The total number of authors is 32, the number of documents (individual publications) is 271 (Table 2). 5.5% of the total words in the text corpus are pronouns (23255 out of 415565). These data indicate the frequent use of pronouns. Anaphoric pronouns most often include personal pronouns of the third- person, indicative, inverse, relative pronouns [24, 25]. The presence of pronouns in the Ukrainian language documents of the text corpus is presented in Table 3. In this research, we analyze not all document genres, but only scientific. For scientific texts, some style of chain links is especially characteristic. In academic works, we meet with a clear sequence and close connection of separate parts of the text, separate sentences, where each subsequent one is connected with the previous one. Presenting the material, the author sequentially moves from one stage of reasoning to another. In addition, in this way it is most consistent with the chain link. Table 2 The main statistical corpus indicators Name of attribute Attribute value Number of authors 32 Number of documents 271 Total size (tokens) 415565 The average number of tokens in the document 1533 The total number of sentences 24743 The average number of sentences in a document 91 The total number of pronouns 23255 The average number of pronouns in the document 86 Table 3 Types of pronouns and their number (text corpus, Ukrainian language) Type of pronoun Number Indicative 12581 Personal 5319 Relative 4322 Defining 1025 Appropriative 8 Another feature of the structure of the scientific language is that the chain connection of sentences is carried out, as a rule, at the place of their connection. It is especially important to emphasize the position of the repeating member of the sentence at the beginning of the next sentence. Thanks to this, continuity and consistency of reasoning is achieved. Each time at the beginning of a new sentence, the opinion seems to return to the main element of the previous sentence, which becomes the starting point for the development of thought in a new sentence. Figure 2 shows how many pronouns are contained in our corpus, which allow us to organize the connection between sentences and continue the author's idea. It 2 shows the most common pronouns (TOP-10) which were found in our corpus. Pronouns from the TOP-10 make up 40% of the total number of pronouns. Figure 2: The most common pronouns (TOP-10) Let us consider examples of sentences from our corpus, between which semantic relations were established using the algebra of finite predicates and comparative identification. «Електроенергія є основною галуззю національної економіки, стабільність якої має особливе значення для розвитку країни. Вона впливає не тільки на розвиток національної економіки, але і на територіальну організацію продуктивних сил». ("Electricity is the main sector of the national economy, the stability of which is of particular importance for the development of the country. It affects not only the development of the national economy, but also the territorial organization of productive forces”.) Our method identified the pronoun "Вона" and found the corresponding antecedent "Електроенергія". Experiments have shown that semantic relationships were found in our corpus using the pronoun anaphora with an accuracy of 96%. Cases, where the model doesn’t work, are related to the syntactic features of the sentences, errors in the morphological analyzer, and the distance between the anaphora and the antecedent. Let's analyze a fragment of the original article (Doc 1), which consists of about 1463 words and 103 sentences. The fragment contains 58 third-person pronouns in Doc 1. The statistics are presented in Table 4. Table 4 Statistics of third-person pronouns in Document 1 Pronoun Document 1 Pronoun Document 1 Pronoun Document 1 нього (Genitive or вона (she) 3 йому (him) 1 1 Accusative case of he) нею (Ablative case ній (Prep. case вони (they) 3 1 of she) 2 of she) неї (Genitive or їх (Genitive or воно (it) 3 Accusative case of 1 Accusative 10 she) case of they) ними (Ablative case він (he) 4 2 її (her) 11 of they) них (Genitive, його (his) 6 Accusative or Prep. 10 case of they) 16 of the total number of pronouns belong to our model. Thus, according to the overall result, the value of semantic similarity increases for 16 pairs of sentences. The number of defined sentence pairs with anaphoric reference (additional semantic relations) to the total number of sentences in the text for 5 documents is presented in Table 5. Table 5 Statistics of sentence pairs with anaphoric references to the total number of sentences Number of third Pairs of sentences Number of Number of person pronouns with a solved Document words sentences anaphora Document 1 1463 103 58 16 Document 2 2602 580 55 7 Document 3 1516 195 32 8 Document 4 2605 706 39 5 Document 5 1650 314 45 18 Although it can be seen from the Table 5 that there are not so many such pairs of sentences in relation to the total number of sentences, however, these connections qualitatively affect the determination of the similarity of two sentences, which improves the process of obtaining an abstract of the document. In this paper, we proposed method identified semantic relations between sentences that were not identified by other methods. This affects the ranking of sentences in the abstract. For the final and detailed analysis of the results and their comparison with existing methods, it is planned to prepare a special text corpus. Calculating the accuracy of the method of the author's idea determining also requires a special corpus, which will contain not only texts but also their main abstracts (ideas). Due to the current lack of such a corpus, this is a task for further research. 4. Conclusions and Recommendations The task of automatic abstracting is still relevant and not completely solved. In our work, we believe that the abstract demonstrates the main content of the author's text and doesn’t contain "noise", because it consists of the most important sentences of the text. Therefore, the proposed method can be used to identify features of intellectual latent plagiarism and provide the expert with additional information about the "idea" of the text for analysis. The presented modified TextRank method takes into account anaphoric references with third- person pronouns. This is an important aspect to determine all the semantic relations between sentences in the text when forming an abstract. In future works it is planned to determine the value of the modified TextRank method accuracy. To do this, a special text corpus will be prepared, focused on this task. Thus, it can be concluded that the modification of the TextRank graph method described in the article allows obtaining a document abstract, which takes into account a greater number of semantic relations between sentences, compared to the simple TextRank method. Due to the solution of anaphora, the use of predicate algebra and predicate operations has demonstrated a successful application for determining the pronoun anaphora within the TextRank method. 5. References [1] P. G. Magdum, S. Rathi, A Survey on Deep Learning-Based Automatic Text Summarization Models, in: Chiplunkar N., Fukao T. (Eds.), Advances in Artificial Intelligence and Data Engineering. Advances in Intelligent Systems and Computing, Springer, Singapore, vol 1133, 2021. doi:10.1007/978-981-15-3514-7_30. [2] Kouris Panagiotis, Georgios Alexandridis, Andreas Stafylopatis, Abstractive Text Summarization Based on Deep Learning and Semantic Content Generalization, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistic, 2019, pp. 5082– 5092. [3] Elbarougy Reda, Gamal Behery, Akram El Khatib, Extractive arabic text summarization using modified PageRank algorithm, Egyptian Informatics Journal 21(2) (2020) 73–81. [4] Rani Ujjwal, Karambir Bidhan, Comparative Assessment of Extractive Summarization: TextRank, TF-IDF and LDA, Journal of Scientific Research 65(1) (2021). doi:10.37398/JSR.2021.650140. [5] Mallick Chirantana, Ajit Kumar Das, Madhurima Dutta, Asit Kumar Das, Apurba Sarkar, Graph- based text summarization using modified TextRank, in: Soft computing in data analytics, Springer, Singapore, 2019, pp. 137–146. doi:10.1007/978-981-13-0514-6_14. [6] Gupta Vishal, Gurpreet Singh Lehal, A survey of text summarization extractive techniques, Journal of emerging technologies in web intelligence 2(3) (2010) 258–268. doi:10.4304/jetwi.2.3.258-268. [7] Tas Oguzhan, Farzad Kiyani, A survey automatic text summarization, PressAcademia Procedia 5(1) (2007) 205–213. [8] Mahajani Abhishek, Vinay Pandya, Isaac Maria, Deepak Sharma, A comprehensive survey on extractive and abstractive techniques for text summarization, Ambient Communications and Computer Systems 904 (2019) 339–351. doi:10.1007/978-981-13-5934-7_31. [9] B. S. Prakash, K. V. Sanjeev, R. Prakash, K. Chandrasekaran, M. V. Rathnamma, V. V. Ramana, Review of Techniques for Automatic Text Summarization, in: Proceedings of the Third International Conference on Computational Intelligence and Informatics, Springer, Singapore, 2020, pp. 557–565. doi:10.1007/978-981-15-1480-7_47. [10] V. Soni, L. Kumar, A. K. Singh, M. Kumar, Text Summarization: An Extractive Approach, Soft Computing: Theories and Applications, Springer, Singapore, 2020, pp. 629–637. doi:10.1007/978-981-15-4032-5_57. [11] Do pytannya unyknennya problem i pomylok u praktykakh zabezpechennya akademichnoyi dobrochesnosti, Lyst Ministerstva osvity i nauky Ukrayiny vid 20.05.2020 № 1/9-263, 2020. URL: https://mon.gov.ua/ua/npa/do-pitannya-uniknennya-problem-i-pomilok-u-praktikah- zabezpechennya-akademichnoyi-dobrochesnosti. [12] R. Mihalcea, Graph-based ranking algorithms for sentence extraction, applied to text summarization, in: Proceedings of the ACL interactive poster and demonstration sessions, 2004, pp. 170–173. [13] R. Mihalcea, P. Tarau, Textrank: Bringing order into text, in: Proceedings of the 2004 conference on empirical methods in natural language processing, 2004, pp. 404–411. [14] M. Bianchini, M. Gori, F. Scarselli, Inside pagerank, ACM Transactions on Internet Technology (TOIT) 5(1) (2005) 92–128. [15] A. I. Vavilenkova, Syntez lohiko-linhvistychnykh modeley rechen' pryrodnoyi movy yak zasib pobudovy zmistovnoyi modeli tekstu, Systemy pidtrymky pryynyattya rishen', Teoriya i praktyka, Kyyiv, IPMMS NANU, 2013, pp. 49–51. [16] Ukrayins'kyy pravopys. URL: http://pravopys.mova.info/pravopys.aspx?SectionID=1457. [17] T. H. Voznyuk, Pobudova klasyfikatora dlya vyrishennya zaymennykovoyi anafory na osnovi tenzornoyi modeli, Visnyk Kyyivs'koho natsional'noho universytetu imeni Tarasa Shevchenka, Seriya: Fizyko-matematychni nauky 2 (2015) 113–116. [18] T. H. Voznyuk, Zastosuvannya keruyuchoho prostoru syntaksychnykh struktur pryrodnomovnykh tekstiv dlya vyrishennya problemy anafory, Visnyk Kyyivs'koho natsional'noho universytetu imeni Tarasa Shevchenka, Seriya: Fizyko-matematychni nauky 2 (2014) 100–103. [19] J. R. Hobbs, Resolving pronoun references, Lingua 44(4) (1978) 311–338. [20] Rhea Sukthanker, Soujanya Poria, Erik Cambria, Ramkumar Thirunavukarasu, Anaphora and coreference resolution: A review, Information Fusion 59 (2020) 139-162. doi:10.1016/j.inffus.2020.01.010. [21] V. Yu. Dudnyk, Vykorystannya systemnoho analizu dlya rozv'yazku anafory pryrodomovnykh tekstiv dlya ukrayins'koyi movy, Naukovi notatky 65 (2019) 67–73. [22] N. Herysh, Semantychni vidnoshennya v anaforychnykh konstruktsiyakh imennyk–zaymennyk, Naukovyy visnyk Skhidnoyevropeys'koho natsional'noho universytetu imeni Lesi Ukrayinky, Filolohichni nauky, Movoznavstvo 5 (2014) 151–155. [23] Y. Hlavcheva, O. Kanishcheva, M. Vovk, Identifying Semantic Relations Between Sentences by Solving an Anaphora, Information Processing Systems 3(162) (2020) 36–43. doi:10.30748/soi.2020.162.04. [24] O. A. Bakun, Sehmentni konstruktsiyi z nazyvnym uyavlennya: semantychnyy analiz, Naukovyy chasopys Natsional'noho pedahohichnoho universytetu imeni M. P. Drahomanova, Seriya 10: Problemy hramatyky i leksykolohiyi ukrayins'koyi movy 7 (2011) 118–122. [25] Lynhvystycheskyy entsyklopedycheskyy slovar', in: V. N. Yartseva (Ed.), 2nd. ed., Moskva, 2002.