=Paper=
{{Paper
|id=Vol-2870/paper17
|storemode=property
|title=Automation of Template Formation to Identify the Structure of Natural Language Documents
|pdfUrl=https://ceur-ws.org/Vol-2870/paper17.pdf
|volume=Vol-2870
|authors=Olena Kuropiatnyk,Viktor Shynkarenko
|dblpUrl=https://dblp.org/rec/conf/colins/KuropiatnykS21
}}
==Automation of Template Formation to Identify the Structure of Natural Language Documents==
<pdf width="1500px">https://ceur-ws.org/Vol-2870/paper17.pdf</pdf>
<pre>
Automation of Template Formation to Identify the Structure of
Natural Language Documents
Olena Kuropiatnyk and Viktor Shynkarenko
  Dnipro National University of Railway Transport named after Academician V. Lazaryan,
  2, Lazaryana str., Dnipro, 49010, Ukraine


                 Abstract
                 In the task of text borrowings and plagiarism detection, it is important to take into account
                 the structure of the document. This allows getting a more accurate assessment of the text and
                 reducing the volume of material for comparison. Using a template allows identifying the
                 structure of the document. The paper presents a constructive synthesizing model for
                 automating the construction of a structural template of a document. Possible implementations
                 of some algorithms by means of programming in C# are considered. Their comparative
                 assessment is performed. Possible modification of the template is presented to increase the
                 importance of keywords and simplify the xml-tree, which is a template.

                 Keywords 1
                 Natural language, document comparison, plagiarism detection, document structure, document
                 template, constructive-synthesizing modeling, constructor

1. Introduction
    A document checking for matches is actual in the tasks of plagiarism detection, rewriting,
information retrieval, etc. when a document processing to detect matches and text borrowings, the
structure of the document has to be taken into account. This is due to the fact that the selection of the
comparison base for the content of the section will not only reduce the probability and number of
random matches at the level of words and separated phrases, but also reduce the volume of materials
to be compared. In addition, part of the text borrowings may be admissible given that the given
material is necessary for the completeness of the image and the author of the submitted for
verification document does not claim authorship of this part. Such borrowings can be found, for
example, in articles, abstracts, parts of qualifying works, which have a review character.
    A special method has been developed to compare documents taking into account their structure by
means of constructive modeling [1]. It involves comparing documents according to the structural
template and a constructive-synthesizing model of a graph representation of the text [2]. This idea is
developed in the paper: the issues of automation of template construction with the possibility of
editing and append new structural elements are considered.

2. Goal and questions of the research
   The goal of this research is modeling and automatizing of the process of forming a structural
template of a document based on a selection of natural language digital documents. This template will
be used to select information for comparing text documents to detect borrowings.
   Research questions:
       identification elements that indicate digital representation of a document structure;

COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: olena.kuropiatnyk@gmail.com (O. Kuropiatnyk); shinkarenko_vi@ua.fm (V. Shynkarenko);
ORCID: 0000-0003-2286-884x (O. Kuropiatnyk); 0000-0001-8738-7225 (V. Shynkarenko);
                 2021 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)
      determination the main components of the process of forming a structural template of the
   document;
      development of a constructive-synthesizing model of structural template forming;
      software implementation of the constructive-synthesizing model of structural template
   forming for automation of template formation to identify the structure of natural language
   documents.

3. Methods
    Means of constructive-synthesizing modeling [2] are used for solve the questions of the research in
this paper. The modeling is based on the apparatus of formal languages and grammars. Usage a
formalized object of constructor and methods of its transformation allows creating the model of
structural template forming and automate corresponding processes. Methodology of object oriented
modeling and programming, methods of processing xml-nodes and their attributes which
implementation in C# and components from namespace Microsoft.Office.Interop.Word are used for
software implementation of the constructive-synthesizing model of structural template forming.
Method based on SR-estimation (described in this work) is used for evaluation the some algorithms
time efficiency of developed model software implementation.

4. Definition of structural features of the document. Review of related works
    To compare documents taking into account their structure, let’s define the concept of a structured
document, a structural element and the features by which it can be determined.
    Structured digital documents are documents represented by doc / docx files format that have a
logical structure in content and the appropriate formatting. Logical structure determinates sections and
subsections order [1]. Structure items are units. They can be identified by the following features:
        formatting. There are use appropriate styles: headings or other styles that indicate a non-zero
    paragraph level;
        title and alternative titles;
        sub-units with corresponding names and attributes format;
        keywords in the unit text.
    The simplest identification of the structure in terms of software implementation is the use of the
first two features. The first three have already been partially used by the authors earlier [1]. But given
the possible paraphrasing of the titles will be more accurate with using the keywords.
    The process of finding keywords consists of three stages: pre-processing (depending on the text
language), search (extraction) of keywords, get a list of words in the format required by the user
(possible reduction of the sample) [3].
    The tasks of pre-processing include the implementation of actions:
        "cleaning" of the text are the mechanical actions that do not require knowledge of the
    language and understanding of the text: removal of hidden and white text, addresses of cross-
    references, double spaces, reduction to a single register, etc.;
        deletion of stop words;
        reducing the impact of the use of inflections by stemmatization, lemmatization.
    The latter tasks are language-dependent, and therefore the vast majority needs dictionaries. An
alternative is to use the Porter algorithm [4, 5] and similar, adapted for the Ukrainian language.
    It should be noted that pre-processing is also useful when working with the titles of structural
elements of the document.
    Keyword search algorithms can be divided into two categories: purpose (selection of those in the
dictionary) and extraction (selection directly from the text) [6]. The latter are valuable for this work.
    Keyword recognition can be done by statistical and / or structural methods. Among the first widely
used was the use of TF * IDF is a statistical measure of the frequency of occurrence of the word in the
document. Authors of work [6] propose a model based on Key phrase Extraction Algorithms (KEA),
which uses such features as TF * IDF and the distance from the beginning of the document to the first
meeting. To select keywords, the KEA (Bayesian) classifier is used, which determines the weight
assigned to a potential keyword and, based on it, marks the words as "key" and "non-key". Based on
these data a model is built, which assumes classes (key, non-key) for a word or phrase, depending on
the value of the calculated features.
   As for structural methods, there are based on graphs and templates.
   Keywords can be extracted based on the graph model of the linguistic corpus [7]. When defining
keywords, a sample of the highest frequency (TF * IDF) of about 40 words is created - these words
are potential vertices of the graph. Then choose the 20 words that occur in the largest number of
sentences. A graph is constructed from these words: vertices are extracted words; an edge between
vertices exists if the corresponding words occur in one sentence. The multiplicity of an edge indicates
the number of such sentences. Then select 20 words that correspond to the vertices of the graph with
the greatest degree - they are taken as key. The size of the selection by the number of words can vary.
   Approaches to the formation of a graph for whole or filtered text (TextRank, Rake, DegExt), are
considered in [3]. New implementations of keywords extraction algorithms are formed on their basis.
TextRank is one of the most popular algorithms [8] and works by analogy with PageRank. Its main
idea is graph forming for a text. In the keyword identification task, vertices are words, and for
annotation and abstracting tasks vertices are sentences. For each vertex, the rating is calculated by the
number of edges that end in the given vertex one and by the edge rating. The connection between
sentences is determined by the presence of identical words in them, and the weight of the edge by
their number. Thus, the most important are those sentences that have information from several others.
   Authors of work [9] use the TextRank algorithm using an additional database to complement the
short text with information to reduce the number of cases of incorrect word detection. The usage of
Word2Vec is proposed for an automated calculating of the similarity between words by their vectors
(these vectors fix the semantic features of the word) [10]. That determined the weight of the edges
between the vertices based on the similarity of the vertex words and improved keyword detection
compared to TextRank by reducing the number of false positives. It should be noted that the use of
the Word2Vec model is common in the task of extracting keywords [11, 12, and 13]. The use of
several indicators [14] to determine the weight of the graph edge (frequency of a pair of words and
each of the pair) and the selection of candidates for keywords has increased the effectiveness of
keyword detection. To identify candidates, the following was taken into account: the distance from
the central node, the average weight distribution in the links of one node, the importance of
neighboring nodes, the position of the node (especially important for short texts), word frequency.
   Thus, we can conclude that the most promising approach to keyword extraction is the construction
and analysis of the graph. The accuracy and number of false positives can be improved by using not
only statistical but also semantic features of words.
   Further in the work the construction of a document template on such features as titles of structural
elements, format, keywords is considered.

5. Results and discussion
5.1. Research of digital representation of document structure
    Defining the structure of the document involves the formation of some tree-like structure that
reflects its units - sections and subsections - structural elements. One of the approaches to determining
the structure of the document is the analysis of its formatting and separation of structural units based
on the titles of sections, subsections.
    Possible options for designating section and subsection headings are:
        embedded header styles;
        properties of the paragraph "level" without the use of styles;
        properties of the paragraph "level", which is determined by the style created by the user, and
    also on the basis of embedded style;
        selection according to the requirements for the preparation of reporting documentation:
    capital letters, from a new line, in the center, from a new page (with or without a gap), the
    appropriate numbering, etc.
   To determine the most common technique to identify structural elements, the xml structure of doc
/ docx documents with different techniques of designating structural element titles was analyzed. The
general structure is shown in Figure 1. The following tags are used for marking a document: <w:p>
represents a paragraph, <w:pPr> sets the paragraph properties, <w:rPr> sets the range properties,
<w:style> sets the style attributes, <w:lang> sets a language, <w:outlineLvl> sets a paragraph level,
<w:r> represents a range in a paragraph, <w:t> represents the text displayed in the range. Italics
indicate the attributes w: val, the value of which allow determining the selection method, given the
node to which it belongs. In Figure 1a is the number of standard header style,                ̅̅̅̅̅. In Figure
1b N is paragraph level,          ̅̅̅̅̅, zero corresponds to the first. In Figure 1c StyleTitle is the title of
the style created by the user and applied to this paragraph. The Unit title is the value of tag w:t and w:t
the title of the header of the structural element of the document.


                 a                                  b                                  c
Figure 1: The paragraph structure of the sectional title of the using: a – standard heading styles, b –
notation of the paragraph level without style, c – notation of the level with style

   A more detailed study of the xml-structure showed that the subtrees shown in Figure 1, have a
common parent node, regardless of the method of marking of the section and its level. It is <wx:sub-
section>. An example is shown in Figure 2. Section 1, Section 1.1 are titles of section and subsection,
First Section Text, Subsection Text are the texts they contain. In the example, there are two techniques
to denote section and subsection title: by using a custom style with a paragraph-level label and by
using a standard heading style. Also in this example, it is shown that only paragraphs with special
formatting add the <wx: sub-section> tag to the xml structure. To verify that the appearance of the
<sub-section> tag depends solely on the structure of the document, paragraphs were created and
formatted using a user style: in CustomStyle the paragraph level is one, and in CustomMainTextStyle
the paragraph level is set as main text (no level). It is confirmed that the appearance of this tag is due
solely to the structure.
   The use of the identified features is the basis of the process of constructing a structural tree of the
document. On the base of this tree a template is formed and updated

5.2.    Forming a document template based on xml-parsing
   The document template allows you to determine the content of the text of the section submitted for
checking to borrowings detection. The template includes a description of the sections. The description
of each section contains the title of the section, the sets of alternative titles and keywords, a set of
subsections. Each subsection can have a description similar to the section, but only the title is
required.
   Let’s build a model to form a template. This model is a constructor, whose task is to form a
document template on a set of structured natural language documents. Let’s model the states of a
template constructor (Figure 3). From their analysis we can distinguish the following actions of the
constructor:
       get an xml-representation of the document;
       pre-processing of the document;
       search for a section in the xml-representation of the document;
       adding a section of the document to the template;
       adding a subsection of the document to the template;
       search for keywords in the text of the document section;
       adding keywords to a section in the template;
       search for a section in the template by name;
       replenish the set of section keywords in the template;
       adding an alternative section title in the template;
       search for a section in the template by keywords;
       manual adding a section to the template.


Figure 2: An example of the document structure with different designations of the titles of sections
and subsections

   To formalize the model, we use the apparatus of constructive-synthesizing modeling. It is based on
the constructive properties of formal grammars. The approach is based on the use of a constructor.
The constructor is a triple, which includes a carrier, a signature of operations and a set of statements
of information constructing support that includes the ontology, the purpose, rules, constraints, initial
conditions, and construction completion conditions. To use for a specific subject area the constructor,
the clarifying transformations are performed: specialization ( ), interpretation ( ), concretization
( ), implementation ( ).
   To form a document template lets define the constructor, performing its specialization:
                               〈       〉            〈               〉,                             (1)
where                 is the heterogeneous replenishable carrier, is the terminals set, is the non-
terminals set,      is the signature of operations and relation for construction,      is the set of
statements of information constructing support.


                       a                                                    b
Figure 3: The states of the constructor in the process of forming the template: a – the general
scheme of forming, b – the specified scheme of the state called "go to the next document"

    Ontology of constructor       . Terminals include digital documents, templates, sections and strings:
section titles, alternative section titles, keywords, xml-representation of the document, name of xml-
nodes, text of the document section. A template is a sequence of sections.
    The element of carrier                  has a set of attributes                        . A heterogeneous
multiset of elements with attributes is meant by the set. Belonging of the attribute         to the element
will be denoted as           .
    The attributes of the section are                                  , where      is a title,      is a set of
alternative titles (it can be empty),       is set of keywords,     is set of subsections, each of them is a
section with corresponding attribute set;          can be empty.
    Signature             〈                〉       consists of operations sets and corresponding relationships of
the same name, where                                              are the operations of linking and transforming
carrier elements,                                are the operations of substitution and inference,                  are
relationships and attribute operations,              is the substitution relation.             〈       〉 is the set of
substitution rules, is the sequence of substitution relations,                 is the set of attribute operations. If
attribute operations are not performed, the substitution rule will look like this〈                  〉, where is an
empty symbol. The inference rules apply the relationship from , and the corresponding operations
apply in the implementation of the constructor.
    Let’s specify signature operations:
                       gets data (objects)        from the environment . It is similar to image reception
    operation                  , which involves determining (obtaining) certain code of the image m using
    the channel from the part of the external world [15];
                    is text pre-processing. It involves parsing the xml-representation of document and
    deleting the elements of a given set of nodes . Such nodes indicate hidden text, cross-references,
    white color text, and so on;
                             is concatenation. It connects sections                   so that        precedes        .
    The result is a sequence of sections                    ;
                                is the inheritance with the specification. It provides a redefinition of the
    section            as a result of which it has as attributes        and        , where         is a set of strings,
    which can be the titles or alternate titles, or keywords. It is used to add new attributes to sections
    and subsections. This operation is similar to the operation of inheriting an image [15], because
    when forming a section, determining its components, not only the object is formed, but also the
    corresponding image – a reflection of the object (prototype), its properties and relations on the
    material carrier;
                              is the union of string sets. The result of it is some set              , that contains
    elements                   without duplicate;
                          is search of node in xml-representation of . The results are a level of nesting
    and value of node . To search for sections (subsections)                                              ,         ,
    takes the value of a set of nested tags <w:p> in the found section                                  ;
                             is search for keywords          in the text contained in the set of paragraphs of a
    given section ;
                        is search section in the template by the title ;
                          is search section in the template by keywords              .
    The purpose of the constructor is to form constructions-templates and edit them. The purpose of
the template is to describe the general structure of a particular class of documents. For example, there
will be qualification graduation theses of higher education students, scientific works (articles,
abstracts, essays, etc.), educational works (explanatory notes to course projects, reports on laboratory
works, practice), technical documentation.
    Restrictions on the form templates are imposed by a class of input documents, their quantity; the
complexity of the constructed constructions depends on them.
    Initial conditions for construct:
                                   is the set of documents of a given class for which the template is form,
    is the number of documents on which the template will be form;
                is set of tags for markup digital documents;
                     -            is the name of the tag that represents the sections and subsections in the
    xml-representation of the document;
            is non-terminal from which the interfere begins;
           is an empty template construct that will be filled with sections;
           is the construct of the section for which the attributes will be defined;
            is a set of texts contained in the paragraphs of the section;
             is the text of the first paragraph of the section.
    Termination condition of constructing: the form does not contain non-terminals.
   To determine the algorithms for performing possible operations and relationships on documents,
templates and their components lets interpret the constructor (1) by the algorithmic structure :
      〈        〈             〉        〈           〉〉                 〈               〉,         (2)
where            ,                       is the set of basic algorithms,          ,   are the sets of definitions and
values of an algorithm               ,                        ,               is a set of possible executors of the
model-constructor that can implement algorithms of                       ;              ⋃         ( (    )     (   ))
        , where       is the heterogeneous replenishable carrier,                      is a set of the construct of the
template that satisfies     .
   Constructor            includes performing operations algorithms:
                     for algorithms composition,                 is sequential execution of the algorithm         after
   algorithm ;
           for conditional execution: algorithm executes if the expression b is true;
            for getting data from environment ;
            for preprocessing text;
              for concatenation of sections;
               for inheritance with specification;
               for union of sets;
                 for search of node in xml-representation of ;
                   for search for keywords     in the text of paragraph set      ;
                  for search section in the template by title, the result ( is the section, if the search
   is successful, else           ;
                   for search section in the template by keywords, the result ( is the section, if the
   search is successful, else          ;
                    for substitution;
                 ,         for partial and complete inference, where                    are forms,     is the axiom,
   is the set of constructed structures.
   Axiomatic of linking the operations and algorithms is as follows:                                      ),            :),
            ),                  ),                    ),                     ),                   ),                    ),
                 ),                         ),                      ),                       ),                          ),
                    )}.
    To clarify the input operations, let us perform concretization of the constructor (2):
                           〈                 〉                  〈                〉,                  (3)
where                   ,                                                           , is the terminal set,
is a template, that is a construction in the form of a set of sections, is a construction of the section,
     is the title of the section (it is a string),                are keyword sets: initial and updated,
respectively,                 – are alternative titles sets: initial and updated, respectively,
                                    is the set of non-terminals, is initial non-terminal.
    Let’s consider the operations associated with the formation of the template. Rules             describe
the actions corresponding to the states presented in Figure 3a. Rules               are a preparatory stage,
         correspond to actions in the state "build the structure tree" (Figure 3a). Finding and adding
keywords is described by an alternative right side of rule .
                                    〈                              〉,                                (4)
                                  〈                                  〉,                              (5)
                             〈                            -               〉,                         (6)
                                             〈              〉,                                       (7)
                                          〈               〉,                                         (8)
                                          〈               〉,                                         (9)
                                  〈                               〉.                                (10)
   Consider the rules that describe the actions of the state "go to the next document" (Figure 3a),
which are presented in detail in Figure 3b. Rules           correspond to the states “initialization”, “get
xml-representation” and “clean the text” at the state diagram in Figure 3b.
                                〈                                   〉,                              (11)
                                      〈                       〉.                                    (12)
   Rules            corresponds to the state “analyze a section” that can repeat if the document that
using for update the template has a several sections (Figure 3b)
       〈                                                                                       〉 (13)
                            〈                                           〉                           (14)
   Rule     corresponds to the state called “union keywords”.
                                   〈                               〉                                (15)
   Rule     allows finding a subsection by keywords.
                                    〈                             〉                                 (16)
   The next rule allows adding an alternative title for exist section
                                〈                                    〉.                             (17)
   Manual additional is made by the next rule
                                        〈                    〉.                                     (18)
   The implementation of the template constructor is to form templates of different types of
documents in accordance with their structure, defined by sections and subsections. Forming is
performed by executing algorithms associated with signature operations, according to the rules
defined by the information support of constructing.

5.3.    Software implementation
   The model presented in the previous section is the basis for the development of software for
automated template formation. The template will be used to improve the performance of the text
borrowings detection system for natural language structured digital documents [1].
   For software implementation of constructing of a structural tree for the document corresponding to
a condition of the same name in Figure 3, an algorithm can be constructed based on paragraph objects
and style retrieval methods provided by C# libraries [16, 17] and based on recursive processing of
xml-tree called XmlDocument C#, in which the contents of the document in xml-format can be
loaded.
   Functions have been developed to read the contents of a doc / docx file and build a document tree
based on the Composite pattern. They use described methods and are named object and xml. To
remove hidden text, the xml-based algorithm required the development of an additional text clearing
function. There was the investigation of the time efficiency of these methods for building a tree.
   Experimental base: 88 files in docx format with duplicates. Files are the documentation for the
diploma thesis of the Bachelor of Software Engineering DNURT-2018 (size: 0.7 MB – 27.3 MB,
107.2 – 310.3 thousand characters). The experiment was executed on a PC with the following
specifications: Intel (R) Core(TM) i5-9400F CPU, L1 code / L1 data cache / L2 – 6 * 32/6 * 32 /
6*256 KBytes, clock speed / system bus frequency / memory frequency – 2.9 GHz / 800 MHz /
2667 MHz, RAM access time (read / write) 5751/4253 MB / s, operating system – MS Windows 10
Home.
   The results were used to calculate comparative estimates for algorithms based on these methods of
working with the document:

                         ∑                                                                         (18)
              ∑                                                              {                     (19)

where is the average advantage of one algorithm over the second, is the area of the advantage of
the first algorithm over the second,                        are the time to build a document tree using
algorithms based on object and xml, respectively. The results showed that xml-algorithm is more
efficient in all cases. The average time of formation of the document tree                     is 16,4 sec,
         is 3,8 sec.
    Software components are currently being developed to search for keywords and edit the template
based on information from new documents. The template is an xml file. Its structure is shown in
Figure 4. The labels in the figure are the names of the tags.
    The constructor described in the sections above builds document templates in the basic
configuration (Figure 4a). It takes into account the possible depth of the structural elements of the
document and allows for deep analysis of the document. Thus, for comparison, the user can select not
only sections but also subsections (sub-items). This is relevant for a class of documents with
significant variability of components.
    However, if the documents are prepared according to certain strict standards, the structure of the
template can be simplified or supplemented. Templates of this type, which were built manually, were
used to recognize the structure of diploma projects. Experiments are underway. The experimental
base includes graduate theses of bachelors and masters works for specialty Software Engineering. The
works consist of two main parts: an explanatory note and appendices. The appendices contain the
technical task, working project and scientific publications of the author. The explanatory note has
strict regulated sections, which differ significantly from each other. Therefore, there is no need to
consider the detailed structure. The author's publications do not have a strict structure, so they are not
subject to verification by the template.


                       a                                             b
Figure 4: The structure of xml-template for the document: a – the base configuration, b – the
experimental configuration

   Experimental experience has shown that it is important for the subsection to specify only the title
and to save all keywords for the entire section (Figure 4b). Due to significant differences in the
content of sections, keywords are an important component of structure identification. Some words are
compulsory, such as “search methods”, “scientific novelty”, and so on.
   Some words are variable, so it is suggested that words be given weight (Figure 4b). Weight indicates
the relevance and importance of the keyword. Therefore, when recognizing a section by structure, the
user can set the required weight, which is the sum of the weight of the words. If a section does not have
keywords with a given weight, it potentially does not match the template element.
   The C# serialization mechanism is used to save the template to a file. For this reason, all
information is stored as a tag value, not its attribute.
   The task now is to develop an algorithm for calculating the weight of keywords and determining
the percentage and total weight of words that should be in the section so that it is recognized as an
element of the template.

5.4.    Summary and Future Works
   The document structure definition approach which proposed by the authors previously [1], requires
a human expert to forming the template. Therefore, this article presents the model of the template
constructing process in automatic mode.
   In this paper:
   1. the features of the document are defined for identifying document structure;
   2. the importance of taking into account keywords in identifying the document structure has
   been established. The completed review of related works showed the prospects of using graph
   approaches for extracting keywords with a preliminary analysis of the text on the basis of TF *
   IDF indicators;
   3. the xml-structure of documents was investigated, which made it possible to identify universal
   features for the identification of structural elements with different formatting;
   4. modeling of the process of forming a document template by means of UML was done, that
   allowed determining the basic operations and algorithms of the process model;
   5. the model of formation the document structure template has been developed by means of
   constructive-synthesizing modeling. This model encapsulates data and methods in a single formal
   object called constructor. It is used to automate template forming;
   6. some components of the developed model are implemented in software. Computer
   experiments have been carried out to investigate the time effectiveness of algorithms for
   constructing a structural tree of a document. The average advantage of the XMLDocument-based
   algorithm over the Microsoft.Office.Interop.Word object-based algorithm is about 75.5% and
   allows reducing the average time from 16.4 seconds to 3.8 seconds.
   Various configurations of the xml-template are currently being researched and discussed to
improve the efficiency of its use for working with documents in the academic environment.
   Further areas of work are:
       implementation and testing of keyword extraction algorithms;
       full software implementation of the developed model;
       investigation of stemmatization for working with headings of structural sections and
   extraction of keywords in order to improve the accuracy of the model as a whole;
       research the possibility of using existing software implementations of stemmatization for the
   Ukrainian language by library functions with bigger accuracy non C#.

6. Conclusion
   Taking into account the structure of the document in text borrowing detection avoids false
positives associated with the use of reference information and links. The proposed approach based on
the template allowed performing the recognition of document parts by content in order to further
select sections for check and better selection of materials for comparison.
   The creation of the model that is the constructor allowed to define and formalize all operations and
the data necessary for automation of construction of a template. This model can have several
implementations, depending on the algorithms interpretation of the described operations, which
makes it universal.
   The automation of template construction allows using the template approach for comparison of the
documents of different types: qualification works, technical documentation, etc. The developed
automated tool for forming templates is planned to be used to check the educational and qualification
works of higher education students.

7. References
[1] O. Kuropiatnyk, V. Shynkarenko, Text Borrowings Detection System for Natural Language
     Structured Digital Documents, Proceedings of the 4th International Conference on
     Computational Linguistics and Intelligent Systems, COLINS 2020, Lviv, Ukraine, 23–24 April
     2020. pp. 294–305.
[2] V. Shynkarenko, O. Kuropiatnyk, Constructive-synthesizing model of text graph representation,
     Proceedings of the 10th International Conference of Programming (UkrPROG'2016), Kyiv,
     Ukraine, May 24-25, 2016. vol. 1631, pp. 63 – 72.
[3] A. S. Vanyushkin, L. A. Grashchenko, Methods and algorithms for extracting key words. New
     information technologies in automated systems 19 (2016).
[4] T. V. Golub, M. Yu. Tyagunova, Method of steaming Ukrainian-language texts for classification
     of documents based on Porter's algorithm. Scientific works of Donetsk National Technical
     University. Series: Informatics, cybernetics and computer engineering 1 (2017 59–63.
[5] A. Hlybovets, V. Tochytsky, Algorithm of tokenization and steming for texts in Ukrainian
     (2017)
[6] E. V. Sokolova, O. A. Mitrofanova, Automatic extraction of keywords and phrases from
     Russian-language texts using the KEA algorithm. Computational linguistics and computational
     ontologies 1 (2017) 157–165.
[7] E. G. Grigorieva et al. Key word extraction algorithm based on the graph model of the linguistic
     corpus. Bulletin of the Volgograd State University. Series 2: Linguistics 16, No. 2. (2017).
[8] D. Suleiman, A. A. Awajan, W. Al Etaiwi, Arabic Text Keywords Extraction using Word2vec
     in: Proceedings of the 2nd International Conference on new Trends in Computing Sciences
     (ICTCS), 2019, pp. 1-7.
[9] W. Li, J. Zhao, TextRank algorithm by exploiting Wikipedia for short text keywords extraction
     in: Proceedings of the 3rd International Conference on Information Science and Control
     Engineering (ICISCE), 2016, pp. 683-686.
[10] Y. Wen, H. Yuan, P. Zhang, Research on keyword extraction based on word2vec weighted
     textrank in: Proceedings of the 2nd IEEE International Conference on Computer and
     Communications (ICCC), 2016, pp. 2109-2113.
[11] D. Suleiman, A. A. Awajan, W. Al Etaiwi, Arabic Text Keywords Extraction using Word2vec
     in: Proceedings of the 2nd International Conference on new Trends in Computing Sciences
     (ICTCS), 2019, pp. 1-7.
[12] H. Benghuzzi, M. M. Elsheh, An Investigation of Keywords Extraction from Textual Documents
     using Word2Vec and Decision Tree. International Journal of Computer Science and Information
     Security (IJCSIS) 18, No 5 (2020).
[13] S. Song et al. A Novel Text Classification Approach Based on Word2vec and TextRank
     Keyword Extraction in: Proceedings of IEEE Fourth International Conference on Data Science in
     Cyberspace (DSC), 2019, pp. 536-543.
[14] S. K. Biswas, M. Bordoloi, J. Shreya, A graph based keyword extraction model using collective
     node weight. Expert Systems with Applications 97 (2018) 51–59.
[15] V. Shynkarenko, O. Kuropiatnyk, Constructive Model of the Natural Language. Acta
     Cybernetica 23, no 4, (2018) 995–1015. doi: 10.14232/actacyb.23.4.2018.2.
[16] E. T. Sakhno, E. V. Romashka, Comparative analysis of existing libraries for creating and
     processing documents with C# .NET means. Bulletin of Lugansk National University named
     after Volodymyr Dahl 1-2 (2017) 80–82.
[17] A. N. Vildanov, From the experience of automating Word in C# on the example of creating a
     table of contents. Cloud of science 5, No 1. (2018).

</pre>