            Comparative Research of Index Frequency-Morphological
                  Methods of Automatic Text Summarisation∗
          Alexsander Osochkin                Vladimir Fomin                 Olga Yakovleva
             osa585848@bk.ru                 vv_fomin@mail.ru              ekzegeza@yandex.ru

                          Herzen State Pedagogical University of Russia
                              Saint Petersburg, Russian Federation




                                                 Abstract

            The article considers the potential of applying frequency-morphological analysis in
        index methods of automatic text summarisation. The main feature of the developed index
        method using frequency-morphological analysis is that it takes into account the importance
        of parts of speech in the particular language. The paper presents an evaluation of the
        effectiveness of automatic summarisation of scientific and educational documents and of
        fiction in Russian using various indexing methods. Based on the experimental results, the
        indexing methods were evaluated and ranked by the quality they achieve in automatic text
        summarisation algorithms, and recommendations for their use were made.
            Keywords: frequency-morphological analysis, automatic summarisation, indexing meth-
        ods, morphological analysis




1       Literature review
      The extraction of knowledge from natural language (NL) data, including the presentation of
information in short form, has always been a central topic in the educational field, especially with
regard to the integration of information into the educational process. Increasingly, redundancy
checking algorithms are becoming the main topic in the field of information processing
[Lei, 2017], [Salloum et al., 2017]. One of the first attempts to summarise texts was described
in the work of [Luhn, 1958], whose author faced the problem of defining the important parts of
a text and the important words in it. Over the years, several techniques have been applied to
this problem, including recent attempts using neural networks [Shari, 2018], [Molchanov, 2015],
[Jansen, 2010]. Today, authors increasingly resort to frequency and frequency-morphological
methods of text analysis in automatic summarisation [Said et al., 2017], [Al-Emran, 2017]; this
trend is explained by the simplicity of implementing these algorithms. A survey of papers on
automatic summarisation in the EBSCO search platform revealed that the number of works in
this field [Getahun, 2017] is steadily increasing by 17% every year, and that frequency-morphological
indexing methods are becoming the predominant approach.
    ∗
     Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution
4.0 International (CC BY 4.0).


2     Introduction

      Automatic summarisation is an automatic process that creates a resulting text from one or
more source texts and transmits most of their information in a smaller size [Brandow et al., 1995].
Today, there are many different methods of automatic summarisation [Clayton et al., 2011],
[Evdokimenko, 2013], [Al-Emran, 2017], [Salloum et al., 2017], [Sujit et al., 2013], [Shari, 2018],
but among them it is particularly worth highlighting indexing methods, which are based on
simple, well-proven frequency analysis of textual information in NL [Sujit et al., 2013],
[Shari, 2018], [Molchanov, 2015], [Jansen, 2010].
      In the field of text mining and natural language processing (NLP), frequency analysis is the
predominant method of text analysis [Yogesh, 2014], but scientific articles increasingly shift to
more complex methods of text analysis, such as the identification of the grammatical basis of
sentences, using frequency-morphological analysis. The main problem of automatic summarisation
is the identification of the most significant parts of the text, which, when extracted from the text,
would preserve its integrity and reflect the main topic of the document. A rather large number of
automatic summarisation methods have been developed over the past two decades, but all of them
can be conditionally divided into two groups: extraction methods and abstraction methods.
      Abstraction methods are automatic summarisation methods based on the creation of a new
text that consolidates the original text using new words and synonyms. These methods are of
great scientific interest, especially in the field of NLP, as they involve the use of complex semantic
analysis algorithms.
      Abstraction methods include three necessary steps:
      1) identification of the main idea of the text, its frequently used words, its main topic, etc.;
      2) indexing of words, phrases and other meaningful units;
      3) consolidation and synthesis of the new text based on the indexing.
      Extraction methods are methods of automatic text summarisation in which elements with
low indices are removed from the text. The distinctive feature of this approach is that the wording
of the original text is preserved. The algorithm identifies the importance of sentences and words
using indexing methods, which allow ranking the elements of the text: words, sentences,
paragraphs. The majority of industrial-scale automatic summarisation systems are implemented
within the framework of this approach [Yogesh, 2014], although these systems also have a number
of problems.
      Regardless of the type of automatic summarisation method, each uses indexing of the internal
content of the text in order to rank the text elements and keep the most significant ones. Therefore,
the most important step for both kinds of automatic summarisation is the indexing of the internal
content of the text. Despite the fact that this stage is key for any summarisation method, there
is no general reliable indexing method that would be effective across a large number of different
tasks. Instead, there is a wide range of indexing algorithms, each of which provides an effective
method of text summarisation only for a particular structure or text type.
      The first automatic summarisation methods were based solely on frequency or positional
analysis of each individual word or of its position in the text. With the development of text mining
and NLP, more sophisticated methods of semantic and linguistic analysis began to be applied,
and researchers tried to analyse the text at the level of higher meaningful units such as paragraphs,
thematic parts, sentences, etc. The main problem of methods based on semantic and linguistic
analysis is the lack of comparison of the importance of sentences [Baxendale et al., 1958],
[Yogesh, 2014], [Shari, 2018], [Dragomir, 2012]. This feature significantly reduces the quality of
automatic summarisation; it applies to a large extent to texts where the author mentions several
topics, as well as to artistic texts.
      The main problem with indexing is that there is no consensus about which minimum unit
of text analysis is the best for automatic summarisation. On the one hand, frequency methods of
text indexing analyse the text at the level of unigrams, that is, separate words cut from the context;
on the other hand, semantic and linguistic methods of indexing involve sufficiently large meaningful
units such as paragraphs, subsections, sentences, etc.
      Since both types of automatic summarisation share this indexing problem, we have chosen
indexing methods based on frequency analysis: they are more universal, less dependent on
language specificity, there are ready-made software solutions for indexing documents, and these
methods do not require specialised knowledge in the field of linguistics.
      Research on automatic summarisation [Sujit et al., 2013], [Shari, 2018],
[Molchanov, 2015], [Jansen, 2010], [Yogesh, 2014], [Tarasov, 2010], [Gambhir, 2016] has revealed
that modern text indexing algorithms, including the frequency method, are based on word-level
analysis of the text and ignore the analysis of more highly organised units: phrases, sentences and
paragraphs. When analysing higher units of text, new properties appear: cohesion, coherence,
the autosemantic character of individual paragraphs and text lines, etc. The use of more highly
organised units while indexing a document requires a transition to frequency-morphological
analysis in order to identify and use the special properties of phrases, sentences and paragraphs.

2.1   Relevance and purpose of the article
      Information redundancy is a major problem in various information environments where
huge amounts of semi-structured natural language data are accumulated. This problem is
particularly relevant for informational educational environments, because the provision of brief
reference material speeds up the search for the necessary information, which affects the level and
quality of education in general. Nowadays there is no doubt that intelligent search significantly
increases efficiency in any information environment that searches through huge amounts of
semi-structured natural language data. In such circumstances, new effective methods for dealing
with large amounts of information that can convey the exact content of a document in a concise
form are of particular importance.
      One of these methods is automatic summarisation as a type of analytical and synthetic
document processing [Sujit et al., 2013], [Shari, 2018], [Radev et al., 2002], which provides the
required information support [Molchanov, 2015], [Jansen, 2010]. The purpose of the study is to
evaluate the effectiveness of automatic text summarisation using index methods, including the
use of frequency-morphological analysis.
      In order to achieve this purpose, the following tasks were set:
      1) to analyse approaches to automatic text summarisation based on index methods;
      2) to select frequency indexing methods for generating automatic summaries of text
materials in Russian;
      3) to modify the selected index methods using frequency-morphological analysis as the
main one;
      4) to evaluate and compare the results of automatic summarisation with frequency and
frequency-morphological analysis in terms of precision, recall and the amount of reduction
relative to the source text and the reference.


3     Automatic summarisation algorithm
     When index methods of automatic summarisation are used, any text Dj in natural language
can be represented as a set of words W = {w1, w2, ..., wn}, where each word wn has an index
F obtained by a frequency indexing method. When a frequency-analysis-based indexing method
is used, the text is represented at the elementary level; thus the main elementary unit of frequency
analysis is the word.
     We propose to extend frequency indexing methods by adding morphological analysis, thanks
to which we are able to obtain another index, V.
     The morphological index V is determined on the basis of the importance of the parts of
speech of the NL in which the text is written. As a result of indexing a document by means of
frequency-morphological analysis, an index P is obtained, for which the following statement
holds (formula 1):

                                            P = F ∗ V                                            (1)

      The document indexing process can be represented as a number of steps:

  1. Indexing of the document.

  2. Carrying out morphological analysis of the text.

  3. Obtaining a combined frequency-morphological index.
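
     As an illustration of how the combined index from formula (1) can be assembled once the
two component indices are available, here is a minimal Python sketch (the function and variable
names are ours, not taken from the paper's software):

    def combined_index(words, F, V, pos_of):
        # F: frequency index of each word (step 1), e.g. a TF-IDF-style score;
        # V: morphological index of each part of speech (step 2);
        # pos_of: maps a word to its part of speech.
        # Step 3: P = F * V for every distinct word of the document.
        return {w: F.get(w, 0.0) * V.get(pos_of(w), 0.0) for w in set(words)}

    def top_sentences(sentences, P, keep=5):
        # sentences: lists of word tokens. Rank sentences by the sum of the
        # combined indices of their words and keep the highest-scoring ones
        # for the extractive summary.
        scored = sorted(sentences,
                        key=lambda s: sum(P.get(w, 0.0) for w in s),
                        reverse=True)
        return scored[:keep]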


3.1    The index of the document
      The first step in automatic text summarisation is to apply a frequency indexing method
that calculates the index F for each word in the text. The calculation of the index F depends on
the algorithm or method of indexing; later, the obtained index is combined with the morphological
analysis index V. In this paper, we selected the following as the main algorithms.
      TF-IDF. Luhn developed a method of analysing textual information that identifies the most
significant, relevant words, originally intended for classifying documents in natural language
[Luhn, 1958]. At the heart of the TF-IDF method is frequency analysis and the hypothesis that
the most important words occur in the text more often than the rest of the words. Thanks to
this approach, the TF-IDF method can be used not only to classify documents but also to reduce
information redundancy: sentences that do not contain the most significant words are removed
from the text. The remaining text is then subjected to linguistic analysis to reconcile the
remaining sentences.
      TF-ISF. A modification of TF-IDF [Luhn, 1958] aimed at testing the hypothesis that the most
important words are used more than once within a single sentence but are rarely found throughout
the rest of the document.
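
      A minimal sketch of the two word indices as we read the descriptions above (a simplification;
the names are ours):

    import math

    def tf_idf(word, doc_tokens, corpus_docs):
        # corpus_docs: list of token lists, one per document of the corpus.
        tf = doc_tokens.count(word) / len(doc_tokens)
        df = sum(1 for d in corpus_docs if word in d)
        return tf * math.log(len(corpus_docs) / (1 + df))

    def tf_isf(word, sentence_tokens, doc_sentences):
        # The same idea at sentence level: frequency inside one sentence
        # against the number of sentences of the document containing the word.
        tf = sentence_tokens.count(word) / len(sentence_tokens)
        sf = sum(1 for s in doc_sentences if word in s)
        return tf * math.log(len(doc_sentences) / (1 + sf))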
      Collocations. A technology for identifying significant sentences in the text based on the
analysis of the weight of phrases. The index of a sentence is calculated from the number of
sentences that share a common word with it, relative to the total number of sentences.
      Positional analysis of sentences. This indexing technique rests on the hypothesis that the
main sentences appear at the beginning and at the end of the text being indexed; thus, sentences
at the beginning and end of the text receive the largest index and are retained during automatic
summarisation.
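
      A possible reading of this positional scheme in code (the exact weighting is our illustrative
choice, not the paper's formula):

    def positional_index(i, n_sentences):
        # Sentences closest to the beginning or end of the text get the
        # largest index; the weight decays towards the middle.
        distance_to_edge = min(i, n_sentences - 1 - i)
        return 1.0 - distance_to_edge / max(1, n_sentences // 2)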
      The signal method. A technology based on the theory that the key and most important
sentences use specific signal words: meaningful, complex, heavy, tasks, goals, etc. The words are
taken from a special dictionary developed by H. P. Edmundson.
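
      A sketch of the signal method under the same reading (the cue words below are illustrative
stand-ins; the paper does not list the dictionary entries):

    CUE_WORDS = {"meaningful", "significant", "problem", "task", "goal", "conclusion"}

    def signal_index(sentence_tokens, cue_words=CUE_WORDS):
        # A sentence is weighted by the share of its words found in the
        # cue-word dictionary.
        hits = sum(1 for w in sentence_tokens if w.lower() in cue_words)
        return hits / max(1, len(sentence_tokens))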
      Neural networks. Deep machine learning appeared relatively long ago and is an actively
developing direction that has found application in a wide range of problems: robotics, training
and recognition of graphic information, and intelligent search. One of the most important results
in the field of automatic summarisation in recent decades was Collobert's research [Shari, 2018],
[Molchanov, 2015], which developed a unified procedure for machine analysis of text. Many
modern automatic summarisation software systems use the Collobert method [Collobert, 2008].
The Collobert approach allows indexing parts of the text (sentences, paragraphs, etc.) by their
semantic and linguistic importance. Automatic summarisation of text using deep learning differs
from a conventional neural network by the number of layers, which makes more complex
calculations possible.

3.2    Morphological analysis
     The use of morphological analysis in document indexing makes it possible to apply more
complex methods of calculating document indices that take the specificity of the NL into account.
     An investigation of many different texts on classification and automatic summarisation
showed that the words used most often in a sentence, as well as the words found most rarely in
the text, are much less important than the words located in a certain middle range of frequency
of use (see Figure 1).




              Figure 1: Frequency and significance of words in frequency analysis


      Figure 1 presents the concept of evaluating the meaning of a word in a text that is used in
most TF-IDF modifications: in natural language, the auxiliary and little-meaningful parts of
speech that are used most often do not carry the meaningful load needed to convey the content
of the document but merely complement the presented material. The words used most rarely in
the text are also unable to reflect its topic, because as a rule the rarest words are synonyms,
manifestations of the specific style of the author, etc., which cannot characterise the text as a
whole.
      In order to assess the significance of parts of speech, a small corpus of texts consisting of
200 works of fiction by modern writers in Russian was collected [Internet portal “Bookzip”];
the books belong to different literary genres. The method of evaluating the significance of parts
of speech is based on the use of interrelated parts of speech in the text, for example a verb and
a noun used in a sentence as subject and predicate, as well as on the frequency of use of parts of
speech in the text. “Solaris Engine”, a morphological analysis library, was used for identifying
parts of speech, analysing bigrams, sentences, etc. This library was selected on the basis of a
number of studies [Fomin et al., 2019].
The morphological index V of each word is calculated by formula (2):

                          Vn = Sn / Σn wn + Σi Qs / (Σn wn ∗ h)                                  (2)

      where Sn is the frequency of use of the n-th part of speech in the text Dj, wn is the n-th word
of the text Dj, Qs is the frequency of use of combinations of the part of speech Sn with other parts
of speech, and h is the number of parts of speech in the NL in which the analysed text is written.
Figure 2 shows the results of indexing parts of speech, based on 200 artistic texts in Russian.




                              Figure 2: Indexes of parts of speech


     Figure 2 shows the morphological indices calculated on the basis of formula (2). The results
of morphological indexing of parts of speech in Russian showed that the highest indices are
obtained by verbs and nouns, as well as by their combinations. The third most important part of
speech is the adjective, which is often used in combination with nouns, further increasing the
index of this part of speech. The smallest indices are received by the auxiliary parts of speech,
which are not included in the main parts of sentences.
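
     A minimal sketch of formula (2) as we read it, with Sn, Qs, wn and h as defined above (the
names are ours):

    def morphological_index(pos, pos_freq, pair_freq, total_words, n_pos_classes):
        # pos_freq[pos]  -> Sn: occurrences of this part of speech in the document
        # pair_freq[pos] -> sum of Qs: occurrences of this part of speech in
        #                   combinations with other parts of speech
        # total_words    -> sum of wn: number of words in the document
        # n_pos_classes  -> h: number of parts of speech in the language
        return (pos_freq[pos] / total_words
                + pair_freq[pos] / (total_words * n_pos_classes))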


4    Experiment
     Based on the analysis of methods for evaluating automatic summarisation in [Yogesh, 2014],
[Tarasov, 2010], [Gambhir, 2016], the “Rouge” method was chosen [Yogesh, 2014], [Gambhir, 2016],
because it is easy to modify, has many varieties and is less prone to chance. Accuracy calculations
are carried out using the freely distributed “Rouge” application from [GitHub “Rouge”].
     The “Rouge” method is based on the use of n-grams. A distinctive feature of this method is
the granularity of its units of measure: Rouge-1, for example, considers a word as the minimum
unit, while in Rouge-2 the minimum unit is a bigram; this family of evaluation measures is called
“Rouge-N”. There are other types of evaluation measures: “Rouge-S” takes measurements based
on skip-bigrams, i.e. word pairs allowing for changes in the text sequence, and “Rouge-L” takes
measurements based on the longest common subsequence between the reference and the
generated abstract, etc.
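
     For illustration, Rouge-N precision, recall and the F-measure reported in the tables below
can be computed as in the following simplified sketch (the experiments themselves used the
Rouge application from [GitHub “Rouge”]):

    from collections import Counter

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n(candidate_tokens, reference_tokens, n=2):
        # Clipped overlap of n-grams between the generated abstract and the reference.
        cand = ngram_counts(candidate_tokens, n)
        ref = ngram_counts(reference_tokens, n)
        overlap = sum((cand & ref).values())
        precision = overlap / max(1, sum(cand.values()))
        recall = overlap / max(1, sum(ref.values()))
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure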
     According to the “Rouge” metric, each summary is evaluated by two indicators, Precision
and Recall (completeness); based on these two indicators, an overall indicator, the F1 score
[Getahun, 2017] (the M-measure in the tables below), is calculated. More about methods of
evaluating summarisation results can be found on the natural language processing portal
“RxNLP” [Internet portal “Portal”, “NLP text-mining”]. Frequency-morphological analysis is
performed by special software; to date, there is quite a large number of morphological analysis
libraries. In this study we apply the morphological analysis methodology developed by the
authors [Fomin et al., 2019].
     The difference when calculating indices using frequency-morphological analysis lies in the
preprocessing of the text: lemmatisation, i.e. bringing all words into their initial form, identifying
the part of speech of each sentence member, segmenting the text according to the dictionaries of
the morphological modules, identifying parts of speech, and determining the grammatical basis of
sentences and chains of parts of speech.
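
     The paper performs this preprocessing with the authors' Solaris Engine library; purely as an
illustration of the same steps, here is a sketch using the open-source pymorphy2 analyser for
Russian (our substitution, not the paper's tooling):

    import pymorphy2  # pip install pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def lemma_and_pos(tokens):
        # For each token return its initial form (lemma) and part-of-speech tag,
        # the two pieces of information the frequency-morphological index needs.
        result = []
        for t in tokens:
            p = morph.parse(t)[0]          # most probable analysis
            result.append((p.normal_form, p.tag.POS))
        return result

    # lemma_and_pos(["студенты", "читают"]) -> e.g. [("студент", "NOUN"), ("читать", "VERB")]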

4.1      Text corpora
      In order to conduct an experiment comparing frequency and frequency-morphological
analysis in indexing, two special text corpora were collected. To compare and to assess the
accuracy, completeness and efficiency of summarising, the methods need a reference text; as a
rule, the reference text is a paper, composition or essay written by a person.
      An important element in automatic summarisation and in reducing text redundancy is the
preservation of the basis of the text, which allows the main theme of the text to be preserved.
Within the framework of automatic summarisation it is common to define two types of texts
[Yogesh, 2014], [Tarasov, 2010], [Gambhir, 2016]:
      • context-identifiable - these texts are expected to describe specific issues, problems, topics;
      • context-indelible - in these texts there is no clearly marked theme; it can be hidden,
including from the reader, in the general context1 .

Context-identifiable corpora The corpus “Dissertation” belongs to the context-identifiable
automatic summarisation setting and is represented by dissertations for obtaining the PhD degree,
collected from the sites of various universities of the Russian Federation in different fields and
specialties; an autoabstract is attached to each thesis.

                                               Table 1: Dissertation

                         Science field      Number of dissertations    Number of autoabstracts
                              IT                   30                        30
                            History                30                        30
                          Chemistry                30                        30
                        Jurisprudence              30                        30
                            Biology                30                        30
                           Medicine                30                        30
                          Pedagogics               30                        30
                            Physics                30                        30
                         Philosophy                30                        30
                           Economy                 30                        30


     All dissertations and autoabstracts in this text corpus were published between 2007 and
2019. The average size of one dissertation is 142 pages or 76,964 words; the average length of an
autoabstract is 22 pages or 8,464 words. The works within one subject area have different topics
and directions; for example, in Jurisprudence the works discuss the problems of the civil code of
the Russian Federation, judicial document production, the Customs Code of the Customs Union,
etc. The reference text of the abstract in this corpus is the autoabstract of the thesis written by
its author. For comparison, abstracts created by the joint application of the indexing technologies
with morphological analysis are used.
    1
      As a rule, the context-indelible group of texts includes artistic works

Context-indelible corpora The corpus “Art literature” belongs to the context-indelible type and
is represented by various artistic works in Russian from different time periods. As reference texts,
compositions and essays taken from the Internet that retell the content of the artistic work are
used. The corpus “Art literature” is presented in Table 2.

                                     Table 2: Art literature

                             Author          Number of works    Number of essays
                        F. M. Dostoevsky         20             20
                           A. I. Kuprin          20             20
                          L. N. Tolstoy          20             20
                         A. P. Chekhov           20             20


     The second comparative corpus of automatically summarised texts was created in the same
way as the comparative abstracts in the “Dissertation” corpus: the abstracts were generated using
the different indexing technologies together with morphological analysis.

Table 3: Evaluation of automatic summarisation of the context-identifiable corpus based on
frequency indexing methods

                                                                   Rouge N
 Method             Evaluation
                                   1            2            3         4              5               6
                    Precision      0,2110718    0,1731205    0,144557  0,0861217      0,0643702       0,1828707
 TF-IDF             Recall         0,1750908    0,143609     0,1199146 0,0714407      0,0533971       0,151697
                    M-measures     0,191405     0,1569898    0,1310878 0,0780972      0,0583724       0,1658315
                    Precision      0,2036052    0,1864228    0,1415896 0,0829204      0,0702028       0,1881112
 TF-ISF             Recall         0,168897     0,1546437    0,117453  0,0687851      0,0582354       0,1560442
                    M-measures     0,1846341    0,1690527    0,1283968 0,0751942      0,0636615       0,1705838
                    Precision      -            0,2406241    0,2413066 0,2201747      0,184369        0,0403017
 Collocations       Recall         -            0,1710078    0,1714928 0,1564747      0,1310282       0,0286418
                    M-measures     -            0,1999291    0,2004962 0,1829382      0,153188        0,0334857
 Positional         Precision      0,1940082    0,1577003    0,1471301 0,1185674      0,1773649       0,0382542
 analysis of        Recall         0,1378786    0,1120751    0,104563  0,084264       0,1260504       0,0271866
 sentences          M-measures     0,161197     0,1310296    0,122247  0,098515       0,1473684       0,0317845
                    Precision      0,1837544    0,1677501    0,1347171 0,1243346      0,1748481       0,1939582
 The signal
                    Recall         0,1305914    0,1192174    0,0957413 0,0883626      0,1242618       0,137843
 method
                    M-measures     0,1526773    0,1393798    0,1119334 0,1033068      0,1452773       0,1611554
                    Precision      0,1778164    0,1273729    0,1199331 0,1039203      0,1449691       0,1275056
 Neural networks    Recall         0,1778225    0,1273772    0,1199372 0,1039239      0,144974        0,12751
                    M-measures     0,1778195    0,1273751    0,1199352 0,1039221      0,1449715       0,1275078



4.2   Evaluation of automatic summarisation of context-identifiable corpora
    We will evaluate the automatic summaries generated by the various indexing methods that
are based exclusively on frequency analysis. The results of the evaluation of automatic
summarisation by the “Rouge-N” metric are presented in Table 3.

Table 4: Evaluation of automatic summarisation of the context-identifiable corpus based on
frequency-morphological indexing methods

                                                                     Rouge N
 Method                Evaluation
                                      1              2          3        4          5          6
                       Precision      0,361573       0,381362   0,297024 0,304661   0,333702   0,292211
 TF-IDF                Recall         0,113437       0,119649   0,09319  0,095561   0,104694   0,091657
                       M-measures     0,172694       0,18214    0,141873 0,145484   0,159387   0,139563
                       Precision      0,403672       0,429321   0,332682 0,342371   0,374109   0,324925
 TF-ISF                Recall         0,113936       0,121172   0,093881 0,096635   0,105591   0,091721
                       M-measures     0,17771        0,189003   0,146432 0,150721   0,164691   0,143054
                       Precision      -              0,435501   0,33628  0,345765   0,378211   0,3322
 Collocations          Recall         -              0,5421     0,419126 0,430326   0,470701   0,413442
                       M-measures     -              0,483016   0,373494 0,383449   0,419412   0,368392
 Positional            Precision      0,365606       0,327589   0,302069 0,300216   0,340532   0,303881
 analysis of           Recall         0,224472       0,201131   0,185469 0,184353   0,209048   0,186571
 sentences             M-measures     0,278161       0,24923    0,229811 0,228449   0,259028   0,231201
                       Precision      0,438151       0,464082   0,360248 0,369892   0,405142   0,356213
 The signal method.    Recall         0,300237       0,318511   0,246854 0,253461   0,277611   0,244101
                       M-measures     0,356321       0,378001   0,292976 0,300803   0,329477   0,2897
                       Precision      0,435757       0,46195    0,358354 0,371772   0,403731   0,354252
 Neural networks       Recall         0,525716       0,557341   0,432354 0,448514   0,487161   0,427411
                       M-measures     0,476555       0,505182   0,391897 0,406526   0,441572   0,387411


      The absence of Rouge-1 values for the “Collocations” indexing method is a consequence of
the impossibility of using unigrams in that kind of text indexing.
      In the Rouge-1 evaluation of automatic summarisation, the highest accuracy is achieved by
the autoabstracts generated with neural network indexing, but the best overall correspondence
(M-measure) was achieved by TF-IDF. This can be explained by the fact that the “Recall” of the
abstracts obtained by TF-IDF indexing is lower than that of the neural network abstracts, while
the number of words coinciding with the reference increased; the neural networks also used
auxiliary parts of speech more often (43.34%) than the TF-IDF method.
      When using bigrams (Rouge-2), the best “Precision” and “Recall” were shown by the
“Collocations” method, which suggests the presence of coherence between words in the reference
texts. Moving from bigrams to Rouge-3 decreases the accuracy of almost all methods except the
phrase-based one. Further increasing the length of the n-gram chains reduced “Precision” and
“Recall”. The best result for n-grams of 6 words was shown by the TF-IDF method.
      Table 4 presents the results of automatic summarisation of the dissertations in which indexing
was carried out on the basis of frequency-morphological analysis. As with frequency analysis,
after indexing and shortening the text, morphological libraries were used to reconcile the sentences.
      The best “Recall” in the evaluation of unigrams (Rouge-1) was shown by the positional
analysis of sentences method.
      When the positional analysis of sentences method is used, the “Recall” of the main sections
(“Introduction”, problem statement, “Conclusions”) hardly decreases, because they are located
at the beginning and end of the texts; when the volume of the conclusions was sufficiently small,
the abstract included the relevance, problem and methodological parts of the dissertation.
      Despite the high “Recall”, the number of words matching the reference text was extremely
small, which made the overall “Precision” of this method's summarisation low.
      The lowest score was obtained by the automatic summaries whose indexing was carried out
with the TF-IDF method. The average accuracy of TF-ISF is higher than that of TF-IDF; this
result indicates that, as the text volume grows, words that are used frequently within a single
sentence are also often used in abstracts written by humans.
      Better “Precision” is achieved by every method when unigrams are used. When the abstracts
are evaluated with longer n-grams, the similarity with the reference falls, except for the collocations
method. Neural networks reached the best M-measure in automatic summarisation, while also
generating the abstracts of the largest volume.

4.3   Evaluation of automatic summarisation of context-indelible corpora
     We will now evaluate the automatic summaries generated by the various indexing techniques
based on frequency analysis. The results of the evaluation of automatic summarisation by the
“Rouge-N” metric are presented in Table 5.


Table 5: Evaluation of automatic summarisation of the context-indelible corpus based on
frequency indexing methods

                                      Rouge-N
 Method                Evaluation
                                      1           2          3          4           5          6
                       Precision      0,309992    0,226085   0,093911   0,076511    0,158793   0,038357
 TF-IDF                Recall         0,270655    0,299204   0,170447   0,085492    0,143018   0,039314
                       M-measures     0,337491    0,107212   0,235627   0,060411    0,199072   0,160076
                       Precision      0,506626    0,458609   0,229601   0,195824    0,145352   0,078777
 TF-ISF                Recall         0,251234    0,098935   0,239602   0,158155    0,179512   0,143286
                       M-measures     0,292377    0,229081   0,092807   0,178233    0,082385   0,045001
                       Precision      0,278616    0,298429   0,206652   0,063323    0,145968   0,045118
 Collocations          Recall         0,552911    0,333939   0,333282   0,279554    0,23204    0,122982
                       M-measures     0,360896    0,305001   0,250727   0,222087    0,232677   0,063547
 Positional            Precision      -           0,267986   0,126883   0,105382    0,118786   0,092934
 analysis of           Recall         -           0,318982   0,269306   0,300199    0,170863   0,24958
 sentences             M-measures     -           0,157603   0,141343   0,231817    0,282642   0,117228
                       Precision      0,249845    0,111706   0,068327   0,152547    0,119657   0,110365
 The signal method.    Recall         0,334284    0,258972   0,194206   0,085959    0,130805   0,067205
                       M-measures     0,291674    0,245833   0,14378    0,132268    0,176823   0,133074
                       Precision      0,273879    0,182688   0,092082   0,108367    0,220331   0,090309
 Neural networks       Recall         0,453305    0,440339   0,323477   0,123206    0,130873   0,131392
                       M-measures     0,433696    0,244462   0,186702   0,120517    0,16168    0,205693


     In the Rouge-1 evaluation of the automatic summaries generated on the basis of frequency
indexing methods, the best method was neural networks, where the value of the measure reached
43.36%. The best correspondence with the reference at the bigram level was shown by automatic
summarisation with the “Collocations” method, where the generated text matched the reference
in 30.5% of cases. In the evaluation of the Rouge-3 to Rouge-6 n-grams, the automatic summaries
obtained by the “Collocations” indexing method also have the best M-measure values against the
reference. We will now carry out a comparative analysis by the Rouge-N procedure using
frequency-morphological analysis in indexing; the results are presented in Table 6.

Table 6: Evaluation of automatic summarisation of the context-indelible corpus based on
frequency-morphological indexing methods

                                     Rouge-N
 Method               Evaluation
                                     1           2          3          4          5          6
                      Precision      0,426539    0,522639   0,502415   0,446527   0,509916   0,427969
 TF-IDF               Recall         0,388721    0,476286   0,459437   0,406925   0,464692   0,390012
                      M-measures     0,406747    0,498387   0,480756   0,425807   0,486255   0,408117
                      Precision      0,600605    0,732799   0,68317    0,634922   0,712932   0,605985
 TF-ISF               Recall         0,40032     0,488431   0,455352   0,423193   0,475189   0,403905
                      M-measures     0,480424    0,586166   0,546468   0,507874   0,570274   0,484727
                      Precision      -           0,558879   0,519882   0,483895   0,513947   0,441059
 Collocations         Recall         -           0,970924   0,903175   0,840656   0,892864   0,765423
                      M-measures     -           0,70941    0,659909   0,61423    0,652376   0,551926
 Positional           Precision      0,413795    0,494613   0,479276   0,438813   0,448022   0,392228
 analysis of          Recall         0,635545    0,759672   0,736116   0,673971   0,688101   0,602825
 sentences            M-measures     0,501239    0,599136   0,580558   0,531545   0,542697   0,475178
                      Precision      0,315449    0,376193   0,367336   0,342419   0,360224   0,314927
 The signal method    Recall         0,391679    0,467101   0,456104   0,425166   0,447273   0,391031
                      M-measures     0,349455    0,416747   0,406936   0,379332   0,399057   0,348877
                      Precision      0,450621    0,535067   0,509098   0,549532   0,546561   0,541471
 Neural networks      Recall         0,576912    0,685024   0,651778   0,703543   0,699739   0,693223
                      M-measures     0,506005    0,60083    0,571672   0,617073   0,613737   0,608021


     Evaluation of the automatic summaries generated on the basis of frequency-morphological
analysis using the Rouge-1 technique showed that the highest compliance with the reference was
achieved by the methods that used neural networks. A total compliance with the reference of
50.6% was achieved, which is 17.94% better than with the frequency analysis method. The
“Precision” indicator reached 97%, which allows us to say that almost all words used in the
automatic summary are also found in the reference text.


Conclusions
     The experimental results show that moving from simple unigram frequency analysis to more
complex frequency-morphological analysis has a great impact on the quality of automatic text
summarisation. The “Rouge-N” method, used for evaluating the efficiency of automatic text
summarisation, showed that autoabstracts made on the basis of frequency-morphological analysis
were 16% closer to the reference than autoabstracts based on frequency analysis only; moreover,
automatic summarisation of fiction was 31.28% more accurate with frequency-morphological
analysis than with frequency analysis alone.
     When frequency-morphological analysis was used in indexing documents, it was found that
all methods except TF-IDF increased the similarity with the reference text. In comparison with
the reference texts written by people, the generated text is on average 48% accurate, and the text
reduction reached 93.05% for dissertation summarisation.
     The experimental results suggest a high potential for index methods based on neural
networks with frequency-morphological analysis in smart search and in informational educational
environments. We expect to expand the scope of application of frequency-morphological analysis
to the automatic summarisation of fiction and to the indexing of part-of-speech usage frequency
in order to classify NL data.


Acknowledgements
      The research was supported by the Russian Science Foundation (RSF), Project “Digitalisation
of the high school professional training in the context of education foresight 2035” No 19-18-00108.


References
[Brandow et al., 1995] Brandow R. Mitze K., and Lisa F. R. (1995) Automatic condensation of
        electronic publications by sentence selection // Inf. Process. Manag. Vol. 31. Pp. 1-8.

[Baxendale et al., 1958] Baxendale P. B. et al. (1958) Machine-made index for technical liter-
        ature: An experiment // IBM J. Res. Dev., Vol. 2. Pp. 354-363.

[Lei, 2017] Lei L. et al. (2017) Redundancy checking algorithms based on parallel novel exten-
          sion rule // Journal of Experimental & Theoretical Artificial Intelligence, Vol. 29,
          Issue 3, 2017.

[Said et al., 2017] Said A. S. et al. (2017) Using Text Mining Techniques for Extracting Infor-
          mation from Research // Intelligent Natural Language Processing: Trends and Appli-
          cations, Vol. 1. Pp. 373-397.

[Salloum et al., 2017] Salloum S. A. (2017) A survey of text mining in social media: Facebook
         and Twitter perspectives // Advances in Science, Technology and Engineering Systems
         Journal, Vol. 2. Pp. 127-133.

[Clayton et al., 2011] Clayton S. et al. (2011) Experiments in Automatic Text Summarisation
         Using Deep Neural Networks // Machine Learning, Fall, Vol. 1, 2011.
         Available at: https://www.semanticscholar.org/paper/
         545-Machine-Learning-%2C-Fall-2011-Final-Project-in-Ben-Rahul/
         8f4f64e15553baf9fd0c2933c631b78c97c8f0bc

[Radev et al., 2002] Radev D. R., Hovy, E., and McKeown K. (2002) Introduction to the special
         issue on summarisation. //Comput. Linguist, Vol. No 28. Pp. 399–408.

[Dragomir, 2012] Dragomir R.R. (2012) Single-document and multi-document summary evalua-
         tion via relative utility. University of Michigan, Ann Arbor, MI 48109, 2012. Available at:
        https://www.eecs.umich.edu/techreports/cse/2007/CSE-TR-538-07.pdf

[GitHub “Rouge”] Application «Rouge», «GitHub». Available at: https://github.com/kylehg/
         summariser/blob/master/rouge/ROUGE-1.5.5.pl

[Derczynski, 2016] Derczynski L. (2016) Complementarity, F-score, and NLP Evaluation // Pro-
         ceedings of the International Conference on Language Resources and Evaluation, Vol.
         1. Pp. 1-6.

[Sujit et al., 2013] Sujit R., Sujit V. et al. (2013) Classification of News and Research Articles
           Using Text Pattern Mining // IOSR Journal of Computer Engineering (IOSR-JCE),
           Vol. 14, Issue 5. Pp. 120-126.

[Luhn, 1958] Luhn H. P. (1958) The automatic creation of literature abstracts // IBM Journal
         of Research and Development, Vol. 2, No. 2. Pp. 159-165.

[Fomin et al., 2019] Fomin V., Osochkin A., and Zhuk Y. (2019) Frequency and morphologi-
         cal patterns of recognition and thematic classification of essay and full text scientific
         publications // NESinMIS-2019, 12-Jul-2019, CEUR-WS, Vol. 2401. Pp. 69-84, 2019.

[Evdokimenko, 2013] Evdokimenko E. Y. (2013) The Concept of Information Noise in the Social
        and Human Sciences // Molodoy ucheniy. Vol. 10. Pp. 564-566, 2013. Available at: https:
        //moluch.ru/archive/57/7765/

[Al-Emran, 2017] Al-Emran M., Shaalan K. (2017) Academics' awareness towards mobile learning
        in Oman // Int. J. Com. Dig. Sys., Vol. 6.

[Shari, 2018] Shari T. (2018) Optimize Optimize the A Commentary. Journal Search Voice, Vol.
          1, Pp. 1-6.

[Molchanov, 2015] Molchanov A. N. et al. (2015) A mathematical model of natural language
        text that takes into account the coherence property // Internet-journal “Science of
        science”. Vol. 7, No 1, 2015. Available at: https://naukovedenie.ru/PDF/70TVN115.pdf

[Jansen, 2010] Jansen, B. J. and Rieh, S (2010) The Seventeen Theoretical Constructs of In-
          formation Searching and Information Retrieval// Journal of the American Society for
          Information Sciences and Technology. Vol 61. Pp. 1517-1534.

[Yogesh, 2014] Yogesh M. et al. (2014) Analysis of Sentence Scoring Methods for Extractive
         Automatic Text Summarisation // Proceedings of the International Conference on In-
         formation and Communication Technology for Competitive Strategies. – ACM: NY,
         USA, vol. 1. Pp. 89-97, 2014.

[Gambhir, 2016] Gambhir M., Gupta V. (2016) Recent automatic text summarisation tech-
        niques: a survey // Artificial Intelligence Review, Vol. 1. Pp. 1-66.

[Tarasov, 2010] Tarasov S. D. (2010) Modern methods of automatic referencing // Scientific and
         technical statements of SPbPU, Journal “Computer science, telecommunications and
         management”, No 6. Pp. 1-9.

[Internet portal “Portal”, “NLP text-mining”] Available at: http://rxnlp.com (updated 30.09.2019).

[Internet portal “Bookzip”] Available at: https://bookzip.ru/
          boeviki-ostrosjuzhetnaja-literatura/

[Getahun, 2017] Getahun T. et al. (2017) Automatic Amharic Text Summarisation using
        NLP Parser // International Journal of Engineering Trends and Technology (IJETT),
        Vol. 53. Pp. 52-58.

[Collobert, 2008] Collobert, R. and Weston, J.(2008) A unified architecture for natural lan-
         guage processing: Deep neural networks with multitask learning. // Conference: Ma-
         chine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008),
         Helsinki, Finland, June vol. 1, Pp. 1-8.



