 Keyword Extraction from Single Russian Document

               Mikhail Vadimovich Sandul                                  Elena Georgievna Mikhailova
                      ITMO University                           Saint Petersburg State University, ITMO University
                    St. Petersburg, Russia                                     St. Petersburg, Russia
                    sandulmv@yandex.ru                                         e.mikhaylova@spbu.ru


    Abstract—The problem of automatic keyword and key phrase extraction from a text arises in many tasks of information retrieval and text mining. The task is to identify the terms that best describe the subject of a document. There is currently a lot of research on this problem, but algorithms are mostly developed for texts in English, and the possibility of applying them to Russian texts has not been sufficiently investigated. One of the best-known algorithms for keyword extraction is RAKE. This article examines the effectiveness of the RAKE algorithm on texts in Russian. The work also applies a hybrid method that uses the Γ-index metric to weigh the phrases obtained by the RAKE algorithm. The article shows that this hybrid algorithm is more accurate than RAKE while reducing the number of selected phrases.

                      I. INTRODUCTION

    Organizing data into a database requires spending effort on database design, data organization, etc. These expenses are not necessarily related to the analysis of the data itself: we spend additional time and effort on preprocessing before we can get new knowledge from the data. Worse, organizing some types of data into a database can lead to a loss of content. It is usually impossible to convert text documents into a table (in the case of a relational DB) without losing some of their meaning. Thus, texts are usually stored unchanged as BLOB fields, and we can conclude that plain databases are not an efficient tool for analysing texts.

    As we can see, it is hard to bring structure into a text for analysis. However, we would like to extract useful information from it anyway. The branch of science which deals with such problems is called Text Mining [1].

    Nowadays many real-world problems are subjects of study in Text Mining, ranging from typical Data Mining problems such as classification and clustering to exclusively Text Mining problems like automatic keyword extraction and annotation [1].

    The amount of unstructured text information grows constantly. Keywords can help to extract new knowledge from it, because they let us grasp the meaning of a text without reading it. Automatic keyword extraction is one of the subjects of study of Text Mining.

    Keywords can be applied to improve the functionality of information retrieval systems. For example, the Phrasier [8] system uses keywords to find documents related to the primary one, and the words themselves serve as links between documents, which allows a user to navigate between documents much faster. Another example is Keyphind [9], a search engine for digital libraries, where keywords serve as the main source for building the index. Keywords are also used to enrich the presentation of search results.

    Despite the wide range of applications, most documents do not have assigned keywords. Most approaches focus on manual assignment of keywords. Usually this procedure is done by specialists in the relevant field; in their work they may use a fixed taxonomy or rely on the author's judgment to provide a representative list [2]. The main goal of further investigation is the automation of this process.

    Early approaches to automatic keyword extraction focused on corpus-oriented statistics of individual words. However, they have some flaws. For instance, some words that are evaluated as keywords within the whole corpus might not be keywords within a single document or a few documents [2]. Also, such methods cannot help us find keywords consisting of two or more words. To avoid these drawbacks, we focus on approaches that operate on a single document.

    We define a keyword as a sequence of one or more words which provides a representation of the document's content. An algorithm should return a list of such sequences; ideally, this list represents, in condensed form, the essential content of the document. However, there is no need for these sequences to form coherent text. To avoid ambiguity, in the rest of the article we refer to such sequences as key phrases, to highlight that they may contain more than one word. We use the term "keyword" only to emphasize that we are dealing with a single-word sequence.

    To evaluate the efficiency of an algorithm we use the Precision and Recall metrics (a minimal code sketch of this computation is given at the end of this section). Recall is the fraction of correctly extracted keywords among all keywords in a document:

$$\text{Recall} = \frac{\text{correctly extracted}}{\text{all keywords in a text}}$$

Precision is the fraction of correctly extracted keywords among all extracted keywords:

$$\text{Precision} = \frac{\text{correctly extracted}}{\text{all extracted}}$$

    Different approaches to automatic keyword extraction exist: supervised and unsupervised machine learning algorithms, statistical approaches, and approaches based on linguistic features. All of them can be used to solve our problem. These methods can be divided into groups according to whether they are domain-dependent or domain-independent, corpus-oriented or designed for single documents, and whether they require a training set. This work considers domain-independent methods which do not require a training set.
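    For reference, here is a minimal Python sketch of how these two metrics can be computed (our illustration, not the authors' code); it uses exact matching of already-normalized phrases for simplicity, while the containment-based matching actually used in the experiments is described in Section III-B:

```python
# Minimal sketch: Precision and Recall of a set of extracted phrases
# against a reference set, using exact matching for simplicity.
def precision_recall(extracted, reference):
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)          # correctly extracted
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

print(precision_recall({"galois group", "permutation"},
                       {"galois group", "riemann surface"}))  # (0.5, 0.5)
```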

         II. RELATED WORK AND BACKGROUND

A. TextRank

    TextRank [3] is a versatile graph-based method that can be used not only for key phrase extraction but also for creating an annotation for a text.

    Representing data as a graph is one of the approaches to retrieving information. The basic idea implemented by a graph-based ranking model is that of "voting" or "recommendation". When an edge exists from one vertex to another, it basically casts a vote for that other vertex. The more votes a vertex gets, the more important it becomes. Moreover, the importance of a vertex determines the importance of its "votes", and this information is also taken into account by the ranking model.

    Formally, let G = (V, E) be a directed graph with the set of vertexes V and the set of edges E, where E is a subset of V × V. Let In(V_i) be the set of vertexes which have an edge going to V_i, and Out(V_i) the set of vertexes to which an edge leads from V_i. We define the score of a vertex V_i as follows:

$$S(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}$$

where d ∈ [0; 1] is a damping factor, which integrates into the model the probability of jumping from the given vertex to another random vertex in the graph. The algorithm starts with arbitrary values of S(V_i) for each node; the computation iterates until the change between iterations falls below a given threshold.

    To apply this method to a text, we first need to introduce rules according to which the graph is built: we must define the units that will be treated as nodes and the relations between them. We can distinguish the following steps:

 1) Identify text units that best define the task at hand, and add them as vertexes in the graph.
 2) Identify relations that connect such text units, and add them as edges between the respective vertexes.
 3) Iterate the graph-based ranking algorithm until it converges.
 4) Sort vertexes by their score and use this score for selection decisions.

    Such a definition allows us to apply the method to a wide range of problems. In particular, we will show how to apply the algorithm to key phrase extraction (a minimal code sketch of the ranking iteration is given at the end of this section).

    First, we have to decide what to use as units when building the graph. The authors suggested nouns and adjectives for this role; they tried different part-of-speech filters, and this one showed the best performance. Two lexical units are related if they co-occur within a window of at most N words, where N can be set anywhere from 2 to 10. As a result, the authors constructed an unweighted, undirected graph. The initial score of each vertex was set to 1, and the algorithm described earlier was run. Once a final score was obtained for each vertex in the graph, the first T vertexes with the highest scores were selected; T may be set to any fixed value or may depend on various factors, for instance on the text size. Then, sequences of adjacent keywords are collapsed into key phrases.

    The dataset used in the experiments is a collection of 500 abstracts from the Inspec database with the corresponding manually assigned key phrases. The results are shown in Tab. 1.

    It should be noted that, by the definition of the algorithm, the number of vertexes in the graph depends on the number of distinct words in the text, so for large texts the algorithm can take very long to run.

                       Precision   32.1%
                       Recall      36.2%

Tab. 1. TextRank evaluations

B. PageRank on Synonym Networks

    The problem of graph growth can be partially solved by merging words into groups, which can also improve the efficiency of the algorithm. One way to form groups is to define equivalence classes. The authors of PageRank on Synonym Networks [7] suggest merging words by their meaning, which requires a thesaurus. Each equivalence class is represented by a unique number, and every word in the text is replaced by the number corresponding to its class. Then the algorithm described in the previous paragraph is applied to the modified text.

C. RAKE

    RAKE [2] (Rapid Automatic Keyword Extraction) is a comparatively effective algorithm that operates on individual documents, is easily applied to new domains, and does not depend on the structure of a document or on specific grammar rules. RAKE is based on the observation that key phrases frequently contain multiple words but rarely contain standard punctuation or stop-words. By stop-words we denote function words like "and", "of", "the", and other words with minimal lexical meaning. It is worth mentioning that such words are also ignored when building an index in information retrieval systems. The algorithm requires a list of such words.

    In addition to a stop-word list, sets of phrase- and word-delimiters are needed. RAKE uses stop-words and phrase-delimiters to partition a document into a set of candidate key phrases; word-delimiters are used to split candidates into individual words.

    RAKE begins key phrase extraction by partitioning the text into candidate key phrases. In the next step, the set of unique words is obtained by splitting the candidates into individual words with word-delimiters. After this, a score is calculated for each word. The authors propose metrics based on word frequency and word degree. Word frequency, freq(w), is the number of occurrences of the word in the candidates. Word degree, deg(w), is the total length of all candidates which contain the word. The score of a word is defined as

$$s(w) = \frac{deg(w)}{freq(w)}$$

Then a score is calculated for each candidate key phrase, defined as the sum of the scores of its member words (a minimal code sketch follows at the end of this section).

    The authors evaluated the efficiency of the algorithm on the Inspec set, a set of 500 abstracts with manually assigned key phrases; the same dataset was used in testing TextRank. They also compared the performance of the algorithm with different stop-word lists. The results are shown in Tab. 2.
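    Returning to TextRank (Section II-A): here is a minimal Python sketch of the score iteration (our illustration, not the authors' implementation); the graph is a dict mapping each vertex to the set of its neighbours in an undirected co-occurrence graph, so In and Out coincide, and d = 0.85 is the usual choice of damping factor:

```python
# Minimal sketch of the TextRank score iteration on an undirected graph.
# graph: dict mapping a vertex to the set of its neighbours.
def textrank_scores(graph, d=0.85, threshold=1e-6):
    scores = {v: 1.0 for v in graph}          # initial score 1 for each vertex
    while True:
        prev = scores.copy()
        for v in graph:
            scores[v] = (1 - d) + d * sum(
                prev[u] / len(graph[u])       # a vote is shared among u's edges
                for u in graph[v] if graph[u]
            )
        if max(abs(scores[v] - prev[v]) for v in graph) < threshold:
            return scores                     # converged below the threshold

g = {"galois": {"group", "theory"}, "group": {"galois"}, "theory": {"galois"}}
print(textrank_scores(g))
```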

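    Similarly, a minimal Python sketch of RAKE's candidate generation and word scoring as described in Section II-C (an illustration with simplified delimiter sets, not the reference implementation):

```python
import re

# Minimal RAKE sketch: stop-words and punctuation partition the text into
# candidate key phrases; each word is scored by deg(w)/freq(w), and each
# candidate by the sum of the scores of its member words.
def rake(text, stopwords):
    candidates = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):  # phrase-delimiters
        phrase = []
        for w in fragment.split():                             # word-delimiters
            if w not in stopwords:
                phrase.append(w)
            elif phrase:
                candidates.append(tuple(phrase))
                phrase = []
        if phrase:
            candidates.append(tuple(phrase))
    freq, deg = {}, {}
    for cand in candidates:
        for w in cand:
            freq[w] = freq.get(w, 0) + 1        # occurrences among candidates
            deg[w] = deg.get(w, 0) + len(cand)  # total length of candidates with w
    phrase_score = {c: sum(deg[w] / freq[w] for w in c) for c in candidates}
    return sorted(((" ".join(c), s) for c, s in phrase_score.items()),
                  key=lambda kv: -kv[1])

print(rake("Compatibility of systems of linear constraints over the set "
           "of natural numbers", {"of", "over", "the"}))
```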
    As we can see, the difference in Precision between the generated stop-word list based on keyword adjacency (KA stoplist) and Fox's stop-word list is quite noticeable. This means that RAKE strongly depends on the provided set of stop-words.

           stop-word list          Precision     Recall
           KA stop-word list          33.7%      41.5%
           Fox stop-word list         26.0%      42.2%

Tab. 2. RAKE evaluations

D. Position Weighing

    Position Weighing [6] is based on the fact that a word's position plays an important role in linguistics. Words in different positions carry different entropy. For example, if a word appears in an introduction or in a conclusion, it usually carries more information. This observation also applies to words within the same paragraph: words in leading and summarizing sentences are often more important than those in other positions of the same paragraph.

    The authors propose an approach called Position Weight. For each word occurrence they suggest computing a score which depends on the paragraph and the sentence where the occurrence is found and on the form of the word:

$$pw(t_i) = pw(t_i, p_j) \cdot pw(t_i, s_k) \cdot pw(t_i, w_r)$$

where pw(t_i, p_j) represents the score of an occurrence t_i within paragraph j; pw(t_i, s_k) the score of the occurrence within sentence k; and pw(t_i, w_r) the score of the occurrence as word form r. The total weight of a term t in a document d is then computed as the sum over all m of its occurrences:

$$PW(t, d) = \sum_{i=1}^{m} pw(t_i)$$

    We can conclude from the definition of the algorithm that it relies on the structure of a document. We should also notice that when the assumptions about document structure are violated (for example, there are no clues to determine the type of sentences and paragraphs), the algorithm reduces to simply calculating the frequency of a word.

E. Statistical Approach

    1) Weighing of Individual Words: In addition to the frequency of a word, one can also use information about the distribution of the word in the text. The authors of this approach suggest using this information to determine the significance of individual words. The metrics proposed in [5] are based on the phenomenon of the attraction of keywords in a text. As an example, the authors give the distribution of distances to the nearest neighbor for the word "NATURE" in the book "The Origin of Species" by Charles Darwin (Fig. 1). The graph shows that the distribution of distances resembles an exponential distribution; for noise words this property is much weaker. Thus, keywords, unlike noise words, tend to form clusters in a text.

Fig. 1. Distribution of distances to the closest neighbor

    2) σ-index: To study the spatial distribution of a given word in a text, we first denote by t_i the absolute position in the corpus of the i-th occurrence of the word. Thus we obtain a sequence of positions {t_0, t_1, ..., t_{n+1}}, assuming there are n + 2 occurrences. Next, we compute the average distance between two successive occurrences:

$$\mu = \frac{1}{n+1} \sum_{i=0}^{n} (t_{i+1} - t_i)$$

The following step is to obtain the standard deviation:

$$s = \sqrt{\frac{1}{n+1} \sum_{i=0}^{n} \left((t_{i+1} - t_i) - \mu\right)^2}$$

To eliminate the dependence on the frequency of occurrence for different words, the authors suggest normalizing the token spacing, thus defining the σ-index as

$$\sigma = \frac{s}{\mu}$$

The standard deviation grows rapidly as the inhomogeneity of the spacings t_{i+1} - t_i increases. However, the value of σ can be strongly affected by a change in a single occurrence position; also, high values of σ do not always imply a cluster concentration.

    3) Γ-index: To overcome these problems, the authors suggest considering the average separation around the occurrence at t_i, defined as

$$d(t_i) = \frac{(t_{i+1} - t_i) + (t_i - t_{i-1})}{2} = \frac{t_{i+1} - t_{i-1}}{2}$$

For each position t_i the cluster index γ is computed:

$$\gamma(t_i) = \begin{cases} \dfrac{\mu - d(t_i)}{\mu}, & \text{if } d(t_i) < \mu \\ 0, & \text{otherwise} \end{cases}$$

Finally, the score of a word is obtained as the average of all its cluster indexes:

$$\Gamma = \frac{1}{n} \sum_{i=1}^{n} \gamma(t_i)$$

The Γ-index is more stable than the σ-index; however, it is more time-consuming to evaluate.
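    To make these definitions concrete, here is a minimal Python sketch of the Γ-index computation (our illustration; positions are assumed to be 0-based token indexes of a word's occurrences):

```python
# Minimal sketch of the Γ-index for one word, given the sorted positions
# t[0..n+1] of its occurrences in the text (at least three occurrences).
def gamma_index(t):
    n = len(t) - 2
    gaps = [t[i + 1] - t[i] for i in range(n + 1)]
    mu = sum(gaps) / (n + 1)                       # average spacing
    total = 0.0
    for i in range(1, n + 1):                      # interior occurrences only
        d = (t[i + 1] - t[i - 1]) / 2              # average separation around t_i
        total += (mu - d) / mu if d < mu else 0.0  # cluster index γ(t_i)
    return total / n

print(gamma_index([3, 5, 6, 40]))    # clustered occurrences -> positive Γ
print(gamma_index([3, 15, 27, 39]))  # evenly spread occurrences -> Γ = 0
```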

    4) Extracting of Key Phrases: The metrics described above make it possible to rank only individual words. In [4], several ways were proposed to generalize the metrics to multi-word phrases:

    First way: rank whole word sequences. The weight of a sequence is calculated in the same way as for a single word. The problem is that the frequency of the desired phrases may be insufficient for analysis.

    Second way: make up all possible unique phrases of a given length. To compute the score of a phrase, we sum up the scores of the individual words of which the phrase consists. The problem with this approach is that the number of phrases grows exponentially with the given length.

                     III. EXPERIMENTS

A. Data Description

    We used the book "Abel's Theorem in Problems and Solutions" by V.B. Alekseev as test data. The text consists of 39519 words, 1576 of which are unique. The alphabetical index at the end of the book was used to evaluate the retrieval capabilities of the different key phrase extractors. The choice of test data is explained, firstly, by the large volume of the text, which allows us to give more accurate estimates of the efficiency of the algorithms, and secondly, by the requirement of the statistical approach on the number of occurrences of the analyzed words (for small texts this condition may not be fulfilled even for a single word).

B. Getting Evaluations

    We measured the efficiency of the algorithms in terms of Recall and Precision. The result of each algorithm is a set of phrases. To obtain the Precision estimate, we must determine how many key phrases are in the result set: we assume that a phrase from the result set is a key phrase if it fully contains any key phrase from the reference set, regardless of word order. To obtain the Recall estimate, we must determine how many key phrases from the reference set are in the result set: we assume that a phrase from the reference set is in the result set if it is fully contained in any phrase of the result set, again regardless of word order. In addition, all words in all phrases in both sets were stemmed, which solves the problem of comparing different forms of a word.

C. RAKE

    The algorithm requires a set of stop-words; we used a publicly available one from the ranks.nl site. To avoid the problem of different word forms, we used the SnowballStemmer from the nltk library for Python: every word in every phrase was stemmed, and for each stemmed phrase we kept the count of unstemmed phrases from which it could be obtained. These numbers were used in computing the scores of individual words. The results are shown in Tab. 3:

                       Precision    31.8%
                       Recall       74.2%

Tab. 3. Evaluations for RAKE without any modification

    The result set contained 4320 phrases, all of them unique in terms of word forms. However, we faced a problem: the output contained long phrases; for example, the longest one consisted of 10 words. We would like to have shorter phrases, so it was decided to set a limit on the maximum phrase length. All phrases longer than the limit were split into shorter ones: from the words that make up a long phrase, all possible phrases of the given length were formed without taking the word order into account. A corresponding limit was also set on the maximum degree of each word to support these changes.

    Next, we studied the dependence of the efficiency on the restriction on the length of a phrase. As can be seen in Fig. 2, the highest Precision was reached when the maximal phrase length was limited to 4 or 5 words. Tab. 4 shows the exact evaluations for RAKE with phrase length limitations.

Fig. 2. The dependence of the efficiency on the restriction on the length of a phrase

        Max. length     Precision   Recall   NWords
        4               40.8%       73.0%    9327
        5               41.5%       73.6%    7293

Tab. 4. RAKE evaluations with phrase length limitations

    It makes sense to consider only such restrictions on length, since at these limits Precision reaches its highest values, with Recall close to the maximum. It can be seen that the number of words extracted by the algorithm increased dramatically. At the same time, the number of correctly extracted words also increased, because there is an increase in Precision.

    This way of generating candidates, in theory, can cause an exponential growth of their number. By the length of a phrase we understand the number of words it consists of. Let k be the restriction on the maximum length of a phrase, and m the maximum length of a phrase that appears in the output of the partitioning step. By the definition of the algorithm, the phrases obtained after partitioning do not intersect in the text, i.e., for any two different phrases, their positions in the text do not overlap.

    Let N be the number of words in the text. Then the number of such phrases is not greater than N/m. We assume that m > k. When generating new phrases, the word order is not taken into account, so the number of new phrases per partitioned phrase can be calculated as the number of combinations without repetitions:

$$C_m^k = \frac{m!}{(m-k)! \cdot k!}$$

Thus, we can give the following estimate of the new number of phrases:

$$\frac{N}{m} \cdot C_m^k$$

As we can see, the growth is proportional to the number of combinations. By the definition of the algorithm, the length of a key phrase on average cannot exceed the average sentence length. An average sentence consists of 10-11 words [10], which means that the average value of m will not exceed this number. In this way, we can conclude that the increase in the number of candidates will stay within reasonable limits for arbitrary texts.

    By limiting the size of the result set, we can improve the Precision of an algorithm, but discarding part of the results may decrease Recall. Fig. 3 shows the dependence of Precision and Recall on the number of selected candidates.

Fig. 3. Comparison of estimates for RAKE with 4- and 5-word phrase length limitations. The graph shows the dependence on the number of words with the highest ranks extracted

D. Statistical Approach

    The statistical approach provides a way of measuring the significance of individual words. Nevertheless, key phrases can consist of more than one word, so the question arises of how to form the set of candidates. The solutions proposed in [4] showed good results, but they require a large amount of memory to store all candidates. For example, the number of all possible key phrases formed of two words (without taking the order into account) equals the number of combinations. For the text used in this work, the number of unique words is around 1500, so there will be around 10^6 phrases of two words and around 10^9 phrases of three words. Actually, there is no need to store all candidates, as we only need the first T of them with the maximal score.

Fig. 4. Dependence of estimates for the statistical approach on the number of words with the highest ranks extracted when phrase length is limited to two words

Fig. 5. Dependence of estimates for the statistical approach on the number of words with the highest ranks extracted when phrase length is limited to three words

    Selection of candidates: We consider the case of selecting phrases of two words. The problem can be reduced to the following one. Given two sorted arrays A[1...n] and B[1...n], print all n² sums A[i] + B[j] of elements from the two arrays in descending order. There is a solution to this problem that runs in O(n² log n) time and requires O(n) space.
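    As an illustration, here is a minimal Python sketch of that solution (our code; Python's heapq is a min-heap, so sums are negated to simulate a max-heap). The description and correctness argument follow below.

```python
import heapq

# Print all n^2 sums A[i] + B[j] in descending order in O(n^2 log n) time
# and O(n) extra space. A and B must be sorted in descending order; the
# heap holds at most one frontier element per column of the sum matrix.
def print_sums_descending(A, B):
    n = len(A)
    # Start with the maximal element of each column j: (A[0] + B[j], (0, j)).
    heap = [(-(A[0] + B[j]), 0, j) for j in range(n)]
    heapq.heapify(heap)
    while heap:
        neg_sum, i, j = heapq.heappop(heap)  # the largest remaining sum
        print(-neg_sum)
        if i + 1 < n:                        # advance down column j
            heapq.heappush(heap, (-(A[i + 1] + B[j]), i + 1, j))

print_sums_descending([9, 7, 2], [8, 3, 1])  # 17 15 12 10 10 10 8 5 3
```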

    Description of the solution: We use a max-heap that stores tuples consisting of a sum A[i] + B[j] and the pair of indexes (i, j) which defines that sum. Elements in the heap are ordered by the first element of the tuple.

    At the beginning, the heap stores the pairs (A[0] + B[j], (0, j)) for every j.

    At each step, the top element is extracted from the heap; this is the element with the largest remaining sum. Suppose the indexes (i, j) were associated with that sum. We output the extracted sum and then push the new sum A[i + 1] + B[j], with the pair of respective indexes, into the heap; when the elements of A in a column are exhausted, nothing is pushed, and the remaining content of the heap is output at the end. The claimed time and space bounds follow directly from this description: the heap never holds more than n elements, and each of the n² extracted sums costs O(log n).

    Correctness of the solution: To prove the correctness of the solution it is convenient to consider the square matrix formed from all possible sums. We denote this matrix by C and form it as follows: C[i, j] = A[i] + B[j]. Since the elements of the arrays A and B are sorted in descending order, the elements of the matrix decrease from left to right and from top to bottom; in other words, ∀i, j: C[i, j] ≥ C[i + 1, j] and C[i, j] ≥ C[i, j + 1].

    At each step the following invariant is maintained: the heap stores the maximal not-yet-printed element of each column. Since the next maximal element that should get into the output lies in one of the columns, this proves the correctness.

    Efficiency of the algorithm: Before calculating scores, each word was stemmed with the SnowballStemmer from nltk version 3.2.2; words that differ only in form were thus merged into equivalence classes, and scores were calculated for these classes. We used the Γ-index in the experiments. In addition, at the stage of selecting words for weighing, we filtered by word length: words shorter than a specified threshold participated neither in the weighing nor in the construction of candidates. In the experiments with filtering, the threshold for the word length was set to 3. Experiments were carried out for phrases of two and three words.

    Fig. 4 and Fig. 5 show that the use of the filter has almost no impact on the Recall and Precision of the algorithm.

    The considered algorithm shows high Precision, which is due to the construction of the candidates: the Γ-index ranks individual words accurately, hence the top-scored words are most likely to be actual key phrases or to be contained in multi-word key phrases. Because of the way the candidates are constructed, and due to the definition of the efficiency metrics, the top-scored candidates are most likely to be counted as properly extracted key phrases.

    Based on the results of the experiments, it can be concluded that the Γ-index metric ranks individual words correctly. Nevertheless, artificially constructed phrases cannot ensure high Recall of the algorithm.

E. Hybrid Approach

    The advantage of the statistical approach is the high Precision of its ranking of single words; its main drawbacks are the complexity of constructing candidates and the inability of those candidates to ensure high Recall. On the other hand, the RAKE algorithm produces phrases that provide a relatively high Recall, but its ranking formula does not achieve high Precision when the number of candidates is limited. This suggests using the Γ-index for ranking single words while taking the phrases obtained as a result of the RAKE algorithm as candidates. Fig. 6 and Fig. 7 compare the corresponding methods; a small code sketch of the combination follows the figure captions.

Fig. 6. Comparison of the efficiency of the RAKE and Hybrid algorithms when the length of phrases is limited to 4 words. The graph shows the dependence of the estimates on the number of top-ranked words extracted for the considered methods

Fig. 7. Comparison of the efficiency of the RAKE and Hybrid algorithms when the length of phrases is limited to 5 words. The graph shows the dependence of the estimates on the number of top-ranked words extracted for the considered methods
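    A minimal Python sketch of the hybrid combination described above (our illustration; it reuses the hypothetical rake and gamma_index helpers from the earlier sketches, with deliberately simplified whitespace tokenization and no stemming):

```python
# Minimal sketch of the hybrid method: candidate phrases come from RAKE,
# but each phrase is re-scored by the sum of the Γ-indexes of its words.
def hybrid(text, stopwords, top=None):
    words = text.lower().split()
    positions = {}                        # word -> positions of its occurrences
    for pos, w in enumerate(words):
        positions.setdefault(w, []).append(pos)

    def word_score(w):
        t = positions.get(w, [])
        return gamma_index(t) if len(t) >= 3 else 0.0  # Γ needs >= 3 occurrences

    scored = [(phrase, sum(word_score(w) for w in phrase.split()))
              for phrase, _ in rake(text, stopwords)]
    scored.sort(key=lambda kv: -kv[1])
    return scored[:top] if top is not None else scored
```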

                     IV. CONCLUSION

    This work has shown that, with certain additions, the RAKE algorithm and the statistical approach can be used to extract key phrases from Russian texts. In addition, the paper proposed a new Hybrid method that uses the Γ-index to weigh the phrases obtained as a result of the RAKE algorithm, and it was shown that this algorithm achieves higher Precision and Recall compared to the other algorithms considered in the paper.

                     V. REFERENCES

1. A.A. Barsegyan, M.S. Kupriyanov, I.I. Kholod, et al. Analysis of Data and Processes. BHV-Petersburg, 2009, 510 p.
2. Michael Berry, Jacob Kogan. Text Mining: Applications and Theory. John Wiley and Sons, Ltd., 2010, 205 p.
3. Rada Mihalcea, Paul Tarau. TextRank: Bringing Order into Texts // Proceedings of EMNLP 2004 (ed. Lin D. and Wu D.), pp. 404-411.
4. Siddiqi S., Sharan A. Keyword and keyphrase extraction from single Hindi document using statistical approach // 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), 2015, pp. 713-718.
5. Juan P. Herrera, Pedro A. Pury. Statistical Keyword Detection in Literary Corpora // The European Physical Journal B, 2008, pp. 135-146.
6. Xinghua Hu, Bin Wu. Automatic Keyword Extraction Using Linguistic Features // Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), 2006, pp. 19-23.
7. Zhengyang Liu, Jianyi Liu, Wenbin Yao, Cong Wang. Keyword Extraction Using PageRank on Synonym Networks // 2010 International Conference on E-Product E-Service and E-Entertainment, 2010, pp. 1-4.
8. Jones S., Paynter G. Automatic extraction of document keyphrases for use in digital libraries: evaluation and applications // Journal of the American Society for Information Science and Technology, 2002.
9. Gutwin C., Paynter G., Witten I., Nevill-Manning C., Frank E. Improving browsing in digital libraries with keyphrase indexes // Decision Support Systems 27(1-2), 1999, pp. 81-104.
10. S.A. Sharov. Frequency dictionary [Online]. Available: http://www.artint.ru/projects/frqlist.php [Accessed: 20.05.2017].



