=Paper=
{{Paper
|id=Vol-2135/SEIM_2018_paper_30
|storemode=property
|title=Keyword extraction from single Russian document
|pdfUrl=https://ceur-ws.org/Vol-2135/SEIM_2018_paper_30.pdf
|volume=Vol-2135
|authors=Mikhail Sandul,Elena Mikhailova
}}
==Keyword extraction from single Russian document==
Keyword Extraction from Single Russian Document

Mikhail Vadimovich Sandul, ITMO University, St. Petersburg, Russia, sandulmv@yandex.ru
Elena Georgievna Mikhailova, Saint Petersburg State University and ITMO University, St. Petersburg, Russia, e.mikhaylova@spbu.ru

Abstract. The problem of automatic keyword and key phrase extraction from a text arises in many information retrieval and text mining tasks. The task is to identify the terms that best describe the subject of a document. There is currently a lot of research on this problem, but algorithms are mostly developed for texts in English, and the possibility of applying them to Russian texts has not been sufficiently investigated. One of the best-known keyword extraction algorithms is RAKE. This article examines the effectiveness of the RAKE algorithm on Russian texts. It also applies a hybrid method that uses the Γ-index metric to weigh the phrases produced by RAKE. The article shows that this hybrid algorithm is more accurate than RAKE while reducing the number of selected phrases.

I. INTRODUCTION

Organizing data into a database requires spending effort on database design, data organization, and so on. These expenses are not necessarily related to the analysis of the data itself, so we spend additional time and effort on preprocessing before we can extract new knowledge from the data. Worse still, organizing some types of data into a database can lead to a loss of content: it is usually impossible to convert text documents into a table (in the case of a relational DB) without losing part of their meaning, so texts are typically stored unchanged as BLOB fields. As a result, databases by themselves are not an efficient tool for analysing texts.

As we can see, it is hard to bring structure into a text for analysis; nevertheless, we would like to extract useful information from it anyway.
The branch of science that deals with such problems is called Text Mining[1]. Nowadays many real-world problems are studied within Text Mining, from typical Data Mining problems such as classification and clustering to problems specific to Text Mining, such as automatic keyword extraction and annotation[1].

The amount of unstructured text information grows constantly. Keywords help to extract new knowledge from it, because they let us understand the meaning of a text without reading it. Automatic keyword extraction is one of the subjects studied in Text Mining.

Keywords can be applied to improve the functionality of information retrieval systems. For example, the Phrasier[8] system uses keywords to find documents related to the primary one, and the keywords themselves serve as links between documents, which lets a user navigate between documents much faster. Another example is Keyphind[9], a search engine for digital libraries, where keywords are the main source for building the index; they are also used to enrich the presentation of search results.

Despite the wide range of applications, most documents do not have assigned keywords. Most approaches are focused on manual assignment of keywords. Usually this procedure is done by specialists in the relevant field, who may use a fixed taxonomy or rely on the author's judgment to provide a representative list[2]. The main goal of further research is the automation of this process.

Early approaches to automatic keyword extraction focused on corpus-oriented statistics of individual words. However, they have some flaws. For instance, a word may be ranked as a keyword for the whole corpus while not being a keyword of any particular document, and vice versa[2]. Also, such methods cannot help us find keywords consisting of two or more words. To avoid these drawbacks, we focus on approaches that operate on a single document.

We define a keyword as a sequence of one or more words that provides a representation of the document's content. An algorithm should return a list of such sequences. Ideally, this list represents the essential content of the document in condensed form; however, there is no need for these sequences to form coherent text. To avoid ambiguity, in the rest of the article we call such sequences key phrases, to highlight that they may contain more than one word, and use the term "keyword" only to emphasize that we are dealing with a single-word sequence.

To evaluate the efficiency of an algorithm we use the Precision and Recall metrics. Recall is the fraction of correctly extracted key phrases among all key phrases of a document:

    Recall = (correctly extracted) / (all key phrases in a text)

Precision is the fraction of correctly extracted key phrases among all extracted phrases:

    Precision = (correctly extracted) / (all extracted phrases)

Different approaches to automatic keyword extraction exist: supervised and unsupervised machine learning algorithms, statistical approaches, and approaches based on linguistic features. All of them can be used to solve our problem. These methods can be divided into groups depending on whether they are domain-dependent or domain-independent, corpus-oriented or designed for single documents, and whether they require a training set. In this work we consider domain-independent methods that do not require a training set.

II. RELATED WORK AND BACKGROUND

A. TextRank

TextRank[3] is a flexible graph-based method that can be used not only for key phrase extraction but also for creating an annotation of a text.

Representing data as a graph is one of the approaches to retrieving information. The basic idea implemented by a graph-based ranking model is that of "voting" or "recommendation". When an edge goes from one vertex to another, it casts a vote for that other vertex. The more votes a vertex gets, the more important it becomes. Moreover, the importance of a vertex determines the importance of its votes, and this information is also taken into account by the ranking model.

Formally, let G = (V, E) be a directed graph with the set of vertexes V and the set of edges E, where E is a subset of V × V. Let In(V_i) be the set of vertexes with an edge leading to V_i, and Out(V_i) the set of vertexes to which an edge leads from V_i. The score of a vertex V_i is defined as follows:

    S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}

where d ∈ [0; 1] is a damping factor, which integrates into the model the probability of jumping from the given vertex to another random vertex of the graph. The algorithm starts with arbitrary values of S(V_i) for each node; the computation iterates until convergence below a given threshold is achieved.

To apply this method to a text, we first need to introduce rules for building a graph: we have to define the units that will be treated as nodes and the relations between them. The following steps can be distinguished:

1) Identify text units that best define the task at hand, and add them as vertexes of the graph.
2) Identify relations that connect such text units, and add them as edges between the respective vertexes.
3) Iterate the graph-based ranking algorithm until it converges.
4) Sort vertexes by their score and use this score for selection decisions.

Such a definition allows the method to be applied to a wide range of problems; in particular, we will show how it is applied to key phrase extraction.

First, we have to decide what to use as units when building the graph. The authors suggested nouns and adjectives for this role: they tried different part-of-speech filters, and this one showed the highest performance. Two lexical units are related if they co-occur within a window of at most N words, where N can be set anywhere from 2 to 10. As a result, the authors constructed an unweighted, undirected graph. The initial score of each vertex was set to 1, and the algorithm described above was run. Once a final score was obtained for each vertex, the first T vertexes with the highest scores were selected. T may be set to any fixed value or may depend on various factors, for instance on the text size. Finally, sequences of adjacent keywords are collapsed into key phrases.

The dataset used in the experiments is a collection of 500 abstracts from the Inspec database with the corresponding manually assigned key phrases. Results are shown in Tab. 1.

    Tab. 1. TextRank evaluations
    Precision  32.1%
    Recall     36.2%

It should be noted that, by the definition of the algorithm, the number of vertexes in the graph depends on the number of distinct words in the text, so for large texts the algorithm can take a long time to run.
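To make the iteration above concrete, here is a minimal Python sketch of the TextRank scoring loop. It assumes the co-occurrence graph has already been built (with the part-of-speech filter and window size chosen as described) and is given as an adjacency mapping; since that graph is undirected, In(V) and Out(V) coincide. This is an illustration of the formula, not the authors' implementation.

```python
def textrank_scores(graph, d=0.85, eps=1e-6, max_iter=100):
    """graph: dict mapping each vertex to the set of its neighbours."""
    scores = {v: 1.0 for v in graph}           # initial score 1 for every vertex
    for _ in range(max_iter):
        new_scores = {}
        for v in graph:
            # each neighbour votes with its score divided by its own degree
            rank = sum(scores[u] / len(graph[u]) for u in graph[v] if graph[u])
            new_scores[v] = (1 - d) + d * rank
        converged = max(abs(new_scores[v] - scores[v]) for v in graph) < eps
        scores = new_scores
        if converged:
            break
    return scores

# Toy example: words co-occurring within the chosen window share an edge.
graph = {
    "linear": {"equations", "systems"},
    "systems": {"linear", "equations"},
    "equations": {"linear", "systems", "differential"},
    "differential": {"equations"},
}
top = sorted(textrank_scores(graph).items(), key=lambda p: -p[1])
print(top)  # the first T vertexes would be kept as keyword candidates
```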
B. PageRank on Synonym Networks

The problem of graph growth can be partially solved by merging words into groups, which can also improve the efficiency of the algorithm. One way to form groups is to define equivalence classes, and the authors suggest merging words by their meaning. PageRank on Synonym Networks[7] implements this idea; it requires a thesaurus. Each equivalence class is represented by a unique number, every word in the text is replaced by the number of its class, and the algorithm described in the previous paragraph is then applied to the modified text.

C. RAKE

RAKE[2] (Rapid Automatic Keyword Extraction) is a relatively effective algorithm that operates on individual documents, is easily applied to new domains, and does not depend on the structure of a document or specific grammar rules. RAKE is based on the observation that key phrases frequently contain multiple words but rarely contain standard punctuation or stop-words. By stop-words we denote function words like "and", "of", "the", and other words with minimal lexical meaning. It is worth mentioning that such words are also ignored when building an index in information retrieval systems. The algorithm requires a list of such words.

In addition to a stop-word list, sets of phrase delimiters and word delimiters are needed. RAKE uses stop-words and phrase delimiters to partition a document into a set of candidate key phrases; word delimiters are used to split candidates into individual words.

RAKE begins key phrase extraction by partitioning the text into candidate key phrases. Next, the set of unique words is obtained by splitting the candidates into individual words with word delimiters. After this, a score is calculated for each word. The authors propose metrics based on word frequency and word degree. Word frequency, freq(w), is the number of occurrences of the word in candidates. Word degree, deg(w), is the total length of all candidates that contain the word. The score of a word is defined as

    s(w) = \frac{deg(w)}{freq(w)}

Then a score is calculated for each candidate key phrase, defined as the sum of the scores of its member words.

The authors evaluated the efficiency of the algorithm on the Inspec set, the same collection of 500 abstracts with manually assigned key phrases that was used to test TextRank. They also compared the performance of the algorithm with different stop-word lists. The results are shown in Tab. 2.

    Tab. 2. RAKE evaluations
    Stop-word list       Precision  Recall
    KA stop-word list    33.7%      41.5%
    Fox stop-word list   26.0%      42.2%

As we can see, the difference in Precision between the generated stop-word list based on keyword adjacency (KA stoplist) and Fox's stop-word list is quite noticeable. This means that RAKE strongly depends on the provided set of stop-words.
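The following is a simplified Python sketch of the RAKE steps just described: partitioning into candidates, computing freq(w) and deg(w), and scoring phrases. The regular expression and the tiny stop-word set are illustrative stand-ins, not the delimiters or stop list used in the paper.

```python
import re
from collections import defaultdict

STOPWORDS = {"и", "в", "на", "с", "для", "что"}   # stand-in; a real Russian list is far larger

def rake(text, stopwords=STOPWORDS):
    # 1. Partition the text into candidate key phrases: stop-words and
    #    punctuation act as phrase delimiters, whitespace splits words.
    tokens = re.split(r"[^\w-]+", text.lower())
    candidates, current = [], []
    for tok in tokens:
        if not tok or tok in stopwords:           # delimiter: close the current candidate
            if current:
                candidates.append(tuple(current))
                current = []
        else:
            current.append(tok)
    if current:
        candidates.append(tuple(current))

    # 2. Compute freq(w) and deg(w) over all candidate phrases.
    freq, deg = defaultdict(int), defaultdict(int)
    for phrase in candidates:
        for w in phrase:
            freq[w] += 1
            deg[w] += len(phrase)

    # 3. Score each unique candidate as the sum of deg(w)/freq(w) over its words.
    scored = {p: sum(deg[w] / freq[w] for w in p) for p in set(candidates)}
    return sorted(scored.items(), key=lambda kv: -kv[1])
```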
D. Position Weighing

Position Weighing[6] is based on the fact that a word's position plays an important role in linguistics: words in different positions carry different amounts of information. For example, a word that appears in an introduction or in a conclusion usually carries more information. The same observation applies to words within a paragraph: words in the leading and summarizing sentences are often more important than those in other positions of the same paragraph.

The authors propose an approach called Position Weight. For each word occurrence they compute a score that depends on the paragraph and the sentence in which the occurrence is found and on the form of the word:

    pw(t_i) = pw(t_i, p_j) \cdot pw(t_i, s_k) \cdot pw(t_i, w_r)

where pw(t_i, p_j) is the score of the occurrence t_i within paragraph j, pw(t_i, s_k) is its score within sentence k, and pw(t_i, w_r) is its score as word form r. The total weight of a term t in a document d is then computed as the sum of all its position scores:

    PW(t, d) = \sum_{i=1}^{m} pw(t_i)

From this definition we can conclude that the algorithm relies on the structure of a document. We should also note that when the assumptions about the document structure are violated (for example, there are no clues for determining the type of sentences and paragraphs), the algorithm reduces to simply counting the frequency of a word.

E. Statistical Approach

1) Weighing of Individual Words: In addition to the frequency of a word, one can also use information about its distribution in the text. The authors of this approach suggest using such information to determine the significance of individual words. The metrics proposed in [5] are based on the phenomenon of attraction of keywords in a text. As an example, the authors give the distribution of distances to the nearest neighbour for the word "NATURE" in the book "The Origin of Species" by Charles Darwin (Fig. 1): the distribution of distances resembles an exponential distribution. For noise words this property is much weaker. Thus, keywords tend to form clusters in a text, unlike noise words.

    Fig. 1. Distribution of distances to the closest neighbor

2) σ-index: To study the spatial distribution of a given word in a text, we denote by t_i the absolute position in the text of the i-th occurrence of the word. We thus obtain a sequence of positions {t_0, t_1, ..., t_{n+1}}, assuming there are n + 2 occurrences. Next, we compute the average distance between two successive occurrences:

    \mu = \frac{1}{n+1} \sum_{i=0}^{n} (t_{i+1} - t_i)

The following step is to obtain the standard deviation:

    s = \sqrt{\frac{1}{n+1} \sum_{i=0}^{n} ((t_{i+1} - t_i) - \mu)^2}

To eliminate the dependence on the frequency of occurrence of different words, the authors normalize the token spacing and define the σ-index as

    \sigma = s / \mu

The standard deviation grows rapidly as the spacing t_{i+1} - t_i becomes more inhomogeneous. However, the value of σ can be strongly affected by a change in a single occurrence position, and high values of σ do not always imply cluster concentration.

3) Γ-index: To overcome these problems, the authors suggest considering the average separation around the occurrence at t_i, defined as

    d(t_i) = \frac{(t_{i+1} - t_i) + (t_i - t_{i-1})}{2} = \frac{t_{i+1} - t_{i-1}}{2}

For each position t_i the cluster index γ is computed:

    \gamma(t_i) = \frac{\mu - d(t_i)}{\mu} if d(t_i) < \mu, and 0 otherwise

Finally, the score of a word is obtained as the average of all its cluster indexes:

    \Gamma = \frac{1}{n} \sum_{i=1}^{n} \gamma(t_i)

The Γ-index is more stable than the σ-index, but it is more time-consuming to evaluate.
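As a concrete illustration of the two metrics, here is a small Python sketch that computes the σ-index and Γ-index directly from the formulas above. It assumes `positions` is the sorted list of absolute positions t_0 ... t_{n+1} of one word (so at least three occurrences are needed for the Γ-index); it is a sketch, not the code used in [5].

```python
def sigma_index(positions):
    gaps = [b - a for a, b in zip(positions, positions[1:])]    # t_{i+1} - t_i
    mu = sum(gaps) / len(gaps)                                  # average spacing
    s = (sum((g - mu) ** 2 for g in gaps) / len(gaps)) ** 0.5   # standard deviation
    return s / mu

def gamma_index(positions):
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mu = sum(gaps) / len(gaps)
    total = 0.0
    # interior occurrences t_1 .. t_n have both a left and a right neighbour
    for i in range(1, len(positions) - 1):
        d_i = (positions[i + 1] - positions[i - 1]) / 2         # average separation around t_i
        total += (mu - d_i) / mu if d_i < mu else 0.0
    return total / (len(positions) - 2)

# A clustered word scores noticeably higher than an evenly spread one:
print(gamma_index([10, 12, 15, 300, 302, 305]))    # two tight clusters -> high Γ
print(gamma_index([10, 110, 210, 310, 410, 510]))  # even spacing -> Γ = 0
```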
4) Extracting Key Phrases: The metrics described above make it possible to rank only individual words. In [4], several ways to generalize them to multi-word phrases are proposed.

First way: rank whole word sequences, computing the weight of a sequence in the same way as for a single word. The problem is that the frequency of the desired phrases may be insufficient for analysis.

Second way: make up all possible unique phrases of a given length and compute the score of a phrase as the sum of the scores of its individual words. The problem with this approach is that the number of phrases grows exponentially with the given length.

III. EXPERIMENTS

A. Data Description

As test data we used the book "Abel's Theorem in Problems and Solutions" by V. B. Alekseev. The text consists of 39519 words, 1576 of which are unique. The alphabetical index at the end of the book was used as the reference set for evaluating the retrieval capabilities of the different key phrase extractors. This test data was chosen, firstly, because of the large volume of the text, which allows us to give more accurate estimates of the efficiency of the algorithms, and secondly, because of the requirements the statistical approach places on the number of occurrences of the analysed words (for small texts this condition may not be fulfilled even for a single word).

B. Getting Evaluations

We measured the efficiency of the algorithms in terms of Recall and Precision. The result of each algorithm is a set of phrases. To obtain the Precision estimate, we must determine how many key phrases are in the result set: we assume that a phrase from the result set is a key phrase if it fully contains some key phrase from the reference set, regardless of word order. To obtain the Recall estimate, we must determine how many key phrases from the reference set are present in the result set: we assume that a phrase from the reference set is present if it is fully contained in some phrase of the result set, again without regard to word order. In addition, all words in all phrases of both sets were stemmed, which solves the problem of comparing different forms of a word.
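A possible implementation of this matching rule is sketched below in Python, using the Russian Snowball stemmer from nltk (the paper itself stems with SnowballStemmer). The function names and the containment test on stemmed word sets are my own illustration of the rule, not code from the paper.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")

def normalize(phrase):
    # stem every word and drop word order, as required by the matching rule
    return frozenset(stemmer.stem(w) for w in phrase.lower().split())

def precision_recall(result_phrases, reference_phrases):
    result = [normalize(p) for p in result_phrases]
    reference = [normalize(p) for p in reference_phrases]
    # a result phrase is correct if it fully contains some reference key phrase
    correct = sum(1 for r in result if any(ref <= r for ref in reference))
    # a reference key phrase is found if it is contained in some result phrase
    found = sum(1 for ref in reference if any(ref <= r for r in result))
    return correct / len(result), found / len(reference)
```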
C. RAKE

The algorithm requires a set of stop-words; we used a publicly available Russian list from the ranks.nl site. To avoid the problem of different word forms, we used SnowballStemmer from the nltk library for Python and stemmed every word of every phrase. For each stemmed phrase we kept the count of unstemmed phrases from which it could be obtained; these counts were used when computing the scores of individual words. The results are shown in Tab. 3.

    Tab. 3. Evaluations for RAKE without any modification
    Precision  31.8%
    Recall     74.2%

The result set contained 4320 phrases, and all of them were unique in terms of word forms. However, we faced a problem: the output contained long phrases, the longest consisting of 10 words, whereas we would like to have shorter ones. It was therefore decided to set a limit on the maximum phrase length. All phrases longer than the limit were split into shorter ones: from the words of a long phrase, all possible phrases of the given length were generated without taking word order into account. A limit on the maximum degree of each word was also introduced to support this change.

Next, we studied how the efficiency depends on the restriction on phrase length. As can be seen in Fig. 2, the highest Precision was reached when the maximal phrase length was limited to 4 or 5 words. Tab. 4 shows the exact evaluations for RAKE with these phrase length limitations.

    Fig. 2. The dependence of the efficiency on the restriction on the length of a phrase

    Tab. 4. RAKE evaluations with phrase length limitations
    Max. length  Precision  Recall  NWords
    4            40.8%      73.0%   9327
    5            41.5%      73.6%   7293

It makes sense to consider only these restrictions on length, since with them Precision reaches its highest values while Recall stays close to its maximum. It can be seen that the number of phrases extracted by the algorithm has increased dramatically; at the same time, the number of correctly extracted phrases also increased, because Precision grew.
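The splitting step can be illustrated with a short Python sketch: candidates longer than the limit are replaced by all their word combinations of exactly the limit length, with word order ignored. This is an assumed reading of the procedure described above, built on itertools.

```python
from itertools import combinations

def limit_phrase_length(candidates, max_len):
    """candidates: iterable of phrases, each a tuple of words."""
    limited = set()
    for phrase in candidates:
        if len(phrase) <= max_len:
            limited.add(tuple(sorted(phrase)))             # order is ignored
        else:
            # a long phrase is replaced by all max_len-word combinations of its words
            limited.update(combinations(sorted(phrase), max_len))
    return limited
```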
This way of generating candidates can, in theory, cause an exponential growth of their number. By the length of a phrase we mean the number of words it consists of. Let k be the restriction on the maximum phrase length and m the maximum length of a phrase produced by the partitioning step. By the definition of the algorithm, the phrases obtained after partitioning do not intersect in the text, i.e. the positions of any two different phrases do not overlap. Let N be the number of words in the text; then the number of such phrases is not greater than N/m. We assume that m > k. Since word order is not taken into account when generating new phrases, the number of new phrases per candidate can be computed as the number of combinations without repetitions:

    C_m^k = \frac{m!}{(m-k)! \, k!}

Thus, we can give the following estimate of the new number of phrases:

    \frac{N}{m} \cdot C_m^k

As we can see, the growth is proportional to the number of combinations. By the definition of the algorithm, the average length of a candidate phrase cannot exceed the average sentence length. An average sentence consists of 10-11 words[10], which means that the average value of m will not exceed this number. We can therefore conclude that for typical texts the increase in the number of candidates stays within reasonable limits.

By limiting the size of the result set we can improve the Precision of the algorithm, but discarding part of the results may decrease Recall. Fig. 3 shows the dependence of Precision and Recall on the number of selected candidates.

    Fig. 3. Comparison of estimates for RAKE with 4- and 5-word phrase length limitations. The graph shows the dependence on the number of words with the highest ranks extracted

D. Statistical Approach

The statistical approach provides a way to measure the significance of individual words; nevertheless, key phrases can consist of more than one word, so the question arises how to form the set of candidates. The solutions proposed in [4] showed good results, but they require a large amount of memory to store all candidates. For example, the number of all possible key phrases formed of two words (without taking order into account) equals the number of combinations. For the text used in this work, the number of unique words is around 1500, which gives around 10^6 two-word phrases and around 10^9 three-word phrases. In fact, there is no need to store all candidates, since we only need the first T of them with the maximal score.

Selection of candidates: We consider the case of selecting two-word phrases. The problem can be reduced to the following one. Given two sorted arrays A[1...n] and B[1...n], print all n^2 sums A[i] + B[j] of elements from different arrays in descending order. There is a solution that runs in O(n^2 log n) time and requires O(n) space.

Description of the solution: we use a max-heap that stores tuples consisting of a sum A[i] + B[j] and the pair of indexes (i, j) that defines it; elements in the heap are ordered by the sum. Initially the heap stores the pairs (A[0] + B[j], (0, j)) for all j = 0...n. At each step the top element, i.e. the one with the largest sum, is extracted from the heap; if its indexes are (i, j), the extracted sum is printed and the new sum A[i+1] + B[j] is pushed into the heap with the corresponding pair of indexes. When the elements of an array are exhausted, nothing is pushed, and at the end the remaining content of the heap is output. The claimed time and space bounds follow directly from this description.
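A compact Python version of this enumeration is given below; heapq is a min-heap, so sums are negated to simulate a max-heap. The arrays are assumed to have equal length and to be sorted in descending order; in the key phrase setting they would hold the Γ-index scores of the top-ranked words.

```python
import heapq

def sums_descending(a, b):
    """Yield all n*n sums a[i] + b[j] in descending order.

    a and b are assumed to have equal length n and to be sorted in
    descending order; memory use stays O(n).
    """
    n = len(a)
    # one entry per column j: the largest remaining sum of that column
    heap = [(-(a[0] + b[j]), 0, j) for j in range(n)]
    heapq.heapify(heap)
    while heap:
        neg_sum, i, j = heapq.heappop(heap)    # current global maximum
        yield -neg_sum
        if i + 1 < n:                          # next element of column j replaces it
            heapq.heappush(heap, (-(a[i + 1] + b[j]), i + 1, j))

print(list(sums_descending([9, 4, 1], [7, 5, 2])))
# [16, 14, 11, 11, 9, 8, 6, 6, 3]
```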
Correctness of the solution: to prove correctness it is convenient to consider the square matrix of all possible sums, C[i, j] = A[i] + B[j]. Since the arrays A and B are sorted, the elements of the matrix decrease from left to right and from top to bottom, i.e. for all i, j: C[i, j] ≥ C[i+1, j] and C[i, j] ≥ C[i, j+1]. At each step the following invariant is maintained: the heap stores the maximal not-yet-printed element of each column. Since the next element that should appear in the output is the maximum of some column, this proves correctness.

Efficiency of the algorithm: before calculating scores, each word was stemmed with SnowballStemmer from nltk version 3.2.2, so words that differ only in form were merged into equivalence classes, and scores were then calculated for these classes. We used the Γ-index in the experiments. Also, at the stage of selecting words for weighing, a filter on word length was applied: words shorter than a specified threshold did not participate in the weighing or in the construction of candidates. In the experiments with filtering, the threshold for the word length was set to 3. Experiments were carried out for phrases of two and three words.

Fig. 4 and Fig. 5 show that the use of the filter has almost no impact on the Recall and Precision of the algorithm.

    Fig. 4. Dependence of estimates for the statistical approach on the number of words with the highest ranks extracted when the phrase length is limited to two words

    Fig. 5. Dependence of estimates for the statistical approach on the number of words with the highest ranks extracted when the phrase length is limited to three words

The considered algorithm shows high Precision, which is due to the way candidates are constructed. The Γ-index accurately ranks individual words, so the top-scored words are most likely to be actual keywords or to be contained in multi-word key phrases. Because of the way candidates are constructed and because of the definition of the efficiency metrics, top-scored candidates are most likely to be counted as properly extracted key phrases. Based on the results of the experiments, it can be concluded that the Γ-index metric ranks individual words correctly. Nevertheless, artificially constructed phrases cannot ensure high Recall of the algorithm.

E. Hybrid Approach

The advantage of the statistical approach is the high Precision of the ranking of single words; its main drawbacks are the complexity of constructing candidates and the inability to ensure high Recall with those candidates. On the other hand, the RAKE algorithm produces phrases that provide relatively high Recall, but its ranking formula does not achieve high Precision when the number of candidates is limited. This suggests the idea of using the Γ-index for ranking individual words while taking as candidates the phrases obtained from the RAKE algorithm. Fig. 6 and Fig. 7 show a comparison of the corresponding methods.

    Fig. 6. Comparison of the efficiency of the RAKE and Hybrid algorithms when the length of phrases is limited to 4 words. The graph shows the dependence of the estimates on the number of words with the highest ranks extracted

    Fig. 7. Comparison of the efficiency of the RAKE and Hybrid algorithms when the length of phrases is limited to 5 words. The graph shows the dependence of the estimates on the number of words with the highest ranks extracted
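The hybrid method compared in these figures can be sketched in a few lines of Python on top of the earlier fragments: candidates come from the RAKE-style partitioning, while each phrase is scored by the sum of the Γ-indexes of its stemmed member words. The helper names (`gamma_index`, the candidate list, the position map) refer to the illustrative sketches above, not to the authors' code.

```python
def hybrid_scores(candidates, word_positions, min_occurrences=3):
    """candidates: tuples of stemmed words (e.g. from the RAKE sketch);
    word_positions: stem -> sorted list of its occurrence positions in the text."""
    gamma = {w: gamma_index(pos)
             for w, pos in word_positions.items()
             if len(pos) >= min_occurrences}           # Γ-index needs enough occurrences
    scored = {p: sum(gamma.get(w, 0.0) for w in p) for p in set(candidates)}
    # phrases whose words form clusters in the text rise to the top
    return sorted(scored.items(), key=lambda kv: -kv[1])
```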
IV. CONCLUSION

This work has shown that, with certain additions, the RAKE algorithm and the statistical approach can be used to extract key phrases from Russian texts. In addition, the paper proposed a new Hybrid method that uses the Γ-index for weighing the phrases obtained from the RAKE algorithm, and it was shown that this algorithm achieves higher Precision and Recall compared to the other algorithms considered in the paper.

V. REFERENCES

1. A.A. Barsegyan, M.S. Kupriyanov, I.I. Kholod et al. Analysis of Data and Processes. BHV-Petersburg, 2009, 510 p.
2. Michael Berry, Jacob Kogan. Text Mining: Applications and Theory. John Wiley and Sons, Ltd., 2010, 205 p.
3. Rada Mihalcea, Paul Tarau. TextRank: Bringing Order into Texts // Proceedings of EMNLP 2004 (ed. Lin D. and Wu D.), pp. 404-411.
4. Siddiqi S., Sharan A. Keyword and keyphrase extraction from single Hindi document using statistical approach // 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), 2015, pp. 713-718.
5. Juan P. Herrera, Pedro A. Pury. Statistical Keyword Detection in Literary Corpora // The European Physical Journal B, 2008, pp. 135-146.
6. Xinghua Hu, Bin Wu. Automatic Keyword Extraction Using Linguistic Features // Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), 2006, pp. 19-23.
7. Zhengyang Liu, Jianyi Liu, Wenbin Yao, Cong Wang. Keyword Extraction Using PageRank on Synonym Networks // 2010 International Conference on E-Product E-Service and E-Entertainment, 2010, pp. 1-4.
8. Jones S., Paynter G. Automatic extraction of document keyphrases for use in digital libraries: evaluation and applications // Journal of the American Society for Information Science and Technology, 2002.
9. Gutwin C., Paynter G., Witten I., Nevill-Manning C., Frank E. Improving browsing in digital libraries with keyphrase indexes // Decision Support Systems 27(1-2), 1999, pp. 81-104.
10. S.A. Sharov. Frequency dictionary [Online]. Available: http://www.artint.ru/projects/frqlist.php [Accessed: 20.05.2017].