=Paper=
{{Paper
|id=Vol-2135/SEIM_2018_paper_30
|storemode=property
|title=Keyword extraction from single Russian document
|pdfUrl=https://ceur-ws.org/Vol-2135/SEIM_2018_paper_30.pdf
|volume=Vol-2135
|authors=Mikhail Sandul,Elena Mikhailova
}}
==Keyword extraction from single Russian document==
Keyword Extraction from Single Russian Document

Mikhail Vadimovich Sandul, ITMO University, St. Petersburg, Russia, sandulmv@yandex.ru
Elena Georgievna Mikhailova, Saint Petersburg State University and ITMO University, St. Petersburg, Russia, e.mikhaylova@spbu.ru

Abstract. The problem of automatic keyword and key phrase extraction from a text arises in many information retrieval and text mining tasks. The task is to identify the terms that best describe the subject of a document. There is currently a lot of research on this problem, but algorithms are mostly developed for texts in English, and the possibility of applying them to Russian texts has not been sufficiently investigated. One of the best-known keyword extraction algorithms is RAKE. This article examines the effectiveness of the RAKE algorithm on Russian texts. It also applies a hybrid method that uses the Γ-index metric to weigh the phrases produced by RAKE. The article shows that this hybrid algorithm is more accurate than RAKE while reducing the number of selected phrases.

I. INTRODUCTION

Organizing data into a database requires spending effort on database design, data organization, and so on. These expenses are not necessarily related to the analysis of the data itself, so we spend additional time and effort on preprocessing before we can extract new knowledge from the data. Worse still, organizing some types of data into a database can lead to a loss of content: it is usually impossible to convert text documents into a table (in the case of a relational DB) without losing part of their meaning, so texts are typically stored unchanged as BLOB fields. As a result, databases by themselves are not an efficient tool for analysing texts.

As we can see, it is hard to bring structure into a text for analysis; nevertheless, we would like to extract useful information from it anyway.
The branch of science that deals with such problems is called Text Mining[1]. Nowadays many real-world problems are studied within Text Mining, from typical Data Mining problems such as classification and clustering to problems specific to Text Mining, such as automatic keyword extraction and annotation[1].

The amount of unstructured text information grows constantly. Keywords help to extract new knowledge from it, because they let us understand the meaning of a text without reading it. Automatic keyword extraction is one of the subjects studied in Text Mining.

Keywords can be applied to improve the functionality of information retrieval systems. For example, the Phrasier[8] system uses keywords to find documents related to the primary one, and the keywords themselves serve as links between documents, which lets a user navigate between documents much faster. Another example is Keyphind[9], a search engine for digital libraries, where keywords are the main source for building the index; they are also used to enrich the presentation of search results.

Despite the wide range of applications, most documents do not have assigned keywords. Most approaches are focused on manual assignment of keywords. Usually this procedure is done by specialists in the relevant field, who may use a fixed taxonomy or rely on the author's judgment to provide a representative list[2]. The main goal of further research is the automation of this process.

Early approaches to automatic keyword extraction focused on corpus-oriented statistics of individual words. However, they have some flaws. For instance, a word may be ranked as a keyword for the whole corpus while not being a keyword of any particular document, and vice versa[2]. Also, such methods cannot help us find keywords consisting of two or more words. To avoid these drawbacks, we focus on approaches that operate on a single document.

We define a keyword as a sequence of one or more words that provides a representation of the document's content. An algorithm should return a list of such sequences. Ideally, this list represents the essential content of the document in condensed form; however, there is no need for these sequences to form coherent text. To avoid ambiguity, in the rest of the article we call such sequences key phrases, to highlight that they may contain more than one word, and use the term "keyword" only to emphasize that we are dealing with a single-word sequence.

To evaluate the efficiency of an algorithm we use the Precision and Recall metrics. Recall is the fraction of correctly extracted key phrases among all key phrases of a document:

    Recall = (correctly extracted) / (all key phrases in a text)

Precision is the fraction of correctly extracted key phrases among all extracted phrases:

    Precision = (correctly extracted) / (all extracted phrases)

Different approaches to automatic keyword extraction exist: supervised and unsupervised machine learning algorithms, statistical approaches, and approaches based on linguistic features. All of them can be used to solve our problem. These methods can be divided into groups depending on whether they are domain-dependent or domain-independent, corpus-oriented or designed for single documents, and whether they require a training set. In this work we consider domain-independent methods that do not require a training set.

II. RELATED WORK AND BACKGROUND

A. TextRank

TextRank[3] is a flexible graph-based method that can be used not only for key phrase extraction but also for creating an annotation of a text.

Representing data as a graph is one of the approaches to retrieving information. The basic idea implemented by a graph-based ranking model is that of "voting" or "recommendation". When an edge goes from one vertex to another, it casts a vote for that other vertex. The more votes a vertex gets, the more important it becomes. Moreover, the importance of a vertex determines the importance of its votes, and this information is also taken into account by the ranking model.

Formally, let G = (V, E) be a directed graph with the set of vertexes V and the set of edges E, where E is a subset of V × V. Let In(V_i) be the set of vertexes with an edge leading to V_i, and Out(V_i) the set of vertexes to which an edge leads from V_i. The score of a vertex V_i is defined as follows:

    S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}

where d ∈ [0; 1] is a damping factor, which integrates into the model the probability of jumping from the given vertex to another random vertex of the graph. The algorithm starts with arbitrary values of S(V_i) for each node; the computation iterates until convergence below a given threshold is achieved.

To apply this method to a text, we first need to introduce rules for building a graph: we have to define the units that will be treated as nodes and the relations between them. The following steps can be distinguished:

1) Identify text units that best define the task at hand, and add them as vertexes of the graph.
2) Identify relations that connect such text units, and add them as edges between the respective vertexes.
3) Iterate the graph-based ranking algorithm until it converges.
4) Sort vertexes by their score and use this score for selection decisions.

Such a definition allows the method to be applied to a wide range of problems; in particular, we will show how it is applied to key phrase extraction.

First, we have to decide what to use as units when building the graph. The authors suggested nouns and adjectives for this role: they tried different part-of-speech filters, and this one showed the highest performance. Two lexical units are related if they co-occur within a window of at most N words, where N can be set anywhere from 2 to 10. As a result, the authors constructed an unweighted, undirected graph. The initial score of each vertex was set to 1, and the algorithm described above was run. Once a final score was obtained for each vertex, the first T vertexes with the highest scores were selected. T may be set to any fixed value or may depend on various factors, for instance on the text size. Finally, sequences of adjacent keywords are collapsed into key phrases.

The dataset used in the experiments is a collection of 500 abstracts from the Inspec database with the corresponding manually assigned key phrases. Results are shown in Tab. 1.

    Tab. 1. TextRank evaluations
    Precision  32.1%
    Recall     36.2%

It should be noted that, by the definition of the algorithm, the number of vertexes in the graph depends on the number of distinct words in the text, so for large texts the algorithm can take a long time to run.
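To make the iteration above concrete, here is a minimal Python sketch of the TextRank scoring loop. It assumes the co-occurrence graph has already been built (with the part-of-speech filter and window size chosen as described) and is given as an adjacency mapping; since that graph is undirected, In(V) and Out(V) coincide. This is an illustration of the formula, not the authors' implementation.

```python
def textrank_scores(graph, d=0.85, eps=1e-6, max_iter=100):
    """graph: dict mapping each vertex to the set of its neighbours."""
    scores = {v: 1.0 for v in graph}           # initial score 1 for every vertex
    for _ in range(max_iter):
        new_scores = {}
        for v in graph:
            # each neighbour votes with its score divided by its own degree
            rank = sum(scores[u] / len(graph[u]) for u in graph[v] if graph[u])
            new_scores[v] = (1 - d) + d * rank
        converged = max(abs(new_scores[v] - scores[v]) for v in graph) < eps
        scores = new_scores
        if converged:
            break
    return scores

# Toy example: words co-occurring within the chosen window share an edge.
graph = {
    "linear": {"equations", "systems"},
    "systems": {"linear", "equations"},
    "equations": {"linear", "systems", "differential"},
    "differential": {"equations"},
}
top = sorted(textrank_scores(graph).items(), key=lambda p: -p[1])
print(top)  # the first T vertexes would be kept as keyword candidates
```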
B. PageRank on Synonym Networks

The problem of graph growth can be partially solved by merging words into groups, which can also improve the efficiency of the algorithm. One way to form groups is to define equivalence classes, and the authors suggest merging words by their meaning. PageRank on Synonym Networks[7] implements this idea; it requires a thesaurus. Each equivalence class is represented by a unique number, every word in the text is replaced by the number of its class, and the algorithm described in the previous paragraph is then applied to the modified text.

C. RAKE

RAKE[2] (Rapid Automatic Keyword Extraction) is a relatively effective algorithm that operates on individual documents, is easily applied to new domains, and does not depend on the structure of a document or specific grammar rules. RAKE is based on the observation that key phrases frequently contain multiple words but rarely contain standard punctuation or stop-words. By stop-words we denote function words like "and", "of", "the", and other words with minimal lexical meaning. It is worth mentioning that such words are also ignored when building an index in information retrieval systems. The algorithm requires a list of such words.

In addition to a stop-word list, sets of phrase delimiters and word delimiters are needed. RAKE uses stop-words and phrase delimiters to partition a document into a set of candidate key phrases; word delimiters are used to split candidates into individual words.

RAKE begins key phrase extraction by partitioning the text into candidate key phrases. Next, the set of unique words is obtained by splitting the candidates into individual words with word delimiters. After this, a score is calculated for each word. The authors propose metrics based on word frequency and word degree. Word frequency, freq(w), is the number of occurrences of the word in candidates. Word degree, deg(w), is the total length of all candidates that contain the word. The score of a word is defined as

    s(w) = \frac{deg(w)}{freq(w)}

Then a score is calculated for each candidate key phrase, defined as the sum of the scores of its member words.

The authors evaluated the efficiency of the algorithm on the Inspec set, the same collection of 500 abstracts with manually assigned key phrases that was used to test TextRank. They also compared the performance of the algorithm with different stop-word lists. The results are shown in Tab. 2.

    Tab. 2. RAKE evaluations
    Stop-word list       Precision  Recall
    KA stop-word list    33.7%      41.5%
    Fox stop-word list   26.0%      42.2%

As we can see, the difference in Precision between the generated stop-word list based on keyword adjacency (KA stoplist) and Fox's stop-word list is quite noticeable. This means that RAKE strongly depends on the provided set of stop-words.
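The following is a simplified Python sketch of the RAKE steps just described: partitioning into candidates, computing freq(w) and deg(w), and scoring phrases. The regular expression and the tiny stop-word set are illustrative stand-ins, not the delimiters or stop list used in the paper.

```python
import re
from collections import defaultdict

STOPWORDS = {"и", "в", "на", "с", "для", "что"}   # stand-in; a real Russian list is far larger

def rake(text, stopwords=STOPWORDS):
    # 1. Partition the text into candidate key phrases: stop-words and
    #    punctuation act as phrase delimiters, whitespace splits words.
    tokens = re.split(r"[^\w-]+", text.lower())
    candidates, current = [], []
    for tok in tokens:
        if not tok or tok in stopwords:           # delimiter: close the current candidate
            if current:
                candidates.append(tuple(current))
                current = []
        else:
            current.append(tok)
    if current:
        candidates.append(tuple(current))

    # 2. Compute freq(w) and deg(w) over all candidate phrases.
    freq, deg = defaultdict(int), defaultdict(int)
    for phrase in candidates:
        for w in phrase:
            freq[w] += 1
            deg[w] += len(phrase)

    # 3. Score each unique candidate as the sum of deg(w)/freq(w) over its words.
    scored = {p: sum(deg[w] / freq[w] for w in p) for p in set(candidates)}
    return sorted(scored.items(), key=lambda kv: -kv[1])
```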
D. Position Weighing

Position Weighing[6] is based on the fact that a word's position plays an important role in linguistics: words in different positions carry different amounts of information. For example, a word that appears in an introduction or in a conclusion usually carries more information. The same observation applies to words within a paragraph: words in the leading and summarizing sentences are often more important than those in other positions of the same paragraph.

The authors propose an approach called Position Weight. For each word occurrence they compute a score that depends on the paragraph and the sentence in which the occurrence is found and on the form of the word:

    pw(t_i) = pw(t_i, p_j) \cdot pw(t_i, s_k) \cdot pw(t_i, w_r)

where pw(t_i, p_j) is the score of the occurrence t_i within paragraph j, pw(t_i, s_k) is its score within sentence k, and pw(t_i, w_r) is its score as word form r. The total weight of a term t in a document d is then computed as the sum of all its position scores:

    PW(t, d) = \sum_{i=1}^{m} pw(t_i)

From this definition we can conclude that the algorithm relies on the structure of a document. We should also note that when the assumptions about the document structure are violated (for example, there are no clues for determining the type of sentences and paragraphs), the algorithm reduces to simply counting the frequency of a word.

E. Statistical Approach

1) Weighing of Individual Words: In addition to the frequency of a word, one can also use information about its distribution in the text. The authors of this approach suggest using such information to determine the significance of individual words. The metrics proposed in [5] are based on the phenomenon of attraction of keywords in a text. As an example, the authors give the distribution of distances to the nearest neighbour for the word "NATURE" in the book "The Origin of Species" by Charles Darwin (Fig. 1): the distribution of distances resembles an exponential distribution. For noise words this property is much weaker. Thus, keywords tend to form clusters in a text, unlike noise words.

    Fig. 1. Distribution of distances to the closest neighbor

2) σ-index: To study the spatial distribution of a given word in a text, we denote by t_i the absolute position in the text of the i-th occurrence of the word. We thus obtain a sequence of positions {t_0, t_1, ..., t_{n+1}}, assuming there are n + 2 occurrences. Next, we compute the average distance between two successive occurrences:

    \mu = \frac{1}{n+1} \sum_{i=0}^{n} (t_{i+1} - t_i)

The following step is to obtain the standard deviation:

    s = \sqrt{\frac{1}{n+1} \sum_{i=0}^{n} ((t_{i+1} - t_i) - \mu)^2}

To eliminate the dependence on the frequency of occurrence of different words, the authors normalize the token spacing and define the σ-index as

    \sigma = s / \mu

The standard deviation grows rapidly as the spacing t_{i+1} - t_i becomes more inhomogeneous. However, the value of σ can be strongly affected by a change in a single occurrence position, and high values of σ do not always imply cluster concentration.

3) Γ-index: To overcome these problems, the authors suggest considering the average separation around the occurrence at t_i, defined as

    d(t_i) = \frac{(t_{i+1} - t_i) + (t_i - t_{i-1})}{2} = \frac{t_{i+1} - t_{i-1}}{2}

For each position t_i the cluster index γ is computed:

    \gamma(t_i) = \frac{\mu - d(t_i)}{\mu} if d(t_i) < \mu, and 0 otherwise

Finally, the score of a word is obtained as the average of all its cluster indexes:

    \Gamma = \frac{1}{n} \sum_{i=1}^{n} \gamma(t_i)

The Γ-index is more stable than the σ-index, but it is more time-consuming to evaluate.
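As a concrete illustration of the two metrics, here is a small Python sketch that computes the σ-index and Γ-index directly from the formulas above. It assumes `positions` is the sorted list of absolute positions t_0 ... t_{n+1} of one word (so at least three occurrences are needed for the Γ-index); it is a sketch, not the code used in [5].

```python
def sigma_index(positions):
    gaps = [b - a for a, b in zip(positions, positions[1:])]    # t_{i+1} - t_i
    mu = sum(gaps) / len(gaps)                                  # average spacing
    s = (sum((g - mu) ** 2 for g in gaps) / len(gaps)) ** 0.5   # standard deviation
    return s / mu

def gamma_index(positions):
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mu = sum(gaps) / len(gaps)
    total = 0.0
    # interior occurrences t_1 .. t_n have both a left and a right neighbour
    for i in range(1, len(positions) - 1):
        d_i = (positions[i + 1] - positions[i - 1]) / 2         # average separation around t_i
        total += (mu - d_i) / mu if d_i < mu else 0.0
    return total / (len(positions) - 2)

# A clustered word scores noticeably higher than an evenly spread one:
print(gamma_index([10, 12, 15, 300, 302, 305]))    # two tight clusters -> high Γ
print(gamma_index([10, 110, 210, 310, 410, 510]))  # even spacing -> Γ = 0
```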
4) Extracting Key Phrases: The metrics described above make it possible to rank only individual words. In [4], several ways to generalize them to multi-word phrases are proposed.

First way: rank whole word sequences, computing the weight of a sequence in the same way as for a single word. The problem is that the frequency of the desired phrases may be insufficient for analysis.

Second way: make up all possible unique phrases of a given length and compute the score of a phrase as the sum of the scores of its individual words. The problem with this approach is that the number of phrases grows exponentially with the given length.

III. EXPERIMENTS

A. Data Description

As test data we used the book "Abel's Theorem in Problems and Solutions" by V. B. Alekseev. The text consists of 39519 words, 1576 of which are unique. The alphabetical index at the end of the book was used as the reference set for evaluating the retrieval capabilities of the different key phrase extractors. This test data was chosen, firstly, because of the large volume of the text, which allows us to give more accurate estimates of the efficiency of the algorithms, and secondly, because of the requirements the statistical approach places on the number of occurrences of the analysed words (for small texts this condition may not be fulfilled even for a single word).

B. Getting Evaluations

We measured the efficiency of the algorithms in terms of Recall and Precision. The result of each algorithm is a set of phrases. To obtain the Precision estimate, we must determine how many key phrases are in the result set: we assume that a phrase from the result set is a key phrase if it fully contains some key phrase from the reference set, regardless of word order. To obtain the Recall estimate, we must determine how many key phrases from the reference set are present in the result set: we assume that a phrase from the reference set is present if it is fully contained in some phrase of the result set, again without regard to word order. In addition, all words in all phrases of both sets were stemmed, which solves the problem of comparing different forms of a word.
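A possible implementation of this matching rule is sketched below in Python, using the Russian Snowball stemmer from nltk (the paper itself stems with SnowballStemmer). The function names and the containment test on stemmed word sets are my own illustration of the rule, not code from the paper.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")

def normalize(phrase):
    # stem every word and drop word order, as required by the matching rule
    return frozenset(stemmer.stem(w) for w in phrase.lower().split())

def precision_recall(result_phrases, reference_phrases):
    result = [normalize(p) for p in result_phrases]
    reference = [normalize(p) for p in reference_phrases]
    # a result phrase is correct if it fully contains some reference key phrase
    correct = sum(1 for r in result if any(ref <= r for ref in reference))
    # a reference key phrase is found if it is contained in some result phrase
    found = sum(1 for ref in reference if any(ref <= r for r in result))
    return correct / len(result), found / len(reference)
```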
C. RAKE

The algorithm requires a set of stop-words; we used a publicly available Russian list from the ranks.nl site. To avoid the problem of different word forms, we used SnowballStemmer from the nltk library for Python and stemmed every word of every phrase. For each stemmed phrase we kept the count of unstemmed phrases from which it could be obtained; these counts were used when computing the scores of individual words. The results are shown in Tab. 3.

    Tab. 3. Evaluations for RAKE without any modification
    Precision  31.8%
    Recall     74.2%

The result set contained 4320 phrases, and all of them were unique in terms of word forms. However, we faced a problem: the output contained long phrases, the longest consisting of 10 words, whereas we would like to have shorter ones. It was therefore decided to set a limit on the maximum phrase length. All phrases longer than the limit were split into shorter ones: from the words of a long phrase, all possible phrases of the given length were generated without taking word order into account. A limit on the maximum degree of each word was also introduced to support this change.

Next, we studied how the efficiency depends on the restriction on phrase length. As can be seen in Fig. 2, the highest Precision was reached when the maximal phrase length was limited to 4 or 5 words. Tab. 4 shows the exact evaluations for RAKE with these phrase length limitations.

    Fig. 2. The dependence of the efficiency on the restriction on the length of a phrase

    Tab. 4. RAKE evaluations with phrase length limitations
    Max. length  Precision  Recall  NWords
    4            40.8%      73.0%   9327
    5            41.5%      73.6%   7293

It makes sense to consider only these restrictions on length, since with them Precision reaches its highest values while Recall stays close to its maximum. It can be seen that the number of phrases extracted by the algorithm has increased dramatically; at the same time, the number of correctly extracted phrases also increased, because Precision grew.
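The splitting step can be illustrated with a short Python sketch: candidates longer than the limit are replaced by all their word combinations of exactly the limit length, with word order ignored. This is an assumed reading of the procedure described above, built on itertools.

```python
from itertools import combinations

def limit_phrase_length(candidates, max_len):
    """candidates: iterable of phrases, each a tuple of words."""
    limited = set()
    for phrase in candidates:
        if len(phrase) <= max_len:
            limited.add(tuple(sorted(phrase)))             # order is ignored
        else:
            # a long phrase is replaced by all max_len-word combinations of its words
            limited.update(combinations(sorted(phrase), max_len))
    return limited
```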
This way of generating candidates can, in theory, cause an exponential growth of their number. By the length of a phrase we mean the number of words it consists of. Let k be the restriction on the maximum phrase length and m the maximum length of a phrase produced by the partitioning step. By the definition of the algorithm, the phrases obtained after partitioning do not intersect in the text, i.e. the positions of any two different phrases do not overlap. Let N be the number of words in the text; then the number of such phrases is not greater than N/m. We assume that m > k. Since word order is not taken into account when generating new phrases, the number of new phrases per candidate can be computed as the number of combinations without repetitions:

    C_m^k = \frac{m!}{(m-k)! \, k!}

Thus, we can give the following estimate of the new number of phrases:

    \frac{N}{m} \cdot C_m^k

As we can see, the growth is proportional to the number of combinations. By the definition of the algorithm, the average length of a candidate phrase cannot exceed the average sentence length. An average sentence consists of 10-11 words[10], which means that the average value of m will not exceed this number. We can therefore conclude that for typical texts the increase in the number of candidates stays within reasonable limits.

By limiting the size of the result set we can improve the Precision of the algorithm, but discarding part of the results may decrease Recall. Fig. 3 shows the dependence of Precision and Recall on the number of selected candidates.

    Fig. 3. Comparison of estimates for RAKE with 4- and 5-word phrase length limitations. The graph shows the dependence on the number of words with the highest ranks extracted

D. Statistical Approach

The statistical approach provides a way to measure the significance of individual words; nevertheless, key phrases can consist of more than one word, so the question arises how to form the set of candidates. The solutions proposed in [4] showed good results, but they require a large amount of memory to store all candidates. For example, the number of all possible key phrases formed of two words (without taking order into account) equals the number of combinations. For the text used in this work, the number of unique words is around 1500, which gives around 10^6 two-word phrases and around 10^9 three-word phrases. In fact, there is no need to store all candidates, since we only need the first T of them with the maximal score.

Selection of candidates: We consider the case of selecting two-word phrases. The problem can be reduced to the following one. Given two sorted arrays A[1...n] and B[1...n], print all n^2 sums A[i] + B[j] of elements from different arrays in descending order. There is a solution that runs in O(n^2 log n) time and requires O(n) space.

Description of the solution: we use a max-heap that stores tuples consisting of a sum A[i] + B[j] and the pair of indexes (i, j) that defines it; elements in the heap are ordered by the sum. Initially the heap stores the pairs (A[0] + B[j], (0, j)) for all j = 0...n. At each step the top element, i.e. the one with the largest sum, is extracted from the heap; if its indexes are (i, j), the extracted sum is printed and the new sum A[i+1] + B[j] is pushed into the heap with the corresponding pair of indexes. When the elements of an array are exhausted, nothing is pushed, and at the end the remaining content of the heap is output. The claimed time and space bounds follow directly from this description.
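A compact Python version of this enumeration is given below; heapq is a min-heap, so sums are negated to simulate a max-heap. The arrays are assumed to have equal length and to be sorted in descending order; in the key phrase setting they would hold the Γ-index scores of the top-ranked words.

```python
import heapq

def sums_descending(a, b):
    """Yield all n*n sums a[i] + b[j] in descending order.

    a and b are assumed to have equal length n and to be sorted in
    descending order; memory use stays O(n).
    """
    n = len(a)
    # one entry per column j: the largest remaining sum of that column
    heap = [(-(a[0] + b[j]), 0, j) for j in range(n)]
    heapq.heapify(heap)
    while heap:
        neg_sum, i, j = heapq.heappop(heap)    # current global maximum
        yield -neg_sum
        if i + 1 < n:                          # next element of column j replaces it
            heapq.heappush(heap, (-(a[i + 1] + b[j]), i + 1, j))

print(list(sums_descending([9, 4, 1], [7, 5, 2])))
# [16, 14, 11, 11, 9, 8, 6, 6, 3]
```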
Correctness of the solution: to prove correctness it is convenient to consider the square matrix of all possible sums, C[i, j] = A[i] + B[j]. Since the arrays A and B are sorted, the elements of the matrix decrease from left to right and from top to bottom, i.e. for all i, j: C[i, j] ≥ C[i+1, j] and C[i, j] ≥ C[i, j+1]. At each step the following invariant is maintained: the heap stores the maximal not-yet-printed element of each column. Since the next element that should appear in the output is the maximum of some column, this proves correctness.

Efficiency of the algorithm: before calculating scores, each word was stemmed with SnowballStemmer from nltk version 3.2.2, so words that differ only in form were merged into equivalence classes, and scores were then calculated for these classes. We used the Γ-index in the experiments. Also, at the stage of selecting words for weighing, a filter on word length was applied: words shorter than a specified threshold did not participate in the weighing or in the construction of candidates. In the experiments with filtering, the threshold for the word length was set to 3. Experiments were carried out for phrases of two and three words.

Fig. 4 and Fig. 5 show that the use of the filter has almost no impact on the Recall and Precision of the algorithm.

    Fig. 4. Dependence of estimates for the statistical approach on the number of words with the highest ranks extracted when the phrase length is limited to two words

    Fig. 5. Dependence of estimates for the statistical approach on the number of words with the highest ranks extracted when the phrase length is limited to three words

The considered algorithm shows high Precision, which is due to the way candidates are constructed. The Γ-index accurately ranks individual words, so the top-scored words are most likely to be actual keywords or to be contained in multi-word key phrases. Because of the way candidates are constructed and because of the definition of the efficiency metrics, top-scored candidates are most likely to be counted as properly extracted key phrases. Based on the results of the experiments, it can be concluded that the Γ-index metric ranks individual words correctly. Nevertheless, artificially constructed phrases cannot ensure high Recall of the algorithm.

E. Hybrid Approach

The advantage of the statistical approach is the high Precision of the ranking of single words; its main drawbacks are the complexity of constructing candidates and the inability to ensure high Recall with those candidates. On the other hand, the RAKE algorithm produces phrases that provide relatively high Recall, but its ranking formula does not achieve high Precision when the number of candidates is limited. This suggests the idea of using the Γ-index for ranking individual words while taking as candidates the phrases obtained from the RAKE algorithm. Fig. 6 and Fig. 7 show a comparison of the corresponding methods.

    Fig. 6. Comparison of the efficiency of the RAKE and Hybrid algorithms when the length of phrases is limited to 4 words. The graph shows the dependence of the estimates on the number of words with the highest ranks extracted

    Fig. 7. Comparison of the efficiency of the RAKE and Hybrid algorithms when the length of phrases is limited to 5 words. The graph shows the dependence of the estimates on the number of words with the highest ranks extracted
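The hybrid method compared in these figures can be sketched in a few lines of Python on top of the earlier fragments: candidates come from the RAKE-style partitioning, while each phrase is scored by the sum of the Γ-indexes of its stemmed member words. The helper names (`gamma_index`, the candidate list, the position map) refer to the illustrative sketches above, not to the authors' code.

```python
def hybrid_scores(candidates, word_positions, min_occurrences=3):
    """candidates: tuples of stemmed words (e.g. from the RAKE sketch);
    word_positions: stem -> sorted list of its occurrence positions in the text."""
    gamma = {w: gamma_index(pos)
             for w, pos in word_positions.items()
             if len(pos) >= min_occurrences}           # Γ-index needs enough occurrences
    scored = {p: sum(gamma.get(w, 0.0) for w in p) for p in set(candidates)}
    # phrases whose words form clusters in the text rise to the top
    return sorted(scored.items(), key=lambda kv: -kv[1])
```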
IV. CONCLUSION

This work has shown that, with certain additions, the RAKE algorithm and the statistical approach can be used to extract key phrases from Russian texts. In addition, the paper proposed a new Hybrid method that uses the Γ-index for weighing the phrases obtained from the RAKE algorithm, and it was shown that this algorithm achieves higher Precision and Recall compared to the other algorithms considered in the paper.

V. REFERENCES

1. A.A. Barsegyan, M.S. Kupriyanov, I.I. Kholod et al. Analysis of Data and Processes. BHV-Petersburg, 2009, 510 p.
2. Michael Berry, Jacob Kogan. Text Mining: Applications and Theory. John Wiley and Sons, Ltd., 2010, 205 p.
3. Rada Mihalcea, Paul Tarau. TextRank: Bringing Order into Texts // Proceedings of EMNLP 2004 (ed. Lin D. and Wu D.), pp. 404-411.
4. Siddiqi S., Sharan A. Keyword and keyphrase extraction from single Hindi document using statistical approach // 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), 2015, pp. 713-718.
5. Juan P. Herrera, Pedro A. Pury. Statistical Keyword Detection in Literary Corpora // The European Physical Journal B, 2008, pp. 135-146.
6. Xinghua Hu, Bin Wu. Automatic Keyword Extraction Using Linguistic Features // Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), 2006, pp. 19-23.
7. Zhengyang Liu, Jianyi Liu, Wenbin Yao, Cong Wang. Keyword Extraction Using PageRank on Synonym Networks // 2010 International Conference on E-Product E-Service and E-Entertainment, 2010, pp. 1-4.
8. Jones S., Paynter G. Automatic extraction of document keyphrases for use in digital libraries: evaluation and applications // Journal of the American Society for Information Science and Technology, 2002.
9. Gutwin C., Paynter G., Witten I., Nevill-Manning C., Frank E. Improving browsing in digital libraries with keyphrase indexes // Decision Support Systems 27(1-2), 1999, pp. 81-104.
10. S.A. Sharov. Frequency dictionary [Online]. Available: http://www.artint.ru/projects/frqlist.php [Accessed: 20.05.2017].