=Paper=
{{Paper
|id=Vol-2036/T3-4
|storemode=property
|title=An Extraction based approach to Keyword Generation and Precedence Retrieval: BITS Pilani - Hyderabad
|pdfUrl=https://ceur-ws.org/Vol-2036/T3-4.pdf
|volume=Vol-2036
|authors=G. V. Sandeep,Shikhar Bharadwaj
|dblpUrl=https://dblp.org/rec/conf/fire/SandeepB17
}}
==An Extraction based approach to Keyword Generation and Precedence Retrieval: BITS Pilani - Hyderabad==
An Extraction based approach to Keyword generation and Precedence Retrieval: BITS, Pilani - Hyderabad G. V. Sandeep Shikhar Bharadwaj Student, Department of Computer Science and Student, Department of Computer Science and Information Systems, BITS Hyderabad Information Systems, BITS Hyderabad Telangana, India Telangana, India gvsandeep2647@gmail.com shikhar.coder@gmail.com ABSTRACT (2) Given a case document, find documents of similar cases us- Precedence Retrieval is an information retrieval task that involves ing nearest neighbour approach. ranking the given set of documents according to their relevance to The rest of the paper is organized as follows: In Section - 2 we a query document. It is used for finding prior cases in common law first describe the tasks in detail which includes the assumptions, system. A prior or precedent case discussing the same issue can be methodology (motivated by [2]) and limitations. We present our used as a reference in the current case. With the increase in the dig- results in Section - 3 followed by possible enhancements in Section italisation of legal documents it is imperative to develop systems - 4. Finally, we conclude in Section - 5 for efficient precedence retrieval. This paper proposes a method for the same based on Keyword extraction and Nearest Neighbor algo- 2 TASK DESCRIPTION rithms. Keywords can be used to summarise a document. We have The FIRE 2017-IRLed track, motivated by the need for an efficient extracted the keywords for each document to be used in Prece- legal document retrieval system, had the following subtasks : dence Retrieval using TF-IDF and other relevance scores. The key- words are then used to rank the documents.The dataset for ex- perimentation was obtained from FIRE 2017 IRLeD track[6]. The 2.1 Subtask - 1 results of keyword extraction task have been expressed in Pre- 2.1.1 Description. Catchphrase Extraction Given a set of doc- cision@10, Precision@100, mean precision and mean recall met- uments, extraction of catchphrases that describe the content of rics. For the second task of Precedence Retrieval Precision@10, Re- the document. The data provided for the task consists of 100 doc- call@10, Mean Average Precision and Mean reciprocal rank have uments and their corresponding gold standard catchphrases for been used to depict results. training. The catchphrases were obtained from Manupatra which employs legal experts to annotate case documents with catchphrases. The test set consists of 300 separate documents whose catchphrases KEYWORDS were to be found. Information Retrieval,Nearest Neighbor,Vector Space Model, Doc- ument Vector, POS Tagging, Keyword extraction, Precedence Re- 2.1.2 Assumptions. In the generation of the catchphrases, we trieval, Legal Documents Retrieval System, Document Similarity assume that all catchphrase are single word. We have chosen an ex- tractive method for catchphrase generation rather than an abstrac- tive one. The method works under the assumptions that a word 1 INTRODUCTION that is more frequent has a higher possibility of being a keyword, Associating a document with a set of keywords can prove to be some parts of speech have higher a probability of being keywords extremely useful in various domains. Especially in the domain of and keywords tend to appear together in sentences. This is the cen- law, where the documents of previous similar cases are often used tral idea of our algorithm. as references, tagging them with a set of keywords is essential. 2.1.3 Methodology. The task is to summarise a document in Keywords give a very high - level description of a document. The the form of keywords in an extractive manner. This is done in reader can save himself a lot of time by quickly evaluating the rele- two stages. First, we shorten the sentences of each document by vance of the document to him by having a look at these keywords. throwing out unimportant words. Next, we work with shortened They play a crucial role in reducing the search space and also act sentences to extract keywords. as a tool to find similarity between two documents with drastically For determining whether a word is important enough to retain low cost. There are two ways to handle the task of generating key- in a sentence we have computed a metric for each word of the words for a given document. One approach is to agree to a set of document. If this metric exceeds the threshold we retain the word exhaustive keywords apriori and classify each document to one for further processing. This metric is a linear combination of term or more of these keywords. This approach is also known as text frequency of the word and its part of speech(POS) tag weight. To classification. Another approach is to find out words within the compute the POS tag weight, we performed an initial analysis on document which can represent the whole document. the provided golden catchphrases and found the frequencies of all In this paper, we accomplish two tasks: tags. The POS score for a word is then the normalized frequency (1) Generate keywords for a given document using extractive of the POS the word is tagged as. More formally, methods. word_weiдht = t f + α ∗ (POS Score) If this word_weight exceeds the threshold for the word we re- word tokenizer 1 . From this list, all those words which belong to tain the word to shorten the document. The threshold is calculated the set of stop words defined for the English language as per nltk for each word in the vocabulary(dictated by dataset corpora) sep- are removed. Also, all those words whose lengths are less than or arately. The threshold is the frequency which gives a minimum equal to four were also removed. For all the remaining words their error on the already available catchphrase and document pair. frequency in that particular document was recorded. This data was From the remaining words in the sentences, we count all fre- stored in a global dictionary which had its key as a word and value quencies in a given document and find the top k most frequent as a list of frequencies of that word in each document of the corpus. words. These are the popular words for the document. Then we This list was then condensed to remove all zero entries. construct a multiset S of all sentences that contain at least one After this, each word is given a weight which combines its IDF popular word. Next, we consider all words of all sentences in S score with that of the POS Score. The POS Score is estimated on and remove those that appear only once. From this filtered set of the basis of keywords given in Task - 1. POS tags of all the words candidate words, we keep only unique words.This forms our set were noted and based on the counts, the chances of a word being of final keywords. Then we compute the score for each word by a keyword given its POS tag was estimated. For Eg.: if there are summing up the TF-IDF score, word frequency and the number of two keywords with POS tag2 as ’VBG’ and three keywords with occurrences in S. This is the importance score of the word which POS tag as ’JJ’ then given a word with POS tag of ’JJ’, it has 0.6 determines how relevant the word is when describing the contents probability of being a keyword. This was to ensure that proper of the document. A sorted list of these words with normalized score nouns or other words which tend to have a higher IDF Score would is produced as output. be scaled properly. 2.1.4 Limitations. The limitations of our model arise from the word = IDF Score ∗ POS Score very assumptions that make the task simpler, namely: N IDF Score = (1 + log( d )) (1) Only one-word keywords can be extracted from the docu- df ment. The model ignores the fact that the catchphrase can Nd = Number of documents contain more than one word that describes the document. d f = Number of documents in which the word occurs (2) Sometimes a word that occurs not so frequently in the doc- Once each word had been associated with a weight, they were ument can be a keyword. Our model will definitely over- sorted in decreasing order of their weights and only the top 5000 look this fact. Conversely, our model will predict that cer- words were retained. This effectively meant that each document tain word is a keyword just because it occurs many times in would now be represented in this 5000 dimension space as a vec- the text. These are the corner cases we chose to ignore to tor. Suppose the first element of the vector was associated with the keep our method simple. word abducts which had a weight of 0.7. Now if in a given docu- ment, the word abducts occurred 4 times, then the vector repre- 2.2 Subtask - 2 senting the document would have its first element as 4 ∗ 0.7 = 2.8. 2.2.1 Description. Precedence Retrieval Given two sets of doc- In a similar manner, the entire vector was created for each docu- uments A and B, rank those in A according to their relevance for ment. each case in B. The data for precedence retrieval consists of: Query_docs Similarly, all cases in Prior_Cases were also represented in this - current cases, formed by removing the links to the prior cases and 5000-dimensional space. Each case in Current_Cases was then Object_docs - the prior cases which have been cited by the cases compared with each case in Prior_Cases. The similarity score was in Query_docs (links to which are removed from the Query_docs) given by the dot product of the two vectors representing the cases ăalong with someărandom cases (not among Query_docs). There and they were reported in the decreasing order of their similarity. are 200 documents in the Query_docs and more than 2000 in Ob- 2.2.4 Limitations. The following are the limitations of the model: ject_docs. For each case in the Query_docs, the task was to rank all the 2000+ cases in the Object_docs such that all the actually cited (1) This model may not give a high similarity score to docu- prior cases are ranked higher than other documents. The task is ments which use different words to convey the same mean- made challenging by adding cases in Object_docs that are not re- ing. Suppose a document extensively uses the word abducts lated to any Query_doc. and another document uses the word kidnaps instead. They may or may not be given a high similarity score although 2.2.2 Assumptions. The approach presented here uses a slight they cater to a similar category of cases. variation of the Bag of words representation of a document to find (2) Other than the fact that the documents have similar word the similarity among them. Thus, it assumes that similar cases will usage no other explanation can be given to the results ob- have similar word usage in their body. Thus, higher correlation tained. In other words, results are not explainable. between the words used and case’s context would lead to better (3) This approach can be very time-consuming. For each query results. document, it has to find its similarity with all the existing cases in the 5000-dimensional space which itself might change 2.2.3 Methodology. In the preprocessing phase, each case doc- with the addition of new documents to the corpus. ument from the Current_Cases folder was read line by line. Every 1 http://www.nltk.org/api/nltk.tokenize.html line was then tokenized into a list of words using nltk packages’ 2 http://www.nltk.org/book/ch05.html 2 3 RESULTS Another promising idea has been presented in [1] where the Our algorithm was run on the dataset provided by FIRE 2017-IRLeD authors have used Naive Bayes Classifier on corpus pertaining to track. The results produced by our model were compared with a single domain. They have used features like T F X IDF Score and those tagged manually by legal experts and precision and recall Distance (of the catchphrase from the beginning of the sentence) values were calculated. For the second task, mean reciprocal rank to build their model. The Kea catchphrase generation algorithm was also calculated. The results are as follows: was used to generate set of candidate catchphrases. Since the main purpose of keyphrases or keywords is to give Table 1: Subtask - 1 Results a high-level description of the document, techniques which build a lattice of concepts for a given document will enhance results as Methods/ Mean R Mean Mean Mean Overall well as explainability. This phenomenon has been used as a back- Evalua- preci- Preci- Recall Aver- Recall bone in both [9]. Similarly, semantic chains BioChain has been used tion sion sion at at 100 age on the domain-specific corpus in [7]. Along with this they also Metrics 10 Preci- proposed FreqDist - a frequency distribution based approach and sion a hybrid method which combines both BioChain and FreqDist. bphc_with 0.065781 0.102 0.13650 0.16067 0.16548 All these methods have shown improvements on their tradi- POS_1 20144 79757 55473 09155 tional counterparts on the basis of ROGUE score. Details and for- mulation of ROGUE scores can be found in [5] For Subtask-1, table 1 shows the results. We were ranked fifth with a Mean Average Precision(MAP) of 16.1% and an overall re- 5 CONCLUSION call of 16.5%. When compared with other systems submitted in Keywords are a significant help to lawyers for determining prece- the competition, it can be noticed that our system provided almost dents for a case. In this paper, we proposed a very simple key- equal weight to precision and recall. word extraction algorithm and a precedence retrieval algorithm that uses our keyword extraction algorithm to work. The keyword Table 2: Subtask - 2 Results extraction algorithm utilizes both the frequency of words and the POS tag of the word to determine its importance. The precedence Team_ID/ Mean Mean Re- Precision Recall at retrieval algorithm builds a vector for each document and com- Methods Average ciprocal at 10 10 putes the similarities accordingly. Despite the simplicity of our Precision Rank model we achieved surprisingly good results. bphcTASK2 0.0711 0.1975 0.06 0.28 IRLeD 6 ACKNOWLEDGEMENTS We would like to express sincere thanks to our mentor Dr. Aruna For subtask-2, our system’s results are depicted in table 2. We Malapati3 , Department of Computer Science and Information Sys- had the MAP of 7.11% and a mean reciprocal rank of 19.75%. For a tems, BITS, Pilani - Hyderabad Campus for her constant guidance system as basic as ours, it is a surprisingly good result. and also for the help in writing this paper. 4 POSSIBLE ENHANCEMENTS REFERENCES [1] Eibe Frank, Gordon W Paynter, Ian H Witten, Carl Gutwin, and Craig G Nevill- The models proposed for both the tasks rely highly on the impor- Manning. 1999. Domain-specific keyphrase extraction. In 16th International Joint tance of individual words. Thus, one straightforward enhancement Conference on Artificial Intelligence (IJCAI 99), Vol. 2. Morgan Kaufmann Publish- would be to generate keyphrases and extend it to extract catch- ers Inc., San Francisco, CA, USA, 668–673. [2] Filippo Galgani, Paul Compton, and Achim Hoffmann. 2012. Towards automatic phrases. One of the possible enhancement could be to exploit the generation of catchphrases for legal case reports. Computational Linguistics and semantic similarity between words as suggested in [3]. In this pa- Intelligent Text Processing (2012), 414–425. [3] Mohamed H Haggag. 2013. Keyword extraction using semantic analysis. Inter- per, the authors have used features of WordNet along with the national Journal of Computer Applications 61, 1 (2013). standard preprocessing of stop words removal and stemming to [4] Michael Lesk. 1986. Automatic sense disambiguation using machine readable group semantically similar words and replace them with the most dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation. ACM, 24–26. commonly occurring term of that group. The basic idea is to find [5] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. similarity on the basis of context as against to word usage. They In Text summarization branches out: Proceedings of the ACL-04 workshop, Vol. 8. also took help of the Michael Lesk Algorithm [4] to disambiguate Barcelona, Spain. [6] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Sap- words in the sentence context. tarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from As our model takes advantage of the fact that words with a cer- Legal Documents (IRLeD). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org. tain POS tags are more likely to be keywords, better POS Tagging [7] Lawrence H Reeve, Hyoil Han, and Ari D Brooks. 2007. The use of domain-specific techniques will obviously help the model achieve better results. As concepts in biomedical text summarization. Information Processing & Manage- suggested in [8], POS tagging which considers the tags of the sur- ment 43, 6 (2007), 1765–1776. [8] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. rounding words as well as joint features of current word and sur- Feature-rich part-of-speech tagging with a cyclic dependency network. In Pro- rounding words outperforms the traditional taggers. They have ceedings of the 2003 Conference of the North American Chapter of the Association achieved better results in tagging by using this idea along with 3 http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile lexicalization and smoothing. 3 for Computational Linguistics on Human Language Technology-Volume 1. Associa- tion for Computational Linguistics, 173–180. [9] Shiren Ye, Tat-Seng Chua, Min-Yen Kan, and Long Qiu. 2007. Document con- cept lattice for text understanding and summarization. Information Processing & Management 43, 6 (2007), 1643–1662. 4