=Paper=
{{Paper
|id=Vol-2036/T3-4
|storemode=property
|title=An Extraction based approach to Keyword Generation and Precedence Retrieval: BITS Pilani - Hyderabad
|pdfUrl=https://ceur-ws.org/Vol-2036/T3-4.pdf
|volume=Vol-2036
|authors=G. V. Sandeep,Shikhar Bharadwaj
|dblpUrl=https://dblp.org/rec/conf/fire/SandeepB17
}}
==An Extraction based approach to Keyword Generation and Precedence Retrieval: BITS Pilani - Hyderabad==
<pdf width="1500px">https://ceur-ws.org/Vol-2036/T3-4.pdf</pdf>
<pre>
       An Extraction based approach to Keyword generation and
            Precedence Retrieval: BITS, Pilani - Hyderabad
                           G. V. Sandeep                                                   Shikhar Bharadwaj
          Student, Department of Computer Science and                         Student, Department of Computer Science and
              Information Systems, BITS Hyderabad                                 Information Systems, BITS Hyderabad
                        Telangana, India                                                    Telangana, India
                   gvsandeep2647@gmail.com                                              shikhar.coder@gmail.com

ABSTRACT                                                                    (2) Given a case document, find documents of similar cases us-
Precedence Retrieval is an information retrieval task that involves             ing nearest neighbour approach.
ranking the given set of documents according to their relevance to           The rest of the paper is organized as follows: In Section - 2 we
a query document. It is used for finding prior cases in common law        first describe the tasks in detail which includes the assumptions,
system. A prior or precedent case discussing the same issue can be        methodology (motivated by [2]) and limitations. We present our
used as a reference in the current case. With the increase in the dig-    results in Section - 3 followed by possible enhancements in Section
italisation of legal documents it is imperative to develop systems       - 4. Finally, we conclude in Section - 5
for efficient precedence retrieval. This paper proposes a method for
the same based on Keyword extraction and Nearest Neighbor algo-          2 TASK DESCRIPTION
rithms. Keywords can be used to summarise a document. We have
                                                                         The FIRE 2017-IRLed track, motivated by the need for an efficient
extracted the keywords for each document to be used in Prece-
                                                                         legal document retrieval system, had the following subtasks :
dence Retrieval using TF-IDF and other relevance scores. The key-
words are then used to rank the documents.The dataset for ex-
perimentation was obtained from FIRE 2017 IRLeD track[6]. The
                                                                         2.1 Subtask - 1
results of keyword extraction task have been expressed in Pre-              2.1.1 Description. Catchphrase Extraction Given a set of doc-
cision@10, Precision@100, mean precision and mean recall met-            uments, extraction of catchphrases that describe the content of
rics. For the second task of Precedence Retrieval Precision@10, Re-      the document. The data provided for the task consists of 100 doc-
call@10, Mean Average Precision and Mean reciprocal rank have            uments and their corresponding gold standard catchphrases for
been used to depict results.                                             training. The catchphrases were obtained from Manupatra which
                                                                         employs legal experts to annotate case documents with catchphrases.
                                                                         The test set consists of 300 separate documents whose catchphrases
KEYWORDS
                                                                         were to be found.
Information Retrieval,Nearest Neighbor,Vector Space Model, Doc-
ument Vector, POS Tagging, Keyword extraction, Precedence Re-               2.1.2 Assumptions. In the generation of the catchphrases, we
trieval, Legal Documents Retrieval System, Document Similarity           assume that all catchphrase are single word. We have chosen an ex-
                                                                         tractive method for catchphrase generation rather than an abstrac-
                                                                         tive one. The method works under the assumptions that a word
1 INTRODUCTION                                                           that is more frequent has a higher possibility of being a keyword,
Associating a document with a set of keywords can prove to be            some parts of speech have higher a probability of being keywords
extremely useful in various domains. Especially in the domain of         and keywords tend to appear together in sentences. This is the cen-
law, where the documents of previous similar cases are often used        tral idea of our algorithm.
as references, tagging them with a set of keywords is essential.
                                                                            2.1.3 Methodology. The task is to summarise a document in
Keywords give a very high - level description of a document. The
                                                                         the form of keywords in an extractive manner. This is done in
reader can save himself a lot of time by quickly evaluating the rele-
                                                                         two stages. First, we shorten the sentences of each document by
vance of the document to him by having a look at these keywords.
                                                                         throwing out unimportant words. Next, we work with shortened
They play a crucial role in reducing the search space and also act
                                                                         sentences to extract keywords.
as a tool to find similarity between two documents with drastically
                                                                            For determining whether a word is important enough to retain
low cost. There are two ways to handle the task of generating key-
                                                                         in a sentence we have computed a metric for each word of the
words for a given document. One approach is to agree to a set of
                                                                         document. If this metric exceeds the threshold we retain the word
exhaustive keywords apriori and classify each document to one
                                                                         for further processing. This metric is a linear combination of term
or more of these keywords. This approach is also known as text
                                                                         frequency of the word and its part of speech(POS) tag weight. To
classification. Another approach is to find out words within the
                                                                         compute the POS tag weight, we performed an initial analysis on
document which can represent the whole document.
                                                                         the provided golden catchphrases and found the frequencies of all
   In this paper, we accomplish two tasks:
                                                                         tags. The POS score for a word is then the normalized frequency
   (1) Generate keywords for a given document using extractive           of the POS the word is tagged as. More formally,
       methods.                                                             word_weiдht = t f + α ∗ (POS Score)
   If this word_weight exceeds the threshold for the word we re-           word tokenizer 1 . From this list, all those words which belong to
tain the word to shorten the document. The threshold is calculated         the set of stop words defined for the English language as per nltk
for each word in the vocabulary(dictated by dataset corpora) sep-          are removed. Also, all those words whose lengths are less than or
arately. The threshold is the frequency which gives a minimum              equal to four were also removed. For all the remaining words their
error on the already available catchphrase and document pair.              frequency in that particular document was recorded. This data was
   From the remaining words in the sentences, we count all fre-            stored in a global dictionary which had its key as a word and value
quencies in a given document and find the top k most frequent              as a list of frequencies of that word in each document of the corpus.
words. These are the popular words for the document. Then we               This list was then condensed to remove all zero entries.
construct a multiset S of all sentences that contain at least one             After this, each word is given a weight which combines its IDF
popular word. Next, we consider all words of all sentences in S            score with that of the POS Score. The POS Score is estimated on
and remove those that appear only once. From this filtered set of          the basis of keywords given in Task - 1. POS tags of all the words
candidate words, we keep only unique words.This forms our set              were noted and based on the counts, the chances of a word being
of final keywords. Then we compute the score for each word by              a keyword given its POS tag was estimated. For Eg.: if there are
summing up the TF-IDF score, word frequency and the number of              two keywords with POS tag2 as ’VBG’ and three keywords with
occurrences in S. This is the importance score of the word which           POS tag as ’JJ’ then given a word with POS tag of ’JJ’, it has 0.6
determines how relevant the word is when describing the contents           probability of being a keyword. This was to ensure that proper
of the document. A sorted list of these words with normalized score        nouns or other words which tend to have a higher IDF Score would
is produced as output.                                                     be scaled properly.

   2.1.4 Limitations. The limitations of our model arise from the                            word = IDF Score ∗ POS Score
very assumptions that make the task simpler, namely:                                                               N
                                                                                              IDF Score = (1 + log( d ))
   (1) Only one-word keywords can be extracted from the docu-                                                      df
       ment. The model ignores the fact that the catchphrase can
                                                                           Nd = Number of documents
       contain more than one word that describes the document.
                                                                           d f = Number of documents in which the word occurs
   (2) Sometimes a word that occurs not so frequently in the doc-
                                                                               Once each word had been associated with a weight, they were
       ument can be a keyword. Our model will definitely over-
                                                                           sorted in decreasing order of their weights and only the top 5000
       look this fact. Conversely, our model will predict that cer-
                                                                           words were retained. This effectively meant that each document
       tain word is a keyword just because it occurs many times in
                                                                           would now be represented in this 5000 dimension space as a vec-
       the text. These are the corner cases we chose to ignore to
                                                                           tor. Suppose the first element of the vector was associated with the
       keep our method simple.
                                                                           word abducts which had a weight of 0.7. Now if in a given docu-
                                                                           ment, the word abducts occurred 4 times, then the vector repre-
 2.2 Subtask - 2                                                           senting the document would have its first element as 4 ∗ 0.7 = 2.8.
    2.2.1 Description. Precedence Retrieval Given two sets of doc-         In a similar manner, the entire vector was created for each docu-
 uments A and B, rank those in A according to their relevance for          ment.
 each case in B. The data for precedence retrieval consists of: Query_docs     Similarly, all cases in Prior_Cases were also represented in this
- current cases, formed by removing the links to the prior cases and       5000-dimensional space. Each case in Current_Cases was then
 Object_docs - the prior cases which have been cited by the cases          compared with each case in Prior_Cases. The similarity score was
 in Query_docs (links to which are removed from the Query_docs)            given by the dot product of the two vectors representing the cases
 ăalong with someărandom cases (not among Query_docs). There               and they were reported in the decreasing order of their similarity.
 are 200 documents in the Query_docs and more than 2000 in Ob-
                                                                               2.2.4 Limitations. The following are the limitations of the model:
 ject_docs. For each case in the Query_docs, the task was to rank all
 the 2000+ cases in the Object_docs such that all the actually cited            (1) This model may not give a high similarity score to docu-
 prior cases are ranked higher than other documents. The task is                     ments which use different words to convey the same mean-
 made challenging by adding cases in Object_docs that are not re-                    ing. Suppose a document extensively uses the word abducts
 lated to any Query_doc.                                                             and another document uses the word kidnaps instead. They
                                                                                     may or may not be given a high similarity score although
    2.2.2 Assumptions. The approach presented here uses a slight                     they cater to a similar category of cases.
variation of the Bag of words representation of a document to find              (2) Other than the fact that the documents have similar word
 the similarity among them. Thus, it assumes that similar cases will                 usage no other explanation can be given to the results ob-
 have similar word usage in their body. Thus, higher correlation                     tained. In other words, results are not explainable.
 between the words used and case’s context would lead to better                 (3) This approach can be very time-consuming. For each query
 results.                                                                            document, it has to find its similarity with all the existing
                                                                                     cases in the 5000-dimensional space which itself might change
    2.2.3 Methodology. In the preprocessing phase, each case doc-                    with the addition of new documents to the corpus.
 ument from the Current_Cases folder was read line by line. Every          1 http://www.nltk.org/api/nltk.tokenize.html
 line was then tokenized into a list of words using nltk packages’         2 http://www.nltk.org/book/ch05.html

                                                                       2
3 RESULTS                                                                      Another promising idea has been presented in [1] where the
Our algorithm was run on the dataset provided by FIRE 2017-IRLeD            authors have used Naive Bayes Classifier on corpus pertaining to
track. The results produced by our model were compared with                 a single domain. They have used features like T F X IDF Score and
those tagged manually by legal experts and precision and recall             Distance (of the catchphrase from the beginning of the sentence)
values were calculated. For the second task, mean reciprocal rank           to build their model. The Kea catchphrase generation algorithm
was also calculated. The results are as follows:                            was used to generate set of candidate catchphrases.
                                                                               Since the main purpose of keyphrases or keywords is to give
                  Table 1: Subtask - 1 Results                              a high-level description of the document, techniques which build
                                                                            a lattice of concepts for a given document will enhance results as
 Methods/ Mean R       Mean       Mean            Mean        Overall       well as explainability. This phenomenon has been used as a back-
 Evalua- preci-        Preci-     Recall          Aver-       Recall        bone in both [9]. Similarly, semantic chains BioChain has been used
 tion     sion         sion at    at 100          age                       on the domain-specific corpus in [7]. Along with this they also
 Metrics               10                         Preci-                    proposed FreqDist - a frequency distribution based approach and
                                                  sion                      a hybrid method which combines both BioChain and FreqDist.
 bphc_with 0.065781 0.102         0.13650         0.16067     0.16548          All these methods have shown improvements on their tradi-
 POS_1     20144                  79757           55473       09155         tional counterparts on the basis of ROGUE score. Details and for-
                                                                            mulation of ROGUE scores can be found in [5]
   For Subtask-1, table 1 shows the results. We were ranked fifth
with a Mean Average Precision(MAP) of 16.1% and an overall re-
                                                                            5 CONCLUSION
call of 16.5%. When compared with other systems submitted in                Keywords are a significant help to lawyers for determining prece-
the competition, it can be noticed that our system provided almost          dents for a case. In this paper, we proposed a very simple key-
equal weight to precision and recall.                                       word extraction algorithm and a precedence retrieval algorithm
                                                                            that uses our keyword extraction algorithm to work. The keyword
                  Table 2: Subtask - 2 Results                              extraction algorithm utilizes both the frequency of words and the
                                                                            POS tag of the word to determine its importance. The precedence
 Team_ID/  Mean             Mean Re-       Precision        Recall at       retrieval algorithm builds a vector for each document and com-
 Methods   Average          ciprocal       at 10            10              putes the similarities accordingly. Despite the simplicity of our
           Precision        Rank                                            model we achieved surprisingly good results.
 bphcTASK2 0.0711           0.1975         0.06             0.28
 IRLeD                                                                      6 ACKNOWLEDGEMENTS
                                                                            We would like to express sincere thanks to our mentor Dr. Aruna
   For subtask-2, our system’s results are depicted in table 2. We          Malapati3 , Department of Computer Science and Information Sys-
had the MAP of 7.11% and a mean reciprocal rank of 19.75%. For a            tems, BITS, Pilani - Hyderabad Campus for her constant guidance
system as basic as ours, it is a surprisingly good result.                  and also for the help in writing this paper.

4 POSSIBLE ENHANCEMENTS                                                     REFERENCES
                                                                            [1] Eibe Frank, Gordon W Paynter, Ian H Witten, Carl Gutwin, and Craig G Nevill-
The models proposed for both the tasks rely highly on the impor-                Manning. 1999. Domain-specific keyphrase extraction. In 16th International Joint
tance of individual words. Thus, one straightforward enhancement                Conference on Artificial Intelligence (IJCAI 99), Vol. 2. Morgan Kaufmann Publish-
would be to generate keyphrases and extend it to extract catch-                 ers Inc., San Francisco, CA, USA, 668–673.
                                                                            [2] Filippo Galgani, Paul Compton, and Achim Hoffmann. 2012. Towards automatic
phrases. One of the possible enhancement could be to exploit the                generation of catchphrases for legal case reports. Computational Linguistics and
semantic similarity between words as suggested in [3]. In this pa-              Intelligent Text Processing (2012), 414–425.
                                                                            [3] Mohamed H Haggag. 2013. Keyword extraction using semantic analysis. Inter-
per, the authors have used features of WordNet along with the                   national Journal of Computer Applications 61, 1 (2013).
standard preprocessing of stop words removal and stemming to                [4] Michael Lesk. 1986. Automatic sense disambiguation using machine readable
group semantically similar words and replace them with the most                 dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the
                                                                                5th annual international conference on Systems documentation. ACM, 24–26.
commonly occurring term of that group. The basic idea is to find            [5] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries.
similarity on the basis of context as against to word usage. They               In Text summarization branches out: Proceedings of the ACL-04 workshop, Vol. 8.
also took help of the Michael Lesk Algorithm [4] to disambiguate                Barcelona, Spain.
                                                                            [6] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Sap-
words in the sentence context.                                                  tarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from
   As our model takes advantage of the fact that words with a cer-              Legal Documents (IRLeD). In Working notes of FIRE 2017 - Forum for Information
                                                                                Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
tain POS tags are more likely to be keywords, better POS Tagging            [7] Lawrence H Reeve, Hyoil Han, and Ari D Brooks. 2007. The use of domain-specific
techniques will obviously help the model achieve better results. As             concepts in biomedical text summarization. Information Processing & Manage-
suggested in [8], POS tagging which considers the tags of the sur-              ment 43, 6 (2007), 1765–1776.
                                                                            [8] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003.
rounding words as well as joint features of current word and sur-               Feature-rich part-of-speech tagging with a cyclic dependency network. In Pro-
rounding words outperforms the traditional taggers. They have                   ceedings of the 2003 Conference of the North American Chapter of the Association
achieved better results in tagging by using this idea along with
                                                                            3 http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile
lexicalization and smoothing.
                                                                        3
    for Computational Linguistics on Human Language Technology-Volume 1. Associa-
    tion for Computational Linguistics, 173–180.
[9] Shiren Ye, Tat-Seng Chua, Min-Yen Kan, and Long Qiu. 2007. Document con-
    cept lattice for text understanding and summarization. Information Processing &
    Management 43, 6 (2007), 1643–1662.


                                                                                      4

</pre>