<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Extraction based approach to Keyword generation and Precedence Retrieval: BITS, Pilani - Hyderabad</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>G. V. Sandeep</string-name>
          <email>gvsandeep2647@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Information Retrieval,Nearest Neighbor,Vector Space Model, Doc-</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shikhar Bharadwaj</string-name>
          <email>shikhar.coder@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Student, Department of Computer Science and</institution>
          ,
          <addr-line>Information Systems, BITS Hyderabad, Telangana</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ument Vector, POS Tagging, Keyword extraction, Precedence Retrieval, Legal Documents Retrieval System</institution>
          ,
          <addr-line>Document Similarity</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Precedence Retrieval is an information retrieval task that involves ranking the given set of documents according to their relevance to a query document. It is used for finding prior cases in common law system. A prior or precedent case discussing the same issue can be used as a reference in the current case. With the increase in the digitalisation of legal documents it is imperative to develop systems for eficient precedence retrieval. This paper proposes a method for the same based on Keyword extraction and Nearest Neighbor algorithms. Keywords can be used to summarise a document. We have extracted the keywords for each document to be used in Precedence Retrieval using TF-IDF and other relevance scores. The keywords are then used to rank the documents.The dataset for experimentation was obtained from FIRE 2017 IRLeD track[6]. The results of keyword extraction task have been expressed in Precision@10, Precision@100, mean precision and mean recall metrics. For the second task of Precedence Retrieval Precision@10, Recall@10, Mean Average Precision and Mean reciprocal rank have been used to depict results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Associating a document with a set of keywords can prove to be
extremely useful in various domains. Especially in the domain of
law, where the documents of previous similar cases are often used
as references, tagging them with a set of keywords is essential.
Keywords give a very high - level description of a document. The
reader can save himself a lot of time by quickly evaluating the
relevance of the document to him by having a look at these keywords.
They play a crucial role in reducing the search space and also act
as a tool to find similarity between two documents with drastically
low cost. There are two ways to handle the task of generating
keywords for a given document. One approach is to agree to a set of
exhaustive keywords apriori and classify each document to one
or more of these keywords. This approach is also known as text
classification. Another approach is to find out words within the
document which can represent the whole document.</p>
      <p>In this paper, we accomplish two tasks:
(1) Generate keywords for a given document using extractive
methods.
(2) Given a case document, find documents of similar cases
using nearest neighbour approach.</p>
      <p>
        The rest of the paper is organized as follows: In Section - 2 we
ifrst describe the tasks in detail which includes the assumptions,
methodology (motivated by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and limitations. We present our
results in Section - 3 followed by possible enhancements in Section
- 4. Finally, we conclude in Section - 5
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>TASK DESCRIPTION</title>
      <p>The FIRE 2017-IRLed track, motivated by the need for an eficient
legal document retrieval system, had the following subtasks :
2.1</p>
      <p>Subtask - 1</p>
      <sec id="sec-2-1">
        <title>2.1.1 Description. Catchphrase Extraction Given a set of doc</title>
        <p>uments, extraction of catchphrases that describe the content of
the document. The data provided for the task consists of 100
documents and their corresponding gold standard catchphrases for
training. The catchphrases were obtained from Manupatra which
employs legal experts to annotate case documents with catchphrases.
The test set consists of 300 separate documents whose catchphrases
were to be found.</p>
        <p>2.1.2 Assumptions. In the generation of the catchphrases, we
assume that all catchphrase are single word. We have chosen an
extractive method for catchphrase generation rather than an
abstractive one. The method works under the assumptions that a word
that is more frequent has a higher possibility of being a keyword,
some parts of speech have higher a probability of being keywords
and keywords tend to appear together in sentences. This is the
central idea of our algorithm.</p>
        <p>2.1.3 Methodology. The task is to summarise a document in
the form of keywords in an extractive manner. This is done in
two stages. First, we shorten the sentences of each document by
throwing out unimportant words. Next, we work with shortened
sentences to extract keywords.</p>
        <p>For determining whether a word is important enough to retain
in a sentence we have computed a metric for each word of the
document. If this metric exceeds the threshold we retain the word
for further processing. This metric is a linear combination of term
frequency of the word and its part of speech(POS) tag weight. To
compute the POS tag weight, we performed an initial analysis on
the provided golden catchphrases and found the frequencies of all
tags. The POS score for a word is then the normalized frequency
of the POS the word is tagged as. More formally,
word_weiдht = tf + α (POS Score )</p>
        <p>If this word_weight exceeds the threshold for the word we
retain the word to shorten the document. The threshold is calculated
for each word in the vocabulary(dictated by dataset corpora)
separately. The threshold is the frequency which gives a minimum
error on the already available catchphrase and document pair.</p>
        <p>From the remaining words in the sentences, we count all
frequencies in a given document and find the top k most frequent
words. These are the popular words for the document. Then we
construct a multiset S of all sentences that contain at least one
popular word. Next, we consider all words of all sentences in S
and remove those that appear only once. From this filtered set of
candidate words, we keep only unique words.This forms our set
of final keywords. Then we compute the score for each word by
summing up the TF-IDF score, word frequency and the number of
occurrences in S. This is the importance score of the word which
determines how relevant the word is when describing the contents
of the document. A sorted list of these words with normalized score
is produced as output.</p>
        <p>2.1.4 Limitations. The limitations of our model arise from the
very assumptions that make the task simpler, namely:
(1) Only one-word keywords can be extracted from the
document. The model ignores the fact that the catchphrase can
contain more than one word that describes the document.
(2) Sometimes a word that occurs not so frequently in the
document can be a keyword. Our model will definitely
overlook this fact. Conversely, our model will predict that
certain word is a keyword just because it occurs many times in
the text. These are the corner cases we chose to ignore to
keep our method simple.
2.2</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Subtask - 2</title>
      <sec id="sec-3-1">
        <title>2.2.1 Description. Precedence Retrieval Given two sets of doc</title>
        <p>uments A and B, rank those in A according to their relevance for
each case in B. The data for precedence retrieval consists of: Query_docs
- current cases, formed by removing the links to the prior cases and
Object_docs - the prior cases which have been cited by the cases
in Query_docs (links to which are removed from the Query_docs)
ăalong with someărandom cases (not among Query_docs). There
are 200 documents in the Query_docs and more than 2000 in
Object_docs. For each case in the Query_docs, the task was to rank all
the 2000+ cases in the Object_docs such that all the actually cited
prior cases are ranked higher than other documents. The task is
made challenging by adding cases in Object_docs that are not
related to any Query_doc.</p>
        <p>2.2.2 Assumptions. The approach presented here uses a slight
variation of the Bag of words representation of a document to find
the similarity among them. Thus, it assumes that similar cases will
have similar word usage in their body. Thus, higher correlation
between the words used and case’s context would lead to better
results.</p>
        <p>2.2.3 Methodology. In the preprocessing phase, each case
document from the Current_Cases folder was read line by line. Every
line was then tokenized into a list of words using nltk packages’
word tokenizer 1. From this list, all those words which belong to
the set of stop words defined for the English language as per nltk
are removed. Also, all those words whose lengths are less than or
equal to four were also removed. For all the remaining words their
frequency in that particular document was recorded. This data was
stored in a global dictionary which had its key as a word and value
as a list of frequencies of that word in each document of the corpus.
This list was then condensed to remove all zero entries.</p>
        <p>After this, each word is given a weight which combines its IDF
score with that of the POS Score. The POS Score is estimated on
the basis of keywords given in Task - 1. POS tags of all the words
were noted and based on the counts, the chances of a word being
a keyword given its POS tag was estimated. For Eg.: if there are
two keywords with POS tag2 as ’VBG’ and three keywords with
POS tag as ’JJ’ then given a word with POS tag of ’JJ’, it has 0.6
probability of being a keyword. This was to ensure that proper
nouns or other words which tend to have a higher IDF Score would
be scaled properly.</p>
        <p>word = I DF Score POS Score</p>
        <p>I DF Score = (1 + log( Nd ))
d f
Nd = Number of documents
d f = Number of documents in which the word occurs</p>
        <p>Once each word had been associated with a weight, they were
sorted in decreasing order of their weights and only the top 5000
words were retained. This efectively meant that each document
would now be represented in this 5000 dimension space as a
vector. Suppose the first element of the vector was associated with the
word abducts which had a weight of 0.7. Now if in a given
document, the word abducts occurred 4 times, then the vector
representing the document would have its first element as 4 0:7 = 2:8.
In a similar manner, the entire vector was created for each
document.</p>
        <p>Similarly, all cases in Prior_Cases were also represented in this
5000-dimensional space. Each case in Current_Cases was then
compared with each case in Prior_Cases. The similarity score was
given by the dot product of the two vectors representing the cases
and they were reported in the decreasing order of their similarity.
2.2.4 Limitations. The following are the limitations of the model:
(1) This model may not give a high similarity score to
documents which use diferent words to convey the same
meaning. Suppose a document extensively uses the word abducts
and another document uses the word kidnaps instead. They
may or may not be given a high similarity score although
they cater to a similar category of cases.
(2) Other than the fact that the documents have similar word
usage no other explanation can be given to the results
obtained. In other words, results are not explainable.
(3) This approach can be very time-consuming. For each query
document, it has to find its similarity with all the existing
cases in the 5000-dimensional space which itself might change
with the addition of new documents to the corpus.
1http://www.nltk.org/api/nltk.tokenize.html
2http://www.nltk.org/book/ch05.html</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 RESULTS</title>
      <p>Our algorithm was run on the dataset provided by FIRE 2017-IRLeD
track. The results produced by our model were compared with
those tagged manually by legal experts and precision and recall
values were calculated. For the second task, mean reciprocal rank
was also calculated. The results are as follows:</p>
      <p>For Subtask-1, table 1 shows the results. We were ranked fifth
with a Mean Average Precision(MAP) of 16.1% and an overall
recall of 16.5%. When compared with other systems submitted in
the competition, it can be noticed that our system provided almost
equal weight to precision and recall.</p>
      <p>For subtask-2, our system’s results are depicted in table 2. We
had the MAP of 7.11% and a mean reciprocal rank of 19.75%. For a
system as basic as ours, it is a surprisingly good result.</p>
    </sec>
    <sec id="sec-5">
      <title>4 POSSIBLE ENHANCEMENTS</title>
      <p>
        The models proposed for both the tasks rely highly on the
importance of individual words. Thus, one straightforward enhancement
would be to generate keyphrases and extend it to extract
catchphrases. One of the possible enhancement could be to exploit the
semantic similarity between words as suggested in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this
paper, the authors have used features of WordNet along with the
standard preprocessing of stop words removal and stemming to
group semantically similar words and replace them with the most
commonly occurring term of that group. The basic idea is to find
similarity on the basis of context as against to word usage. They
also took help of the Michael Lesk Algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to disambiguate
words in the sentence context.
      </p>
      <p>
        As our model takes advantage of the fact that words with a
certain POS tags are more likely to be keywords, better POS Tagging
techniques will obviously help the model achieve better results. As
suggested in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], POS tagging which considers the tags of the
surrounding words as well as joint features of current word and
surrounding words outperforms the traditional taggers. They have
achieved better results in tagging by using this idea along with
lexicalization and smoothing.
      </p>
      <p>
        Another promising idea has been presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] where the
authors have used Naive Bayes Classifier on corpus pertaining to
a single domain. They have used features like T F X I DF Score and
Distance (of the catchphrase from the beginning of the sentence)
to build their model. The Kea catchphrase generation algorithm
was used to generate set of candidate catchphrases.
      </p>
      <p>
        Since the main purpose of keyphrases or keywords is to give
a high-level description of the document, techniques which build
a lattice of concepts for a given document will enhance results as
well as explainability. This phenomenon has been used as a
backbone in both [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Similarly, semantic chains BioChain has been used
on the domain-specific corpus in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Along with this they also
proposed FreqDist - a frequency distribution based approach and
a hybrid method which combines both BioChain and FreqDist.
      </p>
      <p>
        All these methods have shown improvements on their
traditional counterparts on the basis of ROGUE score. Details and
formulation of ROGUE scores can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
      </p>
    </sec>
    <sec id="sec-6">
      <title>5 CONCLUSION</title>
      <p>Keywords are a significant help to lawyers for determining
precedents for a case. In this paper, we proposed a very simple
keyword extraction algorithm and a precedence retrieval algorithm
that uses our keyword extraction algorithm to work. The keyword
extraction algorithm utilizes both the frequency of words and the
POS tag of the word to determine its importance. The precedence
retrieval algorithm builds a vector for each document and
computes the similarities accordingly. Despite the simplicity of our
model we achieved surprisingly good results.</p>
    </sec>
    <sec id="sec-7">
      <title>6 ACKNOWLEDGEMENTS</title>
      <p>We would like to express sincere thanks to our mentor Dr. Aruna
Malapati3, Department of Computer Science and Information
Systems, BITS, Pilani - Hyderabad Campus for her constant guidance
and also for the help in writing this paper.
3http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Eibe</given-names>
            <surname>Frank</surname>
          </string-name>
          , Gordon W Paynter, Ian H Witten, Carl Gutwin, and Craig G NevillManning.
          <year>1999</year>
          .
          <article-title>Domain-specific keyphrase extraction</article-title>
          .
          <source>In 16th International Joint Conference on Artificial Intelligence (IJCAI 99)</source>
          , Vol.
          <volume>2</volume>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <fpage>668</fpage>
          -
          <lpage>673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Filippo</given-names>
            <surname>Galgani</surname>
          </string-name>
          , Paul Compton, and
          <string-name>
            <given-names>Achim</given-names>
            <surname>Hofmann</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Towards automatic generation of catchphrases for legal case reports</article-title>
          .
          <source>Computational Linguistics and Intelligent Text Processing</source>
          (
          <year>2012</year>
          ),
          <fpage>414</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Mohamed</surname>
            <given-names>H</given-names>
          </string-name>
          <string-name>
            <surname>Haggag</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Keyword extraction using semantic analysis</article-title>
          .
          <source>International Journal of Computer Applications</source>
          <volume>61</volume>
          ,
          <issue>1</issue>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Lesk</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone</article-title>
          .
          <source>In Proceedings of the 5th annual international conference on Systems documentation. ACM</source>
          ,
          <volume>24</volume>
          -
          <fpage>26</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
          .
          <source>In Text summarization branches out: Proceedings of the ACL-04 workshop</source>
          , Vol.
          <volume>8</volume>
          .
          <string-name>
            <surname>Barcelona</surname>
          </string-name>
          , Spain.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Arpan</given-names>
            <surname>Mandal</surname>
          </string-name>
          , Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and
          <string-name>
            <given-names>Saptarshi</given-names>
            <surname>Ghosh</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD)</article-title>
          .
          <source>In Working notes of FIRE</source>
          <year>2017</year>
          <article-title>- Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings</article-title>
          ).
          <source>CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Lawrence</surname>
            <given-names>H Reeve</given-names>
          </string-name>
          , Hyoil Han, and
          <string-name>
            <given-names>Ari D</given-names>
            <surname>Brooks</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>The use of domain-specific concepts in biomedical text summarization</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>43</volume>
          ,
          <issue>6</issue>
          (
          <year>2007</year>
          ),
          <fpage>1765</fpage>
          -
          <lpage>1776</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Dan Klein,
          <string-name>
            <surname>Christopher D Manning</surname>
            , and
            <given-names>Yoram</given-names>
          </string-name>
          <string-name>
            <surname>Singer</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics</source>
          ,
          <fpage>173</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Shiren</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tat-Seng</surname>
            <given-names>Chua</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Min-Yen Kan</surname>
            ,
            <given-names>and Long</given-names>
          </string-name>
          <string-name>
            <surname>Qiu</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Document concept lattice for text understanding and summarization</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>43</volume>
          ,
          <issue>6</issue>
          (
          <year>2007</year>
          ),
          <fpage>1643</fpage>
          -
          <lpage>1662</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>