<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection of Catchphrases and Precedence in Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yogesh H. Kulkarni</string-name>
          <email>yogeshkulkarni@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rishabh Patil</string-name>
          <email>rishabh@rightstepsconsultancy.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Srinivasan Shridharan</string-name>
          <email>srini@rightstepsconsultancy.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consultant, RightSteps Consultancy</institution>
          ,
          <addr-line>Pune</addr-line>
          ,
          <country country="IN">India.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Engineer, RightSteps Consultancy</institution>
          ,
          <addr-line>Pune</addr-line>
          ,
          <country country="IN">India.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Founder, RightSteps Consultancy</institution>
          ,
          <addr-line>Pune</addr-line>
          ,
          <country country="IN">India.</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The “Common Law System” practiced in India refers to statute as well as precedent to form judgments. As the number of cases is increasing rapidly, automation becomes highly desirable. This paper presents two such systems, viz. Automatic Catchphrase Detection and Automatic Precedence Detection.</p>
        <p>Automatic Catchphrase Detection: One of the key requirements of a legal information retrieval system is to pre-populate the database of prior cases with catchphrases for better indexing and faster, more relevant retrieval. This paper proposes automatic catchphrase prediction for cases for this purpose. The problem of catchphrase detection has been modeled as “custom named entity recognition (NER) using conditional random fields (CRF)”. The CRF is trained with pairs of prior cases and their respective catchphrases, the gold standard. The model is then used to predict catchphrases of unseen legal texts. The end of the first section demonstrates the efficacy of the proposed system using a practical data-set.</p>
        <p>Automatic Precedence Detection: With thousands of past cases, finding relevant precedent manually becomes tedious and error-prone. An automatic precedent retrieval system is the need of the hour. One of the key requirements of such a system is to find cases which could be “similar” to the case in hand. The “similarity” used in this paper is about citations: the problem is to predict prior cases which could potentially be cited by a particular case text. This paper proposes such an association system using mixed approaches. It employs rule-based Regular Expressions based on references to statute and Articles. It finds cosine similarity between cases using vectors generated by the popular embedding called doc2vec. It also leverages topic modeling by finding matches between cases based on the number of common topic words. The end of the second section demonstrates the efficacy of the proposed system by generating cite-able documents from the test data-set.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Conditional Random Fields</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Regular Expressions</kwd>
        <kwd>Word Embedding</kwd>
        <kwd>Topic Modeling</kwd>
        <kwd>Legal</kwd>
        <kwd>Word2Vec</kwd>
        <kwd>Text Mining</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Probabilistic retrieval models;
Document topic models; • Computing methodologies →
Information extraction;</p>
    </sec>
    <sec id="sec-2">
      <label>1</label>
      <title>INTRODUCTION</title>
      <p>The Indian judicial system, like many others around the world, is
based on what is called the “Common Law System”, in which both
written law (called “statutes”) and prior cases (called “precedent”) are
given equal importance when forming a judgment. Such a system
brings uniformity to legal decisions across similar situations.</p>
      <p>Court cases, judgments and other legal texts are typically long and
unstructured, making it hard to query relevant information from them
unless someone goes through them manually and vigilantly.
Given the volume of legal text to be processed, it is desirable to
have an automatic system that detects key concepts and catchphrases in
legal texts.</p>
      <p>With the number of cases increasing day by day, it has become
humanly impossible to search relevant past cases for a particular
topic. An Automatic Precedent Retrieval System (APRS) is the need of
the hour. As more and more cases become available in digital form,
text mining has gained immense importance for developing an APRS.</p>
      <p>This paper is divided into two sections. The first deals with
the task of Automatic Catchphrase Detection and the second with
Automatic Precedence Detection.</p>
    </sec>
    <sec id="sec-3">
      <label>2</label>
      <title>AUTOMATIC CATCHPHRASE DETECTION</title>
      <p>The aim of this section is to propose an automatic catchphrase
detection-prediction system for legal text. It uses training data comprising
pairs of texts and their respective catchphrases, the gold standard,
prepared manually by legal experts. The proposed system builds a
probabilistic model from this training data, which, in turn, can
predict the catchphrases in unseen legal texts.</p>
      <p>The contributions made in this system are as follows:</p>
      <list list-type="order">
        <list-item><p>A novel method to prepare the training data needed for CRF.</p></list-item>
        <list-item><p>Feature engineering for better results with CRF.</p></list-item>
      </list>
      <p>This section is structured as follows. Section 2.1 describes the
catchphrase detection task in detail, as the definition of the problem.
Section 2.2 explains the structure of the training data. Section 2.3
elaborates the proposed system: it describes the preparation of the CRF
training data-set and the feature engineering adopted for this custom named
entity recognition (NER) methodology. Section 4 discusses the findings
drawn from this work.</p>
      <p>2.1 Task Definition. Catchphrases are short phrases from within the text of the document.
Catchphrases can be extracted by selecting certain portions from the
text of the document [<xref ref-type="bibr" rid="ref2">2</xref>]. The data-set provided consists of legal texts
and their respective catchphrases, along with test documents for
which the catchphrases need to be extracted or predicted.</p>
    </sec>
    <sec id="sec-4">
      <label>2.2</label>
      <title>Data-set</title>
      <p>The Fire-2017 [<xref ref-type="bibr" rid="ref2">2</xref>] data-set contains the following directories:</p>
      <list list-type="order">
        <list-item><p>Train_docs: contains 100 case statements, case_&lt;i&gt;_statement.txt, where i = 0 → 99. A sample document looks like “R.P. Sethi, J. 1. Aggrieved by the determination of annual . . . ultimate result.”.</p></list-item>
        <list-item><p>Train_catches: contains the gold standard catchwords for each of these 100 case statements, case_&lt;i&gt;_catchwords.txt, where i = 0 → 99. A sample document looks like “Absence, Access, Accident, Account, . . . Vehicle, Vehicles”.</p></list-item>
        <list-item><p>Test_docs: contains 300 test case statements, similar to Train_docs, case_&lt;i&gt;_statement.txt, where i = 100 → 399.</p></list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <label>2.3</label>
      <title>System Description</title>
      <p>2.3.1 Preprocessing. Each of the training statements was
tokenized into a list of tokens. Their Parts-of-Speech (POS) tags were generated
using the Python nltk [<xref ref-type="bibr" rid="ref1">1</xref>] library. Another sequence of custom NER
tags was built by matching the token list against the given catchphrases.
B-LEGAL and I-LEGAL tags were employed for the Beginning and
Intermediate tokens of a catchphrase respectively, and O for other tokens.
The training data file thus looked like:</p>
      <preformat>in         IN   O
the        DT   O
year       NN   O
1987       CD   O
and        CC   O
that       IN   O
property   NN   B-LEGAL
had        VBD  O
extensive  JJ   O
national   JJ   B-LEGAL
highway    NN   I-LEGAL
frontage   NN   O</preformat>
      <p>2.3.2 Modeling. The problem of detecting catchphrases was
modeled as customized NER. The POS and custom NER tags produced
during the pre-processing stage were used to form secondary features,
which were then used to build the CRF model with the CRF++ [<xref ref-type="bibr" rid="ref5">5</xref>] toolkit.
The salient secondary features developed were:</p>
      <list list-type="order">
        <list-item><p>Unigrams:</p>
          <list list-type="alpha-lower">
            <list-item><p>Previous 3 tokens, current token and next 3 tokens</p></list-item>
            <list-item><p>Previous 3 POS tags, current POS tag and next 3 POS tags</p></list-item>
          </list>
        </list-item>
        <list-item><p>Bigram tokens</p></list-item>
      </list>
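      <p>As an illustration of the steps described in 2.3.1 and 2.3.2, the following minimal Python sketch shows one way the custom NER tags and a CRF++ feature template could be produced. It is not the authors' code: the function names bio_tags, crfpp_rows and crfpp_template are hypothetical, matching is assumed to be case-insensitive and exact on whole tokens, the two catchphrases in the usage lines are illustrative values chosen to reproduce the sample rows shown for the training file, and “bigram tokens” is read here as a feature joining the previous and the current token, which is an assumption.</p>
      <preformat>
```python
def bio_tags(tokens, catchphrases, label="LEGAL"):
    """Tag tokens with B-/I-label wherever a catchphrase occurs, else O."""
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    # Longer phrases are matched first so they win over shorter overlapping ones.
    phrases = sorted((p.lower().split() for p in catchphrases), key=len, reverse=True)
    for words in phrases:
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            untagged = all(tag == "O" for tag in tags[start:start + n])
            if untagged and lowered[start:start + n] == words:
                tags[start] = "B-" + label
                for k in range(start + 1, start + n):
                    tags[k] = "I-" + label
    return tags


def crfpp_rows(tokens, pos_tags, tags):
    """Build one 'token POS tag' row per token, the column layout shown in 2.3.1."""
    return ["{} {} {}".format(t, p, g) for t, p, g in zip(tokens, pos_tags, tags)]


def crfpp_template(window=3):
    """Unigram templates over a +/- window of tokens and POS tags."""
    lines = []
    for column in (0, 1):  # column 0 holds tokens, column 1 holds POS tags
        for offset in range(-window, window + 1):
            lines.append("U{:02d}:%x[{},{}]".format(len(lines), offset, column))
    # Assumed reading of "bigram tokens": previous token joined with current token.
    lines.append("U{:02d}:%x[-1,0]/%x[0,0]".format(len(lines)))
    return "\n".join(lines)


tokens = "in the year 1987 and that property had extensive national highway frontage".split()
pos_tags = "IN DT NN CD CC IN NN VBD JJ JJ NN NN".split()
# Illustrative catchphrases, not taken from the gold standard files.
tags = bio_tags(tokens, ["Property", "National Highway"])
print("\n".join(crfpp_rows(tokens, pos_tags, tags)))
print(crfpp_template())
```
      </preformat>
      <p>Under these assumptions, the rows would be written one per line to the training file and the template text to the template file that are passed to CRF++.</p>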
      <sec id="sec-5-1">
        <title>The CRF model was generated using:</title>
        <preformat>crf_learn template_file train_file model_file</preformat>
        <p>The generated model file was then used to predict from the test data:</p>
        <preformat>crf_test -v1 -m model_file test_files</preformat>
        <p>With the -v1 option, the tag with the highest probability is shown for each token along with that probability:</p>
      </sec>
      <sec id="sec-5-2">
        <preformat>Rockwell       NNP  B  B/0.992465
International  NNP  I  I/0.979089
Corp.          NNP  I  I/0.954883</preformat>
        <p>2.3.3 Results. The CRF++ model was used to predict custom
NER tags for the given test data.</p>
      </sec>
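      <p>To turn the predicted tags back into catchphrases, consecutive B and I tags have to be joined into phrases. The following Python sketch shows one possible way to do this on output lines of the form shown for the -v1 option. The function name extract_phrases is hypothetical, the last whitespace-separated column is assumed to hold the predicted tag and its probability separated by a slash, and using the lowest tag probability as the confidence of a phrase is an assumption rather than the method used in this work.</p>
      <preformat>
```python
def extract_phrases(lines):
    """Return (phrase, lowest tag probability) pairs found in crf_test -v1 style lines."""
    phrases = []
    words, probs = [], []

    def flush():
        # Close the phrase being built, if any.
        if words:
            phrases.append((" ".join(words), min(probs)))
            words.clear()
            probs.clear()

    for line in lines:
        columns = line.split()
        # Blank lines separate sequences; lines starting with '#' are not token rows.
        if not columns or columns[0] == "#":
            flush()
            continue
        tag, _, prob = columns[-1].rpartition("/")
        if tag.startswith("B"):
            flush()
            words.append(columns[0])
            probs.append(float(prob))
        elif tag.startswith("I") and words:
            words.append(columns[0])
            probs.append(float(prob))
        else:
            flush()
    flush()
    return phrases


sample = [
    "Rockwell NNP B B/0.992465",
    "International NNP I I/0.979089",
    "Corp. NNP I I/0.954883",
]
print(extract_phrases(sample))
```
      </preformat>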
    </sec>
    <sec id="sec-6">
      <label>3</label>
      <title>AUTOMATIC PRECEDENCE DETECTION</title>
      <p>Given the growing number of cases, an Automatic Precedent Retrieval
System (APRS) is desirable. One of the key functionalities necessary in an APRS is to find
“similar” cases so that they can be cited or referred to from the case
being built. The notion of “similarity” has various connotations. In
the context of the given problem, similar documents are those
which share citations. In other words, the task is to find prior
cases which are potentially cite-able from the case in hand.</p>
      <p>Legal texts are typically lengthy and unstructured in nature. It is
challenging to compute a similarity score between two texts by just
counting words, their frequency distributions, or similar preliminary
statistical measures. Higher-level constructs are needed: word
embeddings to introduce semantic similarity, as well as the higher-level
clusters given by Topic Modeling.</p>
      <p>The aim of this section is to propose an automatic cite-able text
detection-prediction system for legal texts. It is an unsupervised
technique (no labeled training data) comprising Regular Expressions,
Word2Vec and Topic Modeling, each employed to give a similarity
score based on a different aspect of the texts. The final rank is arrived
at using a weighted sum of the individual scores.</p>
      <p>The contributions made in this system are as follows:</p>
      <list list-type="order">
        <list-item><p>Detecting statute and Articles based on Regular Expressions.</p></list-item>
        <list-item><p>Proposing cosine similarity between texts based on vectors generated by Word Embedding (word2vec/doc2vec).</p></list-item>
        <list-item><p>Proposing a Document-Topics-Words distribution for all texts and then scoring similarity based on common topic-words.</p></list-item>
      </list>
      <p>This section is structured as follows. Section 3.1 describes the
precedence detection task in detail, as the definition of the problem.
Section 3.2 explains the structure of the data-set. Section 3.3
elaborates the proposed system, including the ranking method based on a
weighted sum of scores from the individual methods. Section 4 discusses the findings
drawn from this work.</p>
    </sec>
    <sec id="sec-7">
      <label>3.1</label>
      <title>Task Definition</title>
      <p>Legal cases typically cite statute, Articles and previous cases
relevant to them. Thus, it is necessary to form an association or similarity
between documents based on citations, so that it can be leveraged
for precedent retrieval. Given sets of current and prior cases, the
task is: For each document in the first set, the participants are to form
a list of documents from the second set in a way that the cited prior
cases are ranked higher than the other (not cited) documents [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
    </sec>
    <sec id="sec-8">
      <label>3.2</label>
      <title>Data-set</title>
      <p>The Fire-2017 [<xref ref-type="bibr" rid="ref2">2</xref>] data-set contains the following directories:</p>
      <list list-type="order">
        <list-item><p>Current cases: a set of cases for which the prior cases have to be retrieved, current_case_&lt;i&gt;.txt, where i = 0001 → 0200. A sample document looks like “**Judgment** IN THE SUPREME COURT OF INDIA. . . . ”.</p></list-item>
        <list-item><p>Prior cases: contains prior cases that were actually cited in the case decision along with other (not cited) documents, prior_case_&lt;i&gt;.txt, where i = 0001 → 2000. A sample document looks like “ 551; AIR 1996 SC 463; 1995 (6) SCC 315; 1995 (7) JT 225; 1995 (5) SCALE 690 (11 October 1995) JEEVAN REDDY, B.P. (J) JEEVAN REDDY, . . . the assessee. No costs”.</p></list-item>
      </list>
    </sec>
    <sec id="sec-9">
      <label>3.3</label>
      <title>System Description</title>
      <p>The proposed APRS solution uses the following 3 distinct approaches
to determine similarity. The final similarity score is computed as a
weighted average of the scores generated by these 3 approaches.</p>
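      <p>The combination of the three scores can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name rank_prior_cases, the approach names, the weights and the scores are all hypothetical, and the weights actually used in this work are not reproduced here.</p>
      <preformat>
```python
def rank_prior_cases(scores_by_approach, weights):
    """Rank prior cases for one current case by the weighted average of their scores."""
    combined = {}
    for approach, scores in scores_by_approach.items():
        for prior_case, score in scores.items():
            combined[prior_case] = combined.get(prior_case, 0.0) + weights[approach] * score
    # Dividing the weighted sum by the total weight gives a weighted average.
    total_weight = sum(weights[approach] for approach in scores_by_approach)
    ranked = [(case, value / total_weight) for case, value in combined.items()]
    # Highest score first; ties are broken by case name to keep the order stable.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Hypothetical weights and scores, for illustration only.
weights = {"regex": 0.5, "topics": 0.25, "doc2vec": 0.25}
scores = {
    "regex": {"prior_case_a": 1.0, "prior_case_b": 0.0},
    "topics": {"prior_case_a": 0.5, "prior_case_b": 0.25},
    "doc2vec": {"prior_case_a": 0.5, "prior_case_b": 0.75},
}
for rank, (case, score) in enumerate(rank_prior_cases(scores, weights)):
    print(rank, case, score)
```
      </preformat>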
      <p>3.3.1 Regular Expressions based. Cases refer to statute in the
form of legal Articles, such as Article 270, 370, etc. Such
references can be extracted using Regular Expressions. In this system,
patterns like r'article (\d+)' are used for both current and
prior cases. For a current case, all the prior cases
which refer to the same Articles are collected.</p>
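      <p>A minimal Python sketch of this rule-based step is given below. It uses the pattern quoted above with Python's re module. The function names are hypothetical, and case-insensitive matching is an assumption (the original texts may instead have been lower-cased beforehand).</p>
      <preformat>
```python
import re

# Pattern from the text; IGNORECASE is an assumption of this sketch.
ARTICLE_PATTERN = re.compile(r"article (\d+)", re.IGNORECASE)


def cited_articles(text):
    """Return the set of Article numbers referred to in a case text."""
    return set(ARTICLE_PATTERN.findall(text))


def prior_cases_sharing_articles(current_text, prior_texts):
    """Map each prior case name to the Articles it shares with the current case."""
    current = cited_articles(current_text)
    matches = {}
    for name, text in prior_texts.items():
        shared = current.intersection(cited_articles(text))
        if shared:
            matches[name] = shared
    return matches


# Made-up snippets, for illustration only.
current = "The petition under Article 226 also relies on article 14."
priors = {
    "prior_case_a": "The High Court exercised its powers under Article 226.",
    "prior_case_b": "Article 19 was held to be applicable.",
}
print(prior_cases_sharing_articles(current, priors))
```
      </preformat>
      <p>Note that a pattern of this shape captures only the number that directly follows the word, so in a reference such as “Article 270, 370” the second number would need an extended pattern.</p>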
      <p>3.3.2 Topic Modeling based. The basic premise of this approach is
that if most of the topics extracted from two documents match, then
the documents are similar or cite-able. In this system, the Document-Topics-Words
distribution is generated by the Latent Dirichlet Allocation (LDA)
algorithm in the gensim library [<xref ref-type="bibr" rid="ref4">4</xref>]. The similarity score is calculated
as the ratio of matching topic-words to the total.</p>
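      <p>The scoring step can be illustrated with a short Python sketch. The topic words are passed in as plain lists here, standing in for the words that the LDA model would assign to each document. The function name topic_word_score is hypothetical, and “the total” is interpreted as the number of distinct topic-words across both documents, which is an assumption, since the text does not define the denominator.</p>
      <preformat>
```python
def topic_word_score(current_words, prior_words):
    """Ratio of shared topic-words to all distinct topic-words of the two cases."""
    current, prior = set(current_words), set(prior_words)
    total = len(current.union(prior))  # assumed meaning of "the total"
    if total == 0:
        return 0.0
    return len(current.intersection(prior)) / total


# Made-up topic words, for illustration only.
print(topic_word_score(["tax", "income", "appeal"], ["tax", "appeal", "land"]))
```
      </preformat>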
      <p>3.3.3 Doc2Vec Similarity based. Word2vec has emerged as one of
the most popular vectorization techniques based on semantic similarity [<xref ref-type="bibr" rid="ref3">3</xref>].
The process to generate document vectors (based on word2vec) was:</p>
      <list list-type="order">
        <list-item><p>Each case was obtained as cleaned text and split into a list of words/tokens, for both current and prior cases.</p></list-item>
        <list-item><p>A gensim TaggedDocument was created for each case text, with the filename as its tag.</p></list-item>
        <list-item><p>A map from tag to content, i.e. the word-list of each case, was generated and saved for reuse.</p></list-item>
        <list-item><p>The Doc2Vec model was built and saved. It was used to generate document vectors for both current and prior cases.</p></list-item>
      </list>
      <p>A similarity matrix was generated with current cases as rows
and prior cases as columns; each value is the cosine similarity
between the document vectors of the current-prior case pair (row-column).
These values act as the score for this particular approach.</p>
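      <p>The construction of this matrix can be sketched in plain Python as follows. The document vectors are assumed to be already available as lists of numbers keyed by case name; the function names and the toy vectors are hypothetical and serve only to show the computation.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(current_vectors, prior_vectors):
    """Nested dict: rows are current cases, columns are prior cases."""
    return {
        current: {prior: cosine(cv, pv) for prior, pv in prior_vectors.items()}
        for current, cv in current_vectors.items()
    }


# Toy two-dimensional vectors, for illustration only.
current_vectors = {"current_case_a": [1.0, 0.0]}
prior_vectors = {"prior_case_a": [2.0, 0.0], "prior_case_b": [0.0, 3.0]}
print(similarity_matrix(current_vectors, prior_vectors))
```
      </preformat>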
      <p>3.3.4 Results. Each current-prior case pair has a final score based
on the weighted sum of the scores from the individual approaches mentioned
above. Due to the lack of labeled training data, the weights were
decided heuristically. The results were presented as a sorted list of prior
cases for each current case, and looked as follows:</p>
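      <table-wrap>
        <table>
          <thead>
            <tr><th>Current case</th><th>Prior case</th><th>Rank</th></tr>
          </thead>
          <tbody>
            <tr><td>current_case_0001</td><td>prior_case_0780</td><td>0</td></tr>
            <tr><td>current_case_0001</td><td>prior_case_1256</td><td>1</td></tr>
            <tr><td>current_case_0001</td><td>prior_case_0838</td><td>2</td></tr>
            <tr><td>. . .</td><td/><td/></tr>
            <tr><td>current_case_0116</td><td>prior_case_1533</td><td/></tr>
            <tr><td>current_case_0116</td><td>prior_case_0411</td><td/></tr>
            <tr><td>current_case_0116</td><td>prior_case_1600</td><td/></tr>
            <tr><td>. . .</td><td/><td/></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>. . . problem. The CRF algorithm was chosen with POS
and custom NER tags as primary features and numerous secondary features
representing the context. As future work, if sufficient gold standard data
is available, one can explore more sophisticated techniques such
as Long Short Term Memory networks (LSTM), where custom
features need not be provided but are generated internally.</p>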
      <p>In the second part, a brief overview of the Automatic Citation
Prediction System was presented to discover cite-able prior cases. It
was found that the problem of citation detection needs to be
modeled with a mixed approach, employing rule-based, machine learning
and deep learning based methods, rather than a simple cosine
similarity over tf-idf (term frequency-inverse document frequency)
vectors. A weighted sum of the scores from the individual approaches was
computed. A threshold cut-off was decided to prune out irrelevant
cite-able prior cases. Current cases and their predicted cite-able prior
cases were presented along with the corresponding ranking scores.</p>
    </sec>
    <sec id="sec-10">
      <title>VITAE</title>
      <p>Yogesh H. Kulkarni works as Data Science Consultant and Trainer.
Profile: https://www.linkedin.com/in/yogeshkulkarni/</p>
      <p>Rishabh Patil works as Data Engineer.
Profile: https://www.linkedin.com/in/rishabh-patil-256a25124/</p>
      <p>Srinivasan Shridharan is a Data Scientist and entrepreneur.
Profile: https://www.linkedin.com/in/srinivasan-shridharan-08a86a6/</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors wish to thank Ankur Parikh, a keen researcher of Deep Learning
and NLP, for discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ewan</given-names>
            <surname>Klein</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Natural Language Toolkit</article-title>
          . (Sep 2009). http://www.nltk.org/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>IRSI</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          .
          <source>Information Retrieval Society of India</source>
          . (Dec 2017). https://sites.google.com/view/fire2017irled/track-description
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Chris</given-names>
            <surname>McCormick</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Word2Vec Tutorial - The Skip-Gram Model</article-title>
          . (Apr 2016). http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Radim</given-names>
            <surname>Rehurek</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Gensim: Topic Modeling for humans</article-title>
          . (Sep 2017). https://radimrehurek.com/gensim/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Taku</given-names>
            <surname>Kudo</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>CRF++: Yet Another CRF toolkit</article-title>
          . (Sep 2017). https://taku910.github.io/crfpp/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>