=Paper= {{Paper |id=Vol-2036/T3-9 |storemode=property |title=A Text Similarity Approach for Precedence Retrieval from Legal Documents |pdfUrl=https://ceur-ws.org/Vol-2036/T3-9.pdf |volume=Vol-2036 |authors=D. Thenmozhi,Kawshik Kannan,Chandrabose Aravindan |dblpUrl=https://dblp.org/rec/conf/fire/ThenmozhiKA17 }} ==A Text Similarity Approach for Precedence Retrieval from Legal Documents== https://ceur-ws.org/Vol-2036/T3-9.pdf
                 A Text Similarity Approach for Precedence Retrieval
                                from Legal Documents
                   D. Thenmozhi                             Kawshik Kannan                         Chandrabose Aravindan
           SSN College of Engineering                   SSN College of Engineering                  SSN College of Engineering
              Chennai, Tamilnadu                           Chennai, Tamilnadu                          Chennai, Tamilnadu
              theni_d@ssn.edu.in                          kawshik98@gmail.com                         aravindanc@ssn.edu.in

ABSTRACT
Precedence retrieval of legal documents is an information retrieval task to retrieve prior
case documents that are related to a given case document. This helps in automatic linking of
related documents to ensure that identical situations are treated similarly in every case.
Several methodologies, such as information extraction based on natural language processing,
rule-based methods, and machine learning techniques, are used to retrieve the prior cases
with respect to the current case. In this paper, we propose a text similarity approach for
precedence retrieval to retrieve older cases that are similar to a given case from a set of
legal documents. Lexical features are extracted from all the legal documents and the
similarity between each current case document and all the prior case documents is determined
using cosine similarity scores. The list of prior case documents is ranked based on the
similarity scores for each current case document. We have evaluated our approach using the
data set given by the IRLeD@FIRE2017 shared task.

KEYWORDS
Precedence Retrieval, Information Retrieval, Document Similarity, Legal Documents

1 INTRODUCTION
Precedence retrieval is the process of retrieving relevant prior documents with respect to a
current document. This is very important in a common law system, where a prior case which
discusses similar issues can be used as a reference in the current case. This is to ensure
that identical situations are treated similarly in every case. Recently, the number of
digitally available legal documents has increased rapidly due to the developments in
information technology. An automatic precedence retrieval system from legal documents helps
legal practitioners to easily refer to the earlier cases that are related to the current
case. Such a precedence retrieval system has several applications such as case based
reasoning [2][8], legal citations and legal information retrieval [9]. Several approaches,
such as information extraction based on natural language processing [4], rule-based
approaches [3], and machine learning techniques [1], are used to retrieve the prior cases
with respect to the current case. We propose to use a text/document similarity approach for
precedence retrieval to retrieve relevant older cases for the current case from legal
documents. In this work, we have focused on the shared task of IRLeD@FIRE2017¹ [6] which
aims to retrieve prior case documents for a given current case document. IRLeD@FIRE2017 is a
shared task on Information Retrieval from Legal Documents (IRLeD) collocated with the Forum
for Information Retrieval Evaluation (FIRE), 2017. The track has two tasks. Given a set of
training cases with annotated catchphrases and a set of test cases, the first task is to
extract the catchphrases present in the test cases. The second task is to retrieve all the
relevant prior cases for a given current case. Our focus is on the second task of
IRLeD@FIRE2017.

1 https://sites.google.com/view/fire2017irled

2 PROPOSED APPROACH
We have implemented a document similarity approach for this IRLeD precedence retrieval task.
We have used three variations of our approach, namely i. Method-1 with concepts and TF-IDF
(Term Frequency - Inverse Document Frequency) scores, ii. Method-2 with concepts, relations
and TF-IDF scores, and iii. Method-3 with concepts, relations and Word2Vec. We have
implemented our methodology in Python for this IRLeD task. The data set used to evaluate
Task 2 (precedence retrieval task) of the IRLeD shared task consists of 200 current case
documents and 2000 prior case documents. The steps used in our approach are given below.

    • Preprocess the given text
    • Extract linguistic features from both current case documents and prior case documents
    • Construct feature vectors for the documents using TF-IDF scores or Word2Vec
    • Find the cosine similarity score between each current case and all the prior cases
    • Rank prior cases based on the similarity score for each current case

The steps used in all the three methods are explained in detail in the sequel.

2.1 Method-1 with concepts and TF-IDF scores
The prior case documents and the current case documents are preprocessed by removing
punctuation marks such as “, ”, -, ‘, ’, _, and the string ‘[?CITATION?]’, which is part of
the text. The text is annotated with parts of speech (POS) information such as noun, verb,
determiner, adverb, and adjective. In this method, only nouns are considered to obtain the
concepts. All forms of nouns (NN*), namely NN, NNS and NNP, are extracted from both current
case text and prior case text and are lemmatized. The feature set is constructed by
eliminating all duplicate terms from the lemmatized terms. The feature vector for each
document is constructed using TF-IDF scores with respect to the features from the feature
set. The cosine similarity scores between each current case document and
all the prior case documents are determined. The prior case docu-                               Table 1: IRLeD Task 2 Performance
ments are ranked based on the similarity score and are retrieved
for each current case document.                                                        Method       MAP        MRR      Precision@10      Recall@10
   We have used NLTK tool kit2 to preprocess and annotate the
                                                                                      Method 1      0.2633    0.5176        0.1795           0.681
given data with POS information. The extracted concepts from
                                                                                      Method 2      0.2677    0.5457         0.178           0.669
POS information are lemmatized using Wordnet Lemmatizer. The
                                                                                      Method 3      0.101     0.277         0.0755           0.435
TF-IDF scores are obtained for the features by using scikit-learn3
library (TfidfVectorizer from sklearn.feature_extraction.text). The
similarity between each current case and the prior cases are ob-                 of mean average precision and mean reciprocal rank with the val-
tained using scikit cosine_similarity from sklearn.metrics.pairwise.             ues 0.2677 and 0.5457 respectively. Method-1 that considers only
The prior cases for each current case are ranked based on the sim-               concepts as features gives better results for precision@10 and re-
ilarity scores (the prior case with highest similarity score is re-              call@10 with the values 0.1795 and 0.681 respectively. However,
trieved first).                                                                  our third method does not perform well for this precedence re-
                                                                                 trieval of legal documents. The average of vectors used to represent
2.2 Method-2 with concepts, relations and                                        the documents may not be a suitable solution. The performance
    TF-IDF scores                                                                may be improved if we use Doc2Vec [5], an extension of Word2Vec
In Method-2, we have considered both concepts and relations as                   for vector representation.
features. All forms of nouns (NN*) namely NN, NNS and NNP to
obtain the concepts and all forms of verbs (VB*) namely VB, VBZ,                 ACKNOWLEDGMENTS
VBN, and VBD to obtain the relations are extracted from both cur-                We would like to thank the management of SSN Institutions for
rent cases and prior cases POS information. The other steps like                 funding the High Performance Computing (HPC) lab where this
lemmatization, construction of feature vectors using TF-IDF, find-               work is being carried out.
ing cosine similarity and ranking are similar to Method-1.
                                                                                 REFERENCES
2.3 Method-3 with concepts, relations and                            [1] Khalid Al-Kofahi, Alex Tyrrell, Arun Vachher, and Peter Jackson. 2001. A machine
                                                                         learning approach to prior case retrieval. In Proceedings of the 8th international
    Word2Vec                                                             conference on Artificial intelligence and law. ACM, 88–93.
In Method-3, the key terms are extracted by using concepts and re-   [2] Ramon Lopez De Mantaras, David McSherry, Derek Bridge, David Leake, Barry
                                                                         Smyth, Susan Craw, Boi Faltings, Mary Lou Maher, MICHAEL T COX, Kenneth
lations for each case from current set and prior set. The terms with     Forbus, et al. 2005. Retrieval, reuse, revision and retention in case-based reason-
respect to a particular case are lemmatized and vectorized into an       ing. The Knowledge Engineering Review 20, 3 (2005), 215–240.
array of dimensions 300 using Word2Vec [7]. The average of all       [3] Filippo Galgani, Paul Compton, and Achim Hoffmann. 2015. Lexa: Building
                                                                         knowledge bases for automatic legal citation classification. Expert Systems with
the term vectors of the document is determined and that average          Applications 42, 17 (2015), 6391–6407.
represents the vector for the document. Likewise, the vector repre-  [4] Peter Jackson, Khalid Al-Kofahi, Alex Tyrrell, and Arun Vachher. 2003. Informa-
sentations of all the prior case documents and the current case doc-     tion extraction from case law and retrieval of prior cases. Artificial Intelligence
                                                                         150, 1-2 (2003), 239–290.
uments are obtained. Similar to the other two methods, the cosine    [5] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and
similarity scores between each current case document and all the         documents. In Proceedings of the 31st International Conference on Machine Learn-
                                                                         ing (ICML-14). 1188–1196.
prior case documents are determined. The prior case documents        [6] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Sap-
are ranked based on the similarity scores and are retrieved for each     tarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from
current case document.                                                   Legal Documents (IRLeD). In Working notes of FIRE 2017 - Forum for Information
                                                                         Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
   In this method, the key terms are obtained by extracting the      [7] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient esti-
terms that are tagged with NN, NNS, NNP, VB, VBN, VBZ and VBD.           mation of word representations in vector space. arXiv preprint arXiv:1301.3781
Each key term is lemmatized using Wordnet Lemmatizer and vec-            (2013).
                                                                     [8] Chieh-Yuan Tsai and Chuang-Cheng Chiu. 2009. Developing a Significant Nearest
torized using Word2Vec KeyedVectors.load_word2vec_format from            Neighbor Search Method for Effective Case Retrieval in a CBR System. In Com-
gensim.models.keyedvectors with 300 dimensions. We have used GoogleNews- puter Science and Information Technology-Spring Conference, 2009. IACSITSC’09.
                                                                         International Association of. IEEE, 262–266.
vectors-negative300.bin.gz 4 for this vectorization.                 [9] Marc Van Opijnen and Cristiana Santos. 2017. On the concept of relevance in
                                                                         legal information retrieval. Artificial Intelligence and Law 25, 1 (2017), 65–87.
3 RESULTS AND DISCUSSIONS
We have evaluated our document similarity approach for prece-
dence retrieval of legal documents based on the metrics namely
Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Pre-
cision@10 and Recall@10. The results of our approach are given
in Table 1.
   Method-2 which considers both concepts and relations from the
text as features performs better than the other methods in terms
2 http://www.nltk.org/
3 http://scikit-learn.org/
4 https://github.com/mmihaltz/word2vec-GoogleNews-vectors
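
As an illustration of the ranking steps described for Method-1 and Method-2 (TF-IDF feature vectors, cosine similarity, ranking), the following is a minimal sketch and not the implementation used in the paper. The paper relies on scikit-learn's TfidfVectorizer and cosine_similarity; this sketch swaps in a plain-Python TF-IDF with a smoothed IDF so that it runs without external libraries. The function names, the choice to compute IDF over the current case together with the prior cases, and the toy term lists are all assumptions made for illustration. The input is assumed to be already preprocessed, POS-filtered and lemmatized.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into sparse {term: weight} TF-IDF vectors."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant, not necessarily the paper's weighting).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_prior_cases(current_terms, prior_terms):
    """Return prior case ids ordered by decreasing similarity to the current case."""
    ids = list(prior_terms)
    vectors = tfidf_vectors([current_terms] + [prior_terms[i] for i in ids])
    query, prior_vectors = vectors[0], vectors[1:]
    scored = [(cosine(query, vec), case_id) for vec, case_id in zip(prior_vectors, ids)]
    # Highest score first; ties are broken by case id to keep the order deterministic.
    return [case_id for _, case_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Invented toy data: lemmatized nouns standing in for real case documents.
priors = {
    "prior_1": ["contract", "breach", "damage", "party"],
    "prior_2": ["murder", "evidence", "witness"],
    "prior_3": ["contract", "party", "arbitration"],
}
current = ["contract", "breach", "party"]
ranking = rank_prior_cases(current, priors)
print(ranking)
```

For Method-2 the same sketch applies unchanged; only the input term lists would additionally contain the lemmatized verbs.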

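
The document representation described for Method-3 (the average of the term vectors of a document, compared by cosine similarity) can be sketched as follows. This is a hedged illustration only: the three-dimensional vectors are invented stand-ins for the 300-dimensional GoogleNews Word2Vec vectors, the function names are hypothetical, and skipping terms that have no vector is an assumption, since the paper does not say how out-of-vocabulary terms are handled.

```python
import math


def average_vector(terms, embeddings):
    """Average the vectors of all terms that have an embedding.

    Terms without an embedding are skipped (an assumption of this sketch).
    Returns None when no term of the document has an embedding.
    """
    found = [embeddings[t] for t in terms if t in embeddings]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented toy embeddings; a real run would load pretrained Word2Vec vectors.
toy_embeddings = {
    "contract": [1.0, 0.0, 0.0],
    "breach": [0.8, 0.2, 0.0],
    "murder": [0.0, 1.0, 0.0],
    "witness": [0.0, 0.8, 0.2],
}

current = average_vector(["contract", "breach"], toy_embeddings)
prior_a = average_vector(["breach", "contract", "unknownterm"], toy_embeddings)
prior_b = average_vector(["murder", "witness"], toy_embeddings)

print(cosine(current, prior_a), cosine(current, prior_b))
```

Because averaging discards term order and term frequency weighting, two documents with the same set of known terms receive identical vectors, as prior_a and the current case do here. This is consistent with the caution in the discussion that averaged vectors may not be a suitable document representation.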