=Paper=
{{Paper
|id=Vol-2036/T3-9
|storemode=property
|title=A Text Similarity Approach for Precedence Retrieval from Legal Documents
|pdfUrl=https://ceur-ws.org/Vol-2036/T3-9.pdf
|volume=Vol-2036
|authors=D. Thenmozhi,Kawshik Kannan,Chandrabose Aravindan
|dblpUrl=https://dblp.org/rec/conf/fire/ThenmozhiKA17
}}
==A Text Similarity Approach for Precedence Retrieval from Legal Documents==
D. Thenmozhi, Kawshik Kannan, Chandrabose Aravindan
SSN College of Engineering, Chennai, Tamilnadu
theni_d@ssn.edu.in, kawshik98@gmail.com, aravindanc@ssn.edu.in

ABSTRACT

Precedence retrieval of legal documents is an information retrieval task to retrieve prior case documents that are related to a given case document. This helps in automatic linking of related documents to ensure that identical situations are treated similarly in every case. Several methodologies, such as information extraction based on natural language processing, rule-based methods, and machine learning techniques, are used to retrieve the prior cases with respect to the current case. In this paper, we propose a text similarity approach for precedence retrieval that retrieves older cases similar to a given case from a set of legal documents. Lexical features are extracted from all the legal documents, and the similarity between each current case document and all the prior case documents is determined using cosine similarity scores. The list of prior case documents is ranked based on the similarity scores for each current case document. We have evaluated our approach using the data set given by the IRLeD@FIRE2017 shared task.

KEYWORDS

Precedence Retrieval, Information Retrieval, Document Similarity, Legal Documents

1 INTRODUCTION

Precedence retrieval is the process of retrieving relevant prior documents with respect to a current document. This is very important in a common law system, where a prior case which discusses similar issues can be used as a reference in the current case. This ensures that identical situations are treated similarly in every case. Recently, the number of digitally available legal documents has increased rapidly due to the developments in information technology. An automatic precedence retrieval system for legal documents helps legal practitioners to easily refer to the earlier cases that are related to the current case. Such a precedence retrieval system has several applications such as case-based reasoning [2][8], legal citations, and legal information retrieval [9]. Several approaches, such as information extraction based on natural language processing [4], rule-based approaches [3], and machine learning techniques [1], are used to retrieve the prior cases with respect to the current case. We propose to use a text/document similarity approach for precedence retrieval to retrieve relevant older cases for the current case from legal documents. In this work, we have focused on the shared task of IRLeD@FIRE2017 (https://sites.google.com/view/fire2017irled) [6], which aims to retrieve prior case documents for a given current case document. IRLeD@FIRE2017 is a shared task on Information Retrieval from Legal Documents (IRLeD) collocated with the Forum for Information Retrieval Evaluation (FIRE), 2017. The track has two tasks. Given a set of training cases with annotated catchphrases and a set of test cases, the first task is to extract the catchphrases present in the test cases. The second task is to retrieve all the relevant prior cases for a given current case. Our focus is on the second task of IRLeD@FIRE2017.

2 PROPOSED APPROACH

We have implemented a document similarity approach for this IRLeD precedence retrieval task. We have used three variations of our approach, namely (i) Method-1 with concepts and TF-IDF (Term Frequency - Inverse Document Frequency) scores, (ii) Method-2 with concepts, relations and TF-IDF scores, and (iii) Method-3 with concepts, relations and Word2Vec. We have implemented our methodology in Python for this IRLeD task. The data set used to evaluate Task 2 (the precedence retrieval task) of the IRLeD shared task consists of 200 current case documents and 2000 prior case documents. The steps used in our approach are given below.

• Preprocess the given text
• Extract linguistic features from both current case documents and prior case documents
• Construct feature vectors for the documents using TF-IDF scores or Word2Vec
• Find the cosine similarity score between each current case and all the prior cases
• Rank the prior cases based on the similarity score for each current case

The steps used in all the three methods are explained in detail in the sequel.
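The last two steps, similarity scoring and ranking, are shared by all three methods. The following minimal sketch shows one way they could be implemented with scikit-learn, assuming the current case and prior case feature vectors have already been built as the rows of two matrices; the function and variable names are illustrative and not taken from our actual code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rank_prior_cases(current_vectors, prior_vectors, prior_ids):
    """Rank the prior cases for every current case by cosine similarity.

    current_vectors : matrix with one row per current case document
    prior_vectors   : matrix with one row per prior case document
    prior_ids       : identifiers of the prior case documents
    Returns one ranked list of prior case ids per current case,
    ordered from most similar to least similar.
    """
    # Similarity of every current case against every prior case.
    scores = cosine_similarity(current_vectors, prior_vectors)
    rankings = []
    for row in scores:
        # Sort the prior cases by descending similarity score.
        order = np.argsort(-row)
        rankings.append([prior_ids[i] for i in order])
    return rankings
```

For the shared task, each of the 200 current cases would thus produce a ranked list over the 2000 prior case documents.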
2.1 Method-1 with concepts and TF-IDF scores

The prior case documents and the current case documents are preprocessed by removing punctuation marks such as quotation marks, hyphens, apostrophes and underscores, and by removing the string '[?CITATION?]' which is part of the text. The text is annotated with parts of speech (POS) information such as noun, verb, determiner, adverb, and adjective. In this method, only nouns are considered to obtain the concepts. All forms of nouns (NN*), namely NN, NNS and NNP, are extracted from both the current case text and the prior case text and are lemmatized. The feature set is constructed by eliminating all duplicate terms from the lemmatized terms. The feature vector for each document is constructed using TF-IDF scores with respect to the features from the feature set. The cosine similarity scores between each current case document and all the prior case documents are determined. The prior case documents are ranked based on the similarity score and are retrieved for each current case document.

We have used the NLTK toolkit (http://www.nltk.org/) to preprocess and annotate the given data with POS information. The extracted concepts are lemmatized using the WordNet Lemmatizer. The TF-IDF scores are obtained for the features by using the scikit-learn (http://scikit-learn.org/) library (TfidfVectorizer from sklearn.feature_extraction.text). The similarity between each current case and the prior cases is obtained using cosine_similarity from sklearn.metrics.pairwise. The prior cases for each current case are ranked based on the similarity scores (the prior case with the highest similarity score is retrieved first).
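The sketch below shows how the feature extraction and TF-IDF vectorization of Method-1 could be assembled with NLTK and scikit-learn, assuming the required NLTK resources (tokenizer, POS tagger and WordNet data) have been downloaded; the helper names and the exact cleaning regular expression are illustrative, not the actual implementation.

```python
import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def extract_terms(text, tag_prefixes=("NN",)):
    """Return lemmatized terms whose POS tag starts with one of the prefixes.

    Method-1 keeps only noun tags (NN*); tag_prefixes=("NN", "VB")
    yields the concept-plus-relation features used in Method-2.
    """
    # Remove the citation placeholder and simple punctuation noise.
    text = text.replace("[?CITATION?]", " ")
    text = re.sub(r"[“”‘’\"',_-]", " ", text)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(lemmatizer.lemmatize(word.lower())
                    for word, tag in tagged
                    if tag.startswith(tag_prefixes))

def build_tfidf_vectors(current_docs, prior_docs, tag_prefixes=("NN",)):
    """Build TF-IDF feature vectors over the extracted terms of all documents."""
    terms = [extract_terms(d, tag_prefixes) for d in current_docs + prior_docs]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(terms)
    # The first len(current_docs) rows are current cases, the rest prior cases.
    return matrix[:len(current_docs)], matrix[len(current_docs):]
```

The two returned matrices can then be passed to the ranking sketch given after the step list in Section 2.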
2.2 Method-2 with concepts, relations and TF-IDF scores

In Method-2, we have considered both concepts and relations as features. All forms of nouns (NN*), namely NN, NNS and NNP, are extracted to obtain the concepts, and all forms of verbs (VB*), namely VB, VBZ, VBN and VBD, are extracted to obtain the relations, from the POS information of both the current cases and the prior cases. The other steps, namely lemmatization, construction of feature vectors using TF-IDF, finding cosine similarity and ranking, are the same as in Method-1.

2.3 Method-3 with concepts, relations and Word2Vec

In Method-3, the key terms are extracted by using concepts and relations for each case from the current set and the prior set. The terms with respect to a particular case are lemmatized and vectorized into an array of 300 dimensions using Word2Vec [7]. The average of all the term vectors of the document is determined, and that average represents the vector for the document. Likewise, the vector representations of all the prior case documents and the current case documents are obtained. Similar to the other two methods, the cosine similarity scores between each current case document and all the prior case documents are determined. The prior case documents are ranked based on the similarity scores and are retrieved for each current case document.

In this method, the key terms are obtained by extracting the terms that are tagged with NN, NNS, NNP, VB, VBN, VBZ and VBD. Each key term is lemmatized using the WordNet Lemmatizer and vectorized using Word2Vec (KeyedVectors.load_word2vec_format from gensim.models.keyedvectors) with 300 dimensions. We have used GoogleNews-vectors-negative300.bin.gz (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) for this vectorization.
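A minimal sketch of the averaged Word2Vec document representation used in Method-3 is given below, assuming the pretrained GoogleNews vectors file has been downloaded locally; the path and function names are illustrative.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pretrained 300-dimensional GoogleNews vectors (local path shown as an example).
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

def document_vector(key_terms, dim=300):
    """Average the Word2Vec vectors of a document's key terms.

    key_terms is the list of lemmatized NN*/VB* terms of one case document;
    out-of-vocabulary terms are skipped, and a zero vector is returned
    if none of the terms are in the vocabulary.
    """
    vectors = [word_vectors[t] for t in key_terms if t in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```

The resulting document vectors for the current and prior cases can again be scored with cosine_similarity and ranked exactly as in the first sketch.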
3 RESULTS AND DISCUSSIONS

We have evaluated our document similarity approach for precedence retrieval of legal documents using the metrics Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Precision@10 and Recall@10. The results of our approach are given in Table 1.

Table 1: IRLeD Task 2 Performance

Method     MAP      MRR      Precision@10   Recall@10
Method 1   0.2633   0.5176   0.1795         0.681
Method 2   0.2677   0.5457   0.178          0.669
Method 3   0.101    0.277    0.0755         0.435

Method-2, which considers both concepts and relations from the text as features, performs better than the other methods in terms of mean average precision and mean reciprocal rank, with values of 0.2677 and 0.5457 respectively. Method-1, which considers only concepts as features, gives better results for Precision@10 and Recall@10, with values of 0.1795 and 0.681 respectively. However, our third method does not perform well for this precedence retrieval of legal documents. The average of vectors used to represent the documents may not be a suitable solution. The performance may be improved if we use Doc2Vec [5], an extension of Word2Vec for vector representation.

ACKNOWLEDGMENTS

We would like to thank the management of SSN Institutions for funding the High Performance Computing (HPC) lab where this work is being carried out.

REFERENCES

[1] Khalid Al-Kofahi, Alex Tyrrell, Arun Vachher, and Peter Jackson. 2001. A machine learning approach to prior case retrieval. In Proceedings of the 8th International Conference on Artificial Intelligence and Law. ACM, 88-93.
[2] Ramon Lopez De Mantaras, David McSherry, Derek Bridge, David Leake, Barry Smyth, Susan Craw, Boi Faltings, Mary Lou Maher, Michael T. Cox, Kenneth Forbus, et al. 2005. Retrieval, reuse, revision and retention in case-based reasoning. The Knowledge Engineering Review 20, 3 (2005), 215-240.
[3] Filippo Galgani, Paul Compton, and Achim Hoffmann. 2015. Lexa: Building knowledge bases for automatic legal citation classification. Expert Systems with Applications 42, 17 (2015), 6391-6407.
[4] Peter Jackson, Khalid Al-Kofahi, Alex Tyrrell, and Arun Vachher. 2003. Information extraction from case law and retrieval of prior cases. Artificial Intelligence 150, 1-2 (2003), 239-290.
[5] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188-1196.
[6] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Saptarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
[7] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[8] Chieh-Yuan Tsai and Chuang-Cheng Chiu. 2009. Developing a Significant Nearest Neighbor Search Method for Effective Case Retrieval in a CBR System. In Computer Science and Information Technology - Spring Conference, 2009 (IACSITSC'09). IEEE, 262-266.
[9] Marc Van Opijnen and Cristiana Santos. 2017. On the concept of relevance in legal information retrieval. Artificial Intelligence and Law 25, 1 (2017), 65-87.