Removing Named Entities to Find Precedent Legal Cases

Ravina More¹, Jay Patil², Abhishek Palaskar² and Aditi Pawde¹

¹ Tata Consultancy Services, Tata Research Development and Design Centre, Pune, India
² College of Engineering, Pune, India
ravina.m@tcs.com

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.

Abstract. In this paper, we present the solution of the team TRDDC Pune for the Artificial Intelligence for Legal Assistance (AILA) track Task 1 on Precedent Retrieval at FIRE 2019. The task was to identify relevant prior legal cases for a legal query from a dataset of 2,914 documents of cases judged in the Supreme Court of India. We used Named Entity Recognition to pre-process the case documents and the input query, and then ranked the prior case documents using the TF-IDF and BM25 algorithms. The results of our approach are comparable to the top-ranked run on the task leaderboard.

Keywords: Legal Analytics, Information Retrieval, Legal Precedents, Named Entity Recognition, TF-IDF, BM25

1 Introduction

In countries following the 'Common Law System' (e.g. the UK, USA, Canada, Australia, and India), prior cases, also known as Precedents, are a primary repository of information for lawyers. By understanding how the Court has dealt with similar scenarios in the past, a lawyer can prepare the legal reasoning accordingly.

When a lawyer is presented with a new case, he or she has to go through the Precedents to find out where the legal problem fits and what the outcome of similar cases was in the past. Going through all the Precedents manually involves scanning a large repository, reading through the cases, and finding the most relevant part of each case document. This process is time-consuming. Thus, it is beneficial to have a system that can automatically and efficiently search for a case of interest and find the most relevant Precedents. We present here our solution, which uses Natural Language Processing and Information Retrieval techniques to find relevant Precedents for a given query, for the FIRE 2019 [1] challenge Task 1 of identifying relevant prior cases.

2 Related Work

In the past, substantial work has been done on designing and constructing corpora of legal cases for legal retrieval. Ontologies and Natural Language Processing are being used to extract case factors and participant roles [2]. Yang et al. [3] demonstrate an approach to querying search engines using a document. Our problem statement is similar to theirs, as it involves querying using a set of sentences. Their approach extracts and scores key phrases from the query, expands them with related key phrases, and uses these in a search engine to find documents containing these concepts. While their approach is based on finding noun key phrases in the query, we are more interested in the overall situation described by a given query. We took inspiration from their work to select interesting portions of the query and rank case documents based on them.

3 Problem and Data Description

The Artificial Intelligence for Legal Assistance (AILA) track had two sub-tasks. Sub-task 1 was about identifying relevant prior cases. The participants were provided with 2,914 case documents that were judged in the Supreme Court of India, and with 50 legal queries, each describing a situation.
The task was to retrieve the most relevant Precedents among the 2,914 case documents for a given query. A set of 2-3 relevant case documents per query was provided for the first 10 queries as training data. The participants had to perform relevance ranking for the remaining 40 queries; refer to [1] for more details. For the submission, each query returned a ranked set of prior cases judged to be relevant to the query, with the relevance of a case document scored between 0 and 1 (1 indicating most relevant). The results were evaluated using trec_eval.

4 Methodology

To find the relevant Precedents for a given query, we followed these steps:

Step 1: Pre-process all the case documents to build a search corpus (Section 4.2)
Step 2: Pre-process the query (Section 4.3)
Step 3: Rank the Precedents from the corpus using the query (Section 4.4)

4.1 Intuition

The queries and the case documents contained substantial information about names, places, organizations, currencies, time, etc. that is specific to a case (e.g. 'Gov. of Tamil Nadu', 'Indian Oil Corporation', '13 Rs.', 'January afternoon'). Such information can be ignored in order to focus on events such as 'murder', 'bribery', or 'stole' that give the primary information about the situation needed to perform relevance ranking.

4.2 Pre-processing of Case Documents

As the first step, we prepared the corpus according to our intuition for query extraction. We used spaCy [7] for pre-processing and performed the following steps on the 2,914 case documents:

1. Paragraph splitting of case documents:
A case document contains 10-40 paragraphs on average. These paragraphs give information about the background of the case, the situation, and the judgements. We were interested in comparing the results of performing a match on the whole case document versus on these individual paragraphs, so we split every case document into individual paragraphs.
• Improvement after submission: we took the entire case document without splitting it into paragraphs.

2. Tokenization and Named Entity Recognition (NER) of paragraphs:
We used spaCy's tokenization to break the paragraphs down into individual words called tokens. We performed NER on the tokenized sentences to find named entities such as persons, places, organizations, currencies, and time expressions.

3. Removal of Named Entities and Stop Words:
Using the Named Entities identified in the previous step and spaCy's predefined list of stop words, we removed the Named Entities and the Stop Words from the case documents (Fig. 1; a code sketch follows the figure).

Before pre-processing of case document:

1. These appeals are filed against the order dated 29.3.2001 passed by the Madras High Court allowing Crl.O.P. Nos.2418 of 1999.
2. The appellant (Indian Oil Corporation, for short 'IOC') entered into two contracts, one with the first respondent (NEPC India Ltd.) and the other with its sister company Skyline NEPC Limited ('Skyline' for short). According to the appellant, in respect of the aircraft fuel supplied under the said contracts, the first respondent became due in a sum of Rs.5,28,23,501 and Skyline became due in a sum of Rs.13,12,76,421 as on 29.4.1997.

After pre-processing of case document:

['1', '', 'these', 'appeals', 'are', 'filed', 'against', 'the', 'order', 'dated', '29.3.2001', 'passed', 'by', 'the', 'madras', 'high', 'court', 'allowing', 'crl.o.p.', 'nos.2418', 'of', '1999']
['2', 'the', 'appellant', 'indian', 'oil', 'corporation,', 'for', 'short', 'ioc', 'entered', 'into', 'two', 'contracts,', 'one', 'with', 'the', 'first', 'respondent', 'nepc', 'india', 'ltd', 'and', 'the', 'other', 'with', 'its', 'sister', 'company', 'skyline', 'nepc', 'limited', 'skyline', 'for', 'short', 'agreeing', 'to', 'supply', 'to', 'them', 'according', 'to', 'the', 'appellant,', 'in', 'respect', 'of', 'the', 'aircraft', 'fuel', 'supplied', 'under', 'the', 'said', 'contracts,', 'the', 'first', 'respondent', 'became', 'due', 'in', 'a', 'sum', 'of', 'rs.5,28,23,501', 'and', 'skyline', 'became', 'due', 'in', 'a', 'sum', 'of', 'rs.13,12,76,421', 'as', 'on', '29.4.1997.']

Fig. 1. Pre-processing of a paragraph
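For illustration, the tokenization and removal steps above could be implemented as the following minimal sketch; the spaCy model choice (en_core_web_sm) and the function name are our assumptions, not necessarily what was used in the submitted runs.

# Minimal sketch of the pre-processing in Section 4.2 (steps 2 and 3).
# Assumption: the en_core_web_sm spaCy model; the paper does not name
# the model actually used.
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Tokenize `text`, then drop tokens that are part of a named
    entity or appear in spaCy's stop-word list."""
    doc = nlp(text)
    return [tok.text.lower() for tok in doc
            if tok.ent_type_ == "" and not tok.is_stop]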
4.3 Pre-processing of the Query

On reading the queries, we found that they contained information such as the background of the situation, the situation itself, the subject of the appeal, and the participants. We define the Appeal Context as the set of sentences in the query that describe the appeal. As we were interested only in the information about the appeal, we extracted it from the query by finding the Appeal Context and then pre-processing this context.

1. Extract the Appeal Context:
We observed that most of the queries contained key words that help to identify the context. We used the following list of appeal-related key words: ['appeal', 'appeals', 'trial', 'hearing', 'plead', 'pleaded', 'appealing', 'cross-appeal', 'quash']. We selected 15 sentences per query containing and surrounding these key words (a sketch of this selection appears at the end of this subsection). For queries that did not contain any of these key words or were shorter than 15 sentences, we selected the entire query as the Appeal Context.
• Improvement after submission: we took all the sentences in the query as the Appeal Context.

2. Tokenization, Removal of Named Entities and Stop Words:
We tokenized the selected sentences and removed Named Entities and Stop Words from the Appeal Context (similar to the pre-processing of case documents).
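A minimal sketch of the Appeal Context selection in step 1 follows; the keyword list is taken from the text above, while the one-sentence window around each match and the helper names are our own illustrative assumptions.

# Sketch of Appeal Context extraction (Section 4.3, step 1).
# The keyword list is from the paper; the +/-1 sentence window is an
# assumption, since the paper only says "containing and surrounding".
APPEAL_KEYWORDS = {'appeal', 'appeals', 'trial', 'hearing', 'plead',
                   'pleaded', 'appealing', 'cross-appeal', 'quash'}

def appeal_context(sentences, limit=15):
    """Return up to `limit` sentences containing or surrounding an
    appeal keyword; fall back to the whole query otherwise."""
    hits = [i for i, s in enumerate(sentences)
            if APPEAL_KEYWORDS & set(s.lower().split())]
    if not hits or len(sentences) <= limit:
        return sentences  # no keyword found, or query already short
    # keep each matching sentence together with its neighbours
    keep = sorted({j for i in hits for j in (i - 1, i, i + 1)
                   if 0 <= j < len(sentences)})
    return [sentences[j] for j in keep[:limit]]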
4.4 Performing Precedent Retrieval

BM25 [4] is a 'bag-of-words' ranking function that estimates the relevance of documents to a search query. Term Frequency-Inverse Document Frequency (TF-IDF) [5] is a measure that helps to identify the words in a collection of documents that define the topic of a document. We used the gensim [6] implementations of BM25 and TF-IDF, with the cleaned Appeal Context as the query and the cleaned case documents as the corpus. Using BM25, TF-IDF, and a BM25-TF-IDF ensemble, we computed a score for every paragraph of every case document for a given query (see the sketch at the end of this subsection). The final score of a case document for a given query is the mean of the scores of its top 3 paragraphs. We ranked the case documents on a scale of 0 (least relevant) to 1 (most relevant) based on these scores.
• Improvement after submission: we cleaned and used the whole case documents (without paragraph splitting) and the entire query (without selecting the Appeal Context) for relevance ranking using BM25 and TF-IDF.
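The ranking step could look like the sketch below, assuming gensim 3.x (whose summarization.bm25 module was removed in gensim 4); the min-max normalization onto the 0-1 scale and all names here are our illustrative choices, not the paper's published code.

# Sketch of the ranking in Section 4.4, assuming gensim 3.x.
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity
from gensim.summarization.bm25 import BM25  # removed in gensim 4

def rank(query_tokens, docs_tokens):
    """Score every pre-processed document (or paragraph) against the
    query and min-max normalize the scores onto the 0-1 range."""
    # BM25 scores over the token corpus
    bm25_scores = BM25(docs_tokens).get_scores(query_tokens)

    # TF-IDF cosine similarity scores
    dictionary = Dictionary(docs_tokens)
    bows = [dictionary.doc2bow(d) for d in docs_tokens]
    tfidf = TfidfModel(dictionary=dictionary)
    index = SparseMatrixSimilarity(tfidf[bows],
                                   num_features=len(dictionary))
    tfidf_scores = index[tfidf[dictionary.doc2bow(query_tokens)]]

    def minmax(scores):
        lo, hi = min(scores), max(scores)
        return [(s - lo) / ((hi - lo) or 1.0) for s in scores]

    return minmax(bm25_scores), minmax(tfidf_scores)

# For the paragraph-level runs, a document's score would be the mean of
# its top-3 paragraph scores, e.g.:
#   doc_score = sum(sorted(par_scores, reverse=True)[:3]) / 3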
5 Result and Analysis

Table 1 shows the performance of the runs that we submitted. The results of the 'HLJIT2019-AILA_task1_2' run, which topped the leaderboard, are given for reference. All our runs appeared in the top 10 on the leaderboard.

Run ID                         P@10    MAP     BPREF   Reciprocal Rank
HLJIT2019-AILA_task1_2 (1st)   0.07    0.1492  0.1286  0.288
TFIDF (5th)                    0.05    0.0956  0.067   0.203
Ensemble (7th)                 0.04    0.0817  0.0591  0.162
BM25 (8th)                     0.0375  0.0773  0.0547  0.151

Table 1. Comparison of the performance of the different ranking approaches

5.1 Improvements after Submission

After the organizers made the test data public, we performed an ablation analysis and realized that splitting the case documents into paragraphs and selecting the Appeal Context were not improving the results; they were in fact deteriorating them. This could be because narrowing down the query and restricting the search to individual paragraphs led to missing some key information for comparison. In fact, the simple removal of Named Entities (NE) from both the case documents and the queries improved the ranking results substantially. Table 2 shows the results.

        Removed NE        Removed NE   P@10    MAP     BPREF   Recip. Rank
        from Case Docs    from Query
TFIDF   TRUE              TRUE         0.07    0.1743  0.1535  0.2771
        TRUE              FALSE        0.07    0.1723  0.1504  0.2738
        FALSE             TRUE         0.0575  0.1319  0.1204  0.1949
        FALSE             FALSE        0.0625  0.1644  0.1468  0.2449
BM25    TRUE              TRUE         0.0575  0.128   0.1163  0.2424
        TRUE              FALSE        0.0575  0.1261  0.1123  0.238
        FALSE             TRUE         0.05    0.1274  0.11    0.2545
        FALSE             FALSE        0.05    0.1487  0.1362  0.2679

Table 2. Comparison of results after submission

Removing Named Entities from both the query and the cases made the comparison more generic; for example, all the bribery cases, whether they happened in a police station, a bank, or a private company, were treated equally. According to Table 2, both TF-IDF and BM25 performed best when the named entities were removed from the query as well as from the case documents. At the same time, TF-IDF performed better than BM25 in all the cases.

6 Conclusion and Future Work

We have presented our approach for finding relevant Precedents in Task 1 of the AILA track at FIRE 2019. After our post-submission improvements, we found that simply removing the named entities gave the best results, which are comparable to the highest-ranked approach on the leaderboard.

The BM25 and TF-IDF algorithms used in this approach are both word-matching-based relevance ranking algorithms. As a result, a query containing 'kill' does not get matched to a case document containing 'murder'. The lack of exact matches prevented some case documents from getting a higher rank in spite of the situation being the same. In the future, we plan to improve our technique by considering the meaning of words, using word vectors while performing relevance ranking.

7 Acknowledgement

We would like to thank Girish Palshikar, Sachin Pawar, Dr. Kripabandhu Ghosh and Nitin Ramrakhiyani from TRDDC, Pune for their guidance during our brainstorming sessions. We also thank Dr. Vahida Attar, HOD, Department of Computer and IT, COEP, Pune for her support.

References

1. Bhattacharya, P., Ghosh, K., Ghosh, S., Pal, A., Mehta, P., Bhattacharya, A., Majumder, P.: Overview of the FIRE 2019 AILA track: Artificial Intelligence for Legal Assistance. In: Proc. of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019.
2. Wyner, A., Mochales-Palau, R., Moens, M.F., Milward, D.: Approaches to Text Mining Arguments from Legal Cases. In: Francesconi, E., Montemagni, S., Peters, W., Tiscornia, D. (eds.) Semantic Processing of Legal Texts. Lecture Notes in Computer Science, vol. 6036. Springer, Berlin, Heidelberg (2010).
3. Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P., Koudas, N., Papadias, D.: Query by document. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM 2009), pp. 34-43. ACM (2009).
4. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3(4), 333-389 (2009).
5. Rajaraman, A., Ullman, J.D.: Data Mining. In: Mining of Massive Datasets, pp. 1-17 (2011). doi:10.1017/CBO9781139058452.002. ISBN 978-1-139-05845-2.
6. Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (2010).
7. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017).