<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Track: Legal domain search with minimal domain knowledge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tobias Fink</string-name>
          <email>tobias.fink@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabor Recski</string-name>
          <email>gabor.recski@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <email>hanbury@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien, Faculty of Informatics, Research Unit E-commerce</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>We tackle Task 1 in the AILA 2020 shared task, where the goal is to retrieve precedent and statute documents related to a case document query in the Indian legal domain. We use BM25 with simple hyperparameter tuning and preprocessing for both precedent and statute retrieval and achieve a Mean Average Precision (MAP) of 0.1294 and 0.2619, respectively. We also experiment with removing frequent terms from the query as well as removing terms that produce high scores only in irrelevant documents, but both methods fail to improve the baseline results.</p>
      </abstract>
      <kwd-group>
        <kwd>information retrieval</kwd>
        <kwd>legal domain</kwd>
        <kwd>BM25</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In domain-specific information retrieval (IR), each domain comes with its own challenges and its
own language. Gaining an understanding of the domain-specific language and knowing which
words and phrases can help to distinguish documents is important for IR, but unfortunately
the intricacies of such a language are often difficult to understand and only known by domain
experts. For example, in the case law system, there is the need to retrieve precedents and
relevant statutes for legal documents, such as cases. However, due to the length of such a document,
it can contain passages about several topics, not all of which are helpful in distinguishing
between documents, and it might contain terms of which only some are related to relevant facts
and rules.</p>
      <p>
        In Task 1 of the FIRE 2020 AILA track [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the goal is to retrieve relevant precedent cases judged
by the Indian Supreme Court and statutes from Indian law for queries consisting of legal
case documents. A training set consisting of legal document queries as well as relevant and
irrelevant precedent and statute documents is provided. It can be challenging for a non-expert to
understand why a document is relevant or irrelevant for a particular query, because sometimes
relevant and irrelevant documents seemingly deal with similar topics. This is more pronounced
with longer documents, such as precedent case documents.
      </p>
      <p>
        To tackle this task using only the relevance information and the text data of the provided
training set, we use the well-known BM25 document ranking algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (implemented by the
open-source Lucene-based search tool elasticsearch, https://github.com/elastic/elasticsearch) and determine hyperparameters for BM25
based on a random search on the training set. Further, we use simple heuristics to detect query
terms that could be harmful to the desired search outcome and remove them from the query.
The heuristics decide whether a term should be removed based on the frequency of each term
in the corpus and the BM25 scores of each query term across a set of relevant and irrelevant
documents.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset and Related Work</title>
      <p>
        The dataset is partially taken from last year’s AILA2019 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and consists of 3,257 prior case
documents (2,914 old, 343 new) and 197 statute documents, as well as 50 training queries for which
the relevance of precedents and statutes is known and 10 test queries. The queries consist of a
paragraph of raw text and are as such rather different from search queries typically entered into
web search engines, which are usually much shorter. The mean (standard deviation) of relevant
documents per query is 3.9 (3.82) and 4.42 (0.67) for precedents and statutes, respectively.
For AILA 2019, there were many submissions successfully employing BM25 in some form
or another. One of the top performers in both precedent and statute retrieval, Zhao et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
employ a new relevance score created by calculating BM25 on a filtered query document and an
unfiltered query document and adding the two scores. The filtering is done by ranking the query
terms according to their IDF-scores and taking the top 50 highest scoring terms. Additionally,
they also experiment with a Word2Vec based similarity function, which works well for statute
retrieval but not precedent retrieval. Similarly, Gao et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] submitted runs using TF-IDF
or Textrank to first extract the top 60 to 80 words from the query and using a Vector Space
Model (VSM), BM25 and a Language Model (LM) for retrieval. For Task 1 the TF-IDF based
query extraction paired with the VSM achieves 2nd place, followed by TF-IDF paired with BM25
achieving 4th place. They did not submit any runs for Task 2. For Task 1, Shao et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] extract
sentences containing the phrase “high court” as key sentences and utilize VSM, LM and a VSM
+ Mixed-Size Bigram Model combination but only achieve rank 10 for this task. For Task 2, they
use the entire description and utilize VSM, LM and a VSM + BM25 combination. They achieve
rank 1 (VSM), rank 2 (VSM+BM25) and rank 3 (LM) for statute retrieval. While in these works
the BM25 hyperparameters k1 and b are set to static values, the choice of which values should
be selected is also a topic of research in IR. For example, in Lipani et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], b is instead calculated
based on the mean average term frequency of a collection. Due to the overall good performance
of BM25 based methods, we also opt to experiment with this retrieval method.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To retrieve data from the corpus, we create two indices using elasticsearch (https://www.elastic.co/elastic-stack), one for precedents
and one for statutes. While elasticsearch has its own stack of text pre-processing and analysis
tools, we opt to perform our text preprocessing outside of elasticsearch, since some of the
desired functionality, like lemmatization, is not supported by elasticsearch. Instead we use the
open-source natural language processing library spacy to tokenize the document text. We
further clean the text by removing punctuation tokens, numbers and typical English stopwords.
Finally, the tokens for each document are lemmatized, lowercased and then added to a single
indexed field. We generate our queries by applying the same procedure to the query documents,
but since the query documents can be very long and occasionally exceed the elasticsearch max
clause limit of 1024, we remove all duplicate tokens from the resulting list of tokens.</p>
      <sec id="sec-3-1">
        <title>3.1. Ranking Method</title>
        <p>We score the documents using the commonly used Okapi BM25 ranking function (as
implemented in elasticsearch), which is calculated using the following formula:
BM25(D, Q) = Σ_{q ∈ Q} IDF(q) ⋅ (f(q, D) ⋅ (k1 + 1)) / (f(q, D) + k1 ⋅ (1 − b + b ⋅ |D| / avgdl)) (1)
where D is the document to be scored, Q is the query, q is a query term/token occurring in the
query, f(q, D) is the term frequency of token q in document D, |D| is the length of the document
in tokens and avgdl is the average document length for documents in the collection. Further,
k1 and b are hyperparameters and IDF(q) is the inverse document frequency calculated by this
formula:
IDF(q) = ln((N − n(q) + 0.5) / (n(q) + 0.5) + 1) (2)
where N is the total number of documents and n(q) is the number of documents containing
query term q.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hyperparameter Search</title>
        <p>Since we do not know what the best hyperparameter values for k1 and b are for our two tasks,
we decided to experiment with the selection of the values. Instead of taking the often used
values of k1 = 1.2 and b = 0.75, we do a random search to determine the values that best fit the
collection. We repeatedly select random values from an interval of [1.2, 2.0] for k1 and [0.0, 1.0]
for b, run our 50 training queries and evaluate the results. We take the values that resulted in
the best performance after 30 repetitions as our final values. We use the mean average precision
(MAP) metric to quantify the performance of an iteration, shown in the following formula:
MAP = (1 / |Q|) ⋅ Σ_{q ∈ Q} AP(q) (3)
where Q is the set of training queries, q is a single query and |Q| is the number of queries. Further,
the average precision AP(q) of a query q is calculated with the following formula:
AP(q) = (1 / |R_q|) ⋅ Σ_{k=1}^{n} P@k(q) ⋅ rel_q(k) (4)
where n is the number of retrieved documents, P@k(q) is the Precision @ k for query q, |R_q| is
the number of relevant documents for query q and rel_q(k) is 1 if the document at rank k is relevant,
otherwise 0.</p>
      </sec>
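      <p>For illustration, the BM25 ranking function and its IDF component can be sketched in a few lines of plain Python. This is a minimal toy implementation of the formulas, not the Lucene implementation used by elasticsearch (which adds its own analysis and optimizations); the toy corpus and all names are our own.</p>

```python
import math

def idf(term, docs):
    # inverse document frequency: ln((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
    n_q = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n_q + 0.5) / (n_q + 0.5) + 1)

def bm25(doc, query, docs, k1=1.2, b=0.75):
    # sum of per-term scores over the (deduplicated) query terms
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for q in set(query):
        f = doc.count(q)  # term frequency of q in doc
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(q, docs) * f * (k1 + 1) / norm
    return score

# toy corpus: each document is a list of preprocessed tokens
docs = [["court", "appeal", "murder"],
        ["court", "tax", "income"],
        ["land", "acquisition", "act"]]
query = ["murder", "appeal"]
ranked = sorted(range(len(docs)), key=lambda i: bm25(docs[i], query, docs), reverse=True)
```

Ranking the toy corpus this way places the document containing both query terms first.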
      <sec id="sec-3-3">
        <title>3.3. Finding problematic terms</title>
        <p>As we are unfamiliar with the Indian legal domain and consequently do not know the typical
keywords and phrases of the domain, we attempt to gain some insight into the domain using
the relevance judgements that we have. We looked at the BM25 scores assigned to individual
query terms q (see Formula 1) in relevant and irrelevant documents and noticed a few issues:
• If there are enough query terms with a high term frequency and a high document
frequency, like “court”, this can cause an irrelevant document to be ranked higher than a
relevant one.
• There are query terms that have a high score in irrelevant documents, but not in relevant
ones, because they are either less frequent in relevant documents or do not occur there.
• Some documents that are relevant for poor-performing queries appear to be
suppressed by irrelevant documents. In these documents most high-scoring terms also
appear in irrelevant documents and have a higher term frequency (relative to document
length) there, while at the same time the documents contain no high-scoring terms that are
unique to them.</p>
        <p>Based on these findings, we develop a heuristic to detect query terms that would cause irrelevant
documents to be ranked higher than relevant ones. These “additional stopwords” detected by
the heuristic are then removed from the query and every remaining term of the query is treated
as a search term. We experiment with the following approaches for detecting these “additional
stopwords” in the query:
Word Count: We filter out the most frequent words in the corpus. We preprocess precedents
and statutes and count how often each term occurs in each respective corpus. Using this
information, we add the 200 most frequent terms to our list of “additional stopwords”. This is done
for precedent and statute documents separately.</p>
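        <p>The Word Count heuristic reduces to a corpus-wide frequency count. The following sketch is our own illustrative code, with a toy corpus and a top-n of 2 instead of the 200 used in our runs:</p>

```python
from collections import Counter

def frequent_terms(corpus, n):
    # count how often each term occurs across all preprocessed documents
    counts = Counter(token for doc in corpus for token in doc)
    # the n most frequent terms become "additional stopwords"
    return {term for term, _ in counts.most_common(n)}

# toy corpus of preprocessed (tokenized, lowercased, lemmatized) documents
corpus = [["court", "appeal", "court"],
          ["court", "section", "act"],
          ["act", "court", "appeal"]]
extra_stopwords = frequent_terms(corpus, 2)   # our runs use n = 200
query = ["court", "murder", "appeal"]
filtered = [t for t in query if t not in extra_stopwords]
```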
        <p>False Friends: We measure the BM25 scores assigned to individual query terms q (see Formula
1) of our training queries and compare the results for relevant and irrelevant documents. Using
the static hyperparameters k1 = 1.2 and b = 0.75, we calculate a ranking for each training
query. Then we retrieve the scores of each term for each relevant document and for the first
100 irrelevant documents, using the elasticsearch explain functionality. For each query term q
across all training queries, we calculate a classification from the maximum score of that token
over all retrieved relevant documents and, separately, over all retrieved irrelevant documents.
The classification of a token is ’Not Found’ if the query token was not found in the retrieved
documents, ’Low’ if the maximum score was at or below the threshold t and ’High’ if the
maximum score was above t. We add those tokens that are classified ’High’ for irrelevant
documents and ’Low’ or ’Not Found’ for relevant documents to our stopword list. Based on a
separate grid search experiment, we set the parameter t = 1.5.</p>
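        <p>The labeling step of the False Friends heuristic can be sketched as follows, assuming the per-term maximum BM25 scores have already been collected (e.g. from the elasticsearch explain output); the function names and toy scores are illustrative:</p>

```python
def label(max_score, t=1.5):
    # 'Not Found' if the term never occurred, otherwise 'High'/'Low' relative to t
    if max_score is None:
        return "Not Found"
    return "High" if max_score > t else "Low"

def false_friends(term_scores, t=1.5):
    # term_scores maps term -> (max score in relevant docs, max score in
    # irrelevant docs); None means the term was not found there
    stopwords = set()
    for term, (rel_max, irr_max) in term_scores.items():
        if label(irr_max, t) == "High" and label(rel_max, t) in ("Low", "Not Found"):
            stopwords.add(term)
    return stopwords

# toy scores: 'session' scores highly only in irrelevant documents
scores = {"murder": (3.2, 0.4), "session": (0.7, 2.9), "court": (None, 1.2)}
extra = false_friends(scores)
```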
        <p>However, tests with these methods on the training data using cross-validation showed that they
did not consistently improve the retrieval results. Due to a lack of further development time,
we submitted these methods as runs to measure their performance on the test set.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We submitted three runs for precedents and statutes each. The basic run only performs
the hyperparameter search, while the word_count and false_friends run both include the
hyperparameter search and their respective method of detecting additional stopwords. The
results of these runs are shown in Table 1 and show that among our runs the basic method
generally achieves the best results. Compared to the other groups, our best method can be
found around the middle of the ranking. The best overall precedent retrieval MAP was 0.1573
(run UB-3) and the best overall statute retrieval MAP was 0.3851 (run scnu_1).</p>
      <p>This tells us that using BM25 with some preprocessing and hyperparameter tuning is still a good
start when trying to work with a new domain. However, our method of removing additional
stopwords from the query proved detrimental and other methods of extracting keywords from
document queries should be considered. Removing the most frequent terms from a query either
does not retrieve more relevant documents or makes relevant documents more difficult
to retrieve, hinting that these tokens still carry some useful information even if they are very
frequent. Also, the way we remove terms based on their BM25 scores might be very prone to
overfitting on the training set. It might still be possible that these removed terms are important
for unknown relevant documents of unknown queries. A better way to work with such a
’High/Low’ classification might be to assign higher weights to (boost) query terms that we
know score highly in relevant documents.</p>
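      <p>As a hypothetical sketch of this boosting idea (not something we submitted), one could build an elasticsearch bool query in which terms known to score highly in relevant documents receive a term-level boost; the field name text and the boost value are assumptions:</p>

```python
def boosted_query(terms, boosted_terms, boost=2.0, field="text"):
    # build an elasticsearch bool/should query; terms known to score highly
    # in relevant documents are weighted up via the term-level boost parameter
    should = []
    for term in terms:
        clause = {"term": {field: {"value": term}}}
        if term in boosted_terms:
            clause["term"][field]["boost"] = boost
        should.append(clause)
    return {"query": {"bool": {"should": should}}}

query = boosted_query(["court", "murder"], {"murder"})
```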
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We conclude that BM25 can be a good starting point when working with an unfamiliar domain.
In the Indian legal domain and with little hyperparameter tuning, it achieves a MAP about 18%
lower than the top result on precedent retrieval and about 32% lower than the top result on
statute retrieval. We attempted to utilize word counts and the BM25 query token scores of
training queries to detect unimportant or harmful tokens as additional stopwords. However,
removing either from the query document did not improve results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Project partly supported by BRISE-Vienna (UIA04-081), a European Union Urban Innovative
Actions project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Overview of the FIRE 2020 AILA track: Artificial Intelligence for Legal Assistance</article-title>
          ,
          <source>in: Proceedings of FIRE 2020 - Forum for Information Retrieval Evaluation</source>
          , Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          , Okapi at TREC-3, Nist Special Publication Sp
          <volume>109</volume>
          (
          <year>1995</year>
          )
          <fpage>109</fpage>
          . Publisher: National Institute of Standards &amp; Technology.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>FIRE 2019 AILA Track: Artificial Intelligence for Legal Assistance</article-title>
          ,
          <source>in: Proceedings of the 11th Forum for Information Retrieval Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>FIRE2019@ AILA: Legal Information Retrieval Using Improved BM25</article-title>
          ., in: Working Notes of FIRE 2019 -
          <article-title>Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2517</volume>
          , Kolkata, India,
          <year>2019</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          , H. Qi,
          <article-title>FIRE2019@ AILA: Legal Retrieval Based on Information Retrieval Model</article-title>
          ., in: Working Notes of FIRE 2019 -
          <article-title>Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2517</volume>
          , Kolkata, India,
          <year>2019</year>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          , THUIR@ AILA 2019:
          <article-title>Information Retrieval Approaches for Identifying Relevant Precedents and Statutes</article-title>
          ., in: Working Notes of FIRE 2019 -
          <article-title>Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2517</volume>
          , Kolkata, India,
          <year>2019</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <article-title>Verboseness fission for BM25 document length normalization</article-title>
          ,
          <source>in: Proceedings of the 2015 International Conference on the Theory of Information Retrieval</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>