
Query Revaluation Method For Legal Information Retrieval

Liang Liu

Lexiao Liu

Zhongyuan Han

Beihang University, Beijing, China
Foshan University, Foshan, China
Heilongjiang Institute of Technology, Harbin, China

This paper describes in detail our method for the task of identifying relevant prior cases in artificial intelligence for legal assistance. We cast the problem as a retrieval task and modified the BM25 retrieval model to perform better on it. The improved method took second place on both MAP and BPREF.

Keywords: Legal Information Retrieval, Language Model, BM25, IDF, Identifying Relevant Prior Cases

1. Introduction

2. Methods

We treated the task of identifying relevant prior cases as an information retrieval task and submitted three runs based on BM25.

2.1.

According to our statistics, a query, which is a description of a situation in Query_doc, contains over 500 words on average, and a document, which is a prior case in Object_casedocs, contains over 3,000 words on average. Queries of this length are too long for traditional retrieval.

Consequently, we preprocess the data to shorten the text without losing its main meaning. Following common practice, we removed all stop words and converted the text to lowercase. Finally, we used the Lucene toolkit⁴ to index the documents.

2.2. double_liu_2020_1

For this submission, we chose the BM25 model and improved it by modifying its relevance calculation, as follows:

$$
\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{\mathrm{TF}(q_i, D) \cdot (k_1 + 1)}{\mathrm{TF}(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)} \tag{1}
$$

Formula (1) gives the definition of BM25, where $q_i$ is a word in $Q$, $|D|$ is the length of document $D$, $\mathrm{avgdl}$ is the average document length in the text collection, and $k_1$ and $b$ are the parameters of BM25. In this task, we set $k_1 = 2.99$ and $b = 0.65$.
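For illustration only, formula (1) can be sketched in a few lines of Python. This is not the submitted implementation, which relied on Lucene. The smoothed IDF variant, the toy collection, and the function names `idf` and `bm25` are assumptions introduced here, since the text does not define IDF.

```python
import math
from collections import Counter

K1 = 2.99  # value reported in the text
B = 0.65   # value reported in the text


def idf(term, docs):
    """Smoothed IDF (an assumption; the text does not specify the IDF variant)."""
    n_t = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n_t + 0.5) / (n_t + 0.5))


def bm25(doc, query, docs, k1=K1, b=B):
    """Score one tokenized document against a tokenized query, following formula (1)."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    score = 0.0
    for q in query:
        f = tf[q]  # TF(q_i, D); zero for terms absent from the document
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(q, docs) * f * (k1 + 1) / denom
    return score


# Hypothetical toy collection, already lowercased and tokenized.
docs = [
    ["appeal", "dismissed", "court", "land"],
    ["contract", "breach", "damages", "court"],
    ["land", "acquisition", "compensation", "land"],
]
query = ["land", "compensation"]

# Indices of the documents, best match first.
ranking = sorted(range(len(docs)), key=lambda i: bm25(docs[i], query, docs), reverse=True)
```

In this toy collection the third document matches both query words and is ranked first, while the second document shares no word with the query and scores zero.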

Furthermore, we modify the relevance computation to obtain an improved BM25:

$$
rel(D, Q) = \mathrm{BM25}(D, Q) + \mathrm{BM25}(D, Q') \tag{2}
$$

where $Q$ is the query with stop words removed and $Q'$ is a set of keywords further extracted from $Q$: we sort the words in $Q$ by IDF and take the top $m\%$ of them to form $Q'$. Here $m$ is a free parameter, and we set $m = 50$.

2.3. double_liu_2020_2

Inspired by this, we split the method of double_liu_2020_1 into two sub-methods, which form our double_liu_2020_2 and double_liu_2020_3 submissions.
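The relationship between the three runs can be sketched as follows. This is a minimal illustration under stated assumptions: the two halves of formula (2) are taken to be summed, "top" is read as highest IDF, $Q'$ is built from distinct words with the count rounded up, and the names `top_idf_terms`, `relevance`, and `toy_score`, as well as the IDF values, are hypothetical.

```python
import math


def top_idf_terms(query, idf, m=50):
    """Form Q' from the top m% of distinct query words, ranked by IDF.

    Assumptions: highest IDF first, ties broken alphabetically,
    and the number of kept words rounded up (at least one).
    """
    terms = sorted(set(query), key=lambda t: (-idf[t], t))
    keep = max(1, math.ceil(len(terms) * m / 100))
    return terms[:keep]


def relevance(score, doc, q, q_prime, variant="both"):
    """Combine a BM25-style scorer score(doc, terms) over Q and Q'."""
    if variant == "q_only":        # first half of formula (2): double_liu_2020_2
        return score(doc, q)
    if variant == "q_prime_only":  # second half of formula (2): double_liu_2020_3
        return score(doc, q_prime)
    return score(doc, q) + score(doc, q_prime)  # both halves: double_liu_2020_1


# Hypothetical IDF values and a stand-in scorer used only for this example.
idf = {"land": 0.5, "compensation": 1.2, "court": 0.1, "acquisition": 1.0}


def toy_score(doc, terms):
    """Stand-in for BM25: sum of the IDF of the matching terms."""
    return sum(idf[t] for t in terms if t in doc)


q = ["land", "compensation", "court", "acquisition"]
q_prime = top_idf_terms(q, idf, m=50)
doc = ["land", "acquisition", "compensation"]
full = relevance(toy_score, doc, q, q_prime)
```

Passing the scorer as an argument keeps the sketch independent of any particular BM25 implementation. With $m = 50$, the two highest-IDF words of the four-word query form $Q'$, so matches on those words are counted twice in the combined score.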

In the double_liu_2020_2 submission, we take the first half of formula (2) as the relevance score:

$$
rel(D, Q) = \mathrm{BM25}(D, Q) \tag{3}
$$

All other settings follow double_liu_2020_1.

2.4. double_liu_2020_3

For this submission, we take the second half of formula (2) as the relevance score:

$$
rel(D, Q) = \mathrm{BM25}(D, Q') \tag{4}
$$

All other settings also follow double_liu_2020_1.

2.5. Other methods

We also tried other experiments, but the results were not satisfactory.

2.5.1 Cosine Similarity

For this method, we rank documents by the cosine similarity between the query and the document. First, we use the bag-of-words model to construct vectors for the query and the document, and then we compute their cosine similarity and rank by it. The formula for cosine similarity is as follows:

$$
rel(D, Q) = Cos(A, B) = \frac{A \cdot B}{|A| \times |B|} = \frac{\sum_{i=1}^{n} (A_i \times B_i)}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}} \tag{5}
$$

where $A$ and $B$ are the two vectors.

2.5.2 Generalized Jaccard Similarity

In this method, we use generalized Jaccard similarity as the ranking indicator:

$$
rel(D, Q) = J(A, B) = \frac{\sum_{i=1}^{n} \min(A_i, B_i)}{\sum_{i=1}^{n} \max(A_i, B_i)} \tag{6}
$$

⁴ https://lucene.apache.org/

2.5.3 Cosine with Jaccard

In this method, we combine the previous two methods and introduce a weighting parameter $k$. The specific formula is as follows:

$$
rel(D, Q) = k \times Cos(A, B) + (1 - k) \times J(A, B) \tag{7}
$$

where $k$ is a free parameter, and we set $k = 0.3$.
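The three similarity scores of Section 2.5 can be sketched together in Python. This is an illustrative sketch, assuming raw term counts as the bag-of-words vector entries and a weighted sum for the combination. The function names and the toy query and document are hypothetical.

```python
import math
from collections import Counter


def bow_vectors(query, doc):
    """Bag-of-words count vectors over the shared, sorted vocabulary."""
    vocab = sorted(set(query) | set(doc))
    cq, cd = Counter(query), Counter(doc)
    return [cq[t] for t in vocab], [cd[t] for t in vocab]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generalized_jaccard(a, b):
    """Sum of element-wise minima over sum of element-wise maxima."""
    den = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / den if den else 0.0


def combined(a, b, k=0.3):
    """Weighted combination of the two scores, with k = 0.3 as in the text."""
    return k * cosine(a, b) + (1 - k) * generalized_jaccard(a, b)


# Hypothetical tokenized query and document.
a, b = bow_vectors(["land", "land", "court"], ["land", "court", "appeal"])
score = combined(a, b)
```

For this toy pair the vocabulary is (appeal, court, land), so the vectors are (0, 1, 2) and (1, 1, 1), and the generalized Jaccard similarity is 2/4 = 0.5.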

3. Results

3.1. Evaluation Measures

Standard information retrieval measures such as Precision, Recall, Mean Average Precision (MAP)⁵, Discounted Cumulative Gain (DCG), and Mean Reciprocal Rank (MRR) are used for evaluation in the task.

3.2. Evaluation Results

Run_ID
double_liu_2020_3
double_liu_2020_1
double_liu_2020_2
Jaccard
Cosine_Jaccard_k
Cosine

⁵ https://trec.nist.gov/pubs/trec16/appendices/measures.pdf

4. Conclusion

In this paper, we described a method that uses an improved BM25 to identify relevant prior cases. We conclude that extracting keywords from the query improves retrieval performance. Compared with the other submissions to the task, our improved BM25 model took second place on both MAP and BPREF.

5. Acknowledgments

This work is supported by the National Social Science Fund of China (No. 18BYY125).
