Combining lexical and neural retrieval with longformer-based summarization for effective case law retrieval

Arian Askari (1), Suzan Verberne (1)
(1) Leiden Institute of Advanced Computer Science, Leiden University
a.askari@liacs.leidenuniv.nl (A. Askari); s.verberne@liacs.leidenuniv.nl (S. Verberne)

Abstract
In this paper, we combine lexical and neural ranking models for case law retrieval. In this task, the query is a full case document, and the candidate documents are prior cases that are potentially relevant to the current case. Most documents are longer than 1024 tokens, which makes retrieval and classification with Transformer-based models problematic. We create shorter query documents with different methods: term extraction, noun phrase extraction, entity extraction, and automatic summarization using the Longformer-Encoder-Decoder (LED). We then combine the summaries with five different ranking models: a BM25 ranker, statistical language modelling, the Deep Relevance Matching Model (DRMM), a Vanilla BERT ranker, and a Longformer ranker. We optimised all models and combined the best lexical ranker with neural retrieval models using different ensemble classifiers. We evaluate our methods on the retrieval benchmarks from COLIEE'20 and COLIEE'21, and we beat state-of-the-art models for case law retrieval on both benchmark sets. Our experiments show the importance of tuning lexical retrieval methods, summarizing query documents, and combining lexical and neural models into one ranker for effective case law retrieval. In addition, training and optimizing our rankers is much faster than for passage-level retrieval models (a few hours compared to several days of training).

Keywords: Legal Information Retrieval, Query Summarization

1. Introduction

In countries with common law systems, finding supporting precedents for a new case is vital for a lawyer to fulfill their responsibilities to the court. However, with the large amount of digital legal records – the number of filings in the U.S. district courts for total cases and criminal defendants was 544,460 in 2020 (https://www.uscourts.gov/statistics-reports/judicial-business-2020) – it takes a significant amount of time for legal professionals to scan for specific cases and retrieve the relevant sections manually. Studies have shown that attorneys spend approximately 15 hours per week seeking case law [1].

This workload necessitates information retrieval (IR) systems specifically designed for the legal domain. The Competition on Legal Information Extraction/Entailment (COLIEE) is a workshop that has been organized since 2014 as a series of evaluation competitions related to case law [2]. COLIEE defines four tasks. In this paper, we address legal case retrieval (Task 1). One of the challenges of case law retrieval in COLIEE'20 is that the input is a long case document with a median length of 2,815 words instead of a keyword query. Four approaches to this problem exist. The first is the use of unsupervised keyword extraction methods to create short queries from the long query document [3]. A variant is the unsupervised extraction of phrases or entities as query terms [4]. The second approach, proposed by Tran et al. [5], is to train a supervised phrase scoring model for n-gram phrases to select the phrases that are semantically closest to an expert-written summary. The third approach, proposed by Rossi and Kanoulas [6], is to use document summarization methods for creating shorter query documents. The fourth approach, successfully employed by Shao et al. [4] and Westermann et al. [7], is to analyze the documents on the level of individual paragraphs and then aggregate the paragraph scores into a document ranking.
In this paper, we use automatic summarization for creating query documents. We experiment with term extraction, noun phrase extraction, and supervised text summarizers. As opposed to prior work, we approach the task as an abstractive summarization problem. The current state of the art in abstractive summarization is the use of Transformer models [8, 9]. However, the input of the available pre-trained models of these architectures is limited to 1024 tokens, and the majority of case law documents in our collection is longer than that. Beltagy et al. [10] proposed the Longformer-Encoder-Decoder (LED), which is a Transformer variant that supports much longer inputs. In this paper, we evaluate the effectiveness of LED for case law retrieval.

We evaluate summarization with multiple ranking models: probabilistic lexical ranking (BM25), statistical language modelling, the Deep Relevance Matching Model (DRMM), and Transformer-based architectures including the Vanilla BERT model from Contextualized Embeddings for Document Ranking (CEDR) [11]. Although we summarize the query documents, the lengths of the documents in the retrieval collection still cause problems for Transformer-based ranking architectures, which are limited to 512 tokens as the input length. To solve that problem, we implemented a Vanilla Longformer ranker [10], with 4096 tokens as the input length, in a pair-wise ranking setting similar to Vanilla BERT. After we have optimized each individual ranker, we experiment with ensemble methods that combine lexical rankers with neural rankers.

Our contributions are four-fold: (1) we deliver a fine-tuned Longformer-Encoder-Decoder (LED) for abstractive summarization of legal documents; (2) we deliver a pairwise Longformer ranker for long documents; (3) we show that summarizing query documents with LED improves all ranking models; (4) we show that the combination of a lexical ranker and a Vanilla BERT ranker in a simple ensemble classifier outperforms all baselines on COLIEE'20, and that an optimized BM25 ranker with keyword queries beats the state-of-the-art models on COLIEE'21.
2. Related Work

2.1. Case law retrieval

Locke et al. [3] investigate query generation from legal decisions using unsupervised keyword extraction models. They find that the best performing model is Kullback-Leibler divergence for informativeness (KLI) [12] and that the automatically generated queries were more effective than the average Boolean queries from experts.

The majority of the work addressing case law retrieval takes place in the context of COLIEE, the Competition on Legal Information Extraction and Entailment [13, 2]. Rossi and Kanoulas [6] proposed a pairwise ranking model based on BERT. They apply automatic summarization using TextRank to make the input document length suitable for use in the BERT-based ranker. The most successful team in COLIEE 2019 is Tran et al. [14, 5] ('JNLP'). They train a phrase scoring model that extracts n-gram phrases (n >= 5) to summarize the documents. The weights are calculated based on the phrase scoring framework that was trained on COLIEE'18 summaries. The authors achieved the state-of-the-art result on COLIEE'19, thereby showing the importance of query document summarization.

In COLIEE'20, the two best-performing teams use paragraph-level analyses to cope with the challenge of long documents. Shao et al. [4] ('TLIR') participate with their method BERT-PLI [15], which models paragraph-level interactions. They combine BERT-PLI with lexical matching features using a word-entity duet model. The features are different lexical rankers (BM25, probabilistic language modelling) – without optimization – on the full document content, and entities extracted by NLTK. They reach competitive results. The method by Westermann et al. [7] ('cyber') selects the top-30 candidate documents using a paragraph similarity score based on the Universal Sentence Encoder, and then applies an SVM model to the TF-IDF representations of the query document and the candidate documents.

We use the supervised phrase-based summarization of Tran et al. [14, 5] as comparison for our summarization task, and the highest F-score obtained in COLIEE'20 by Westermann et al. [7] as comparison for our retrieval task.

2.2. Summarization of long documents

Kanapala et al. [16] and Van de Luijtgaarden [17] give an extensive overview of research on legal document summarization up to 2019. Here we focus on recent abstractive summarization models.

Pre-trained encoder-decoder Transformer models (e.g. BART [8] and T5 [9]) have achieved strong results in abstractive summarization tasks. However, pre-trained models of these architectures are limited to texts that are shorter than 1024 tokens. Legal documents are commonly longer than that; 72% of the query documents in COLIEE'20 are. Recently, Beltagy et al. [10] proposed the Longformer-Encoder-Decoder (LED), which is a Transformer variant that supports sequence-to-sequence tasks for longer documents (up to 16k tokens). The authors show that LED outperforms the state-of-the-art models on the arXiv dataset. In this paper, we evaluate the effectiveness of LED for case law retrieval.

2.3. Transformer-based document ranking

The typical approach in neural IR is two-step retrieval, where a first set of documents is retrieved using a traditional ranker (e.g. BM25) and those documents are re-ranked by a neural model that is trained on the relevance assessments [18, 11, 19].

MacAvaney et al. [11] propose a joint approach that integrates the classification vector of BERT into existing neural models like the Deep Relevance Matching Model (DRMM), resulting in Contextualized Embeddings for Document Ranking (CEDR). They fine-tune a pre-trained BERT model with a linear combination layer stacked atop the [CLS] token as the Vanilla BERT ranker, with pairwise cross-entropy loss and the Adam optimizer. The authors demonstrate that Vanilla BERT and CEDR outperform the state-of-the-art baselines in ad-hoc ranking.
However, local-interaction neural ranking architectures like DRMM are not scalable to long documents, or they need heavy interaction between word pairs in the query and document. Therefore, we will compare our method to the Vanilla BERT ranker.

More recent work has addressed the challenges of ranking for long documents. Hofstaetter et al. [20] propose a local attention Transformer model that uses a moving window over the document terms; each term attends only to other terms in the same window. They obtain results significantly better than other state-of-the-art models on the TREC Deep Learning track. Sekulic et al. [19] take a full-document approach by training a Longformer model for ranking in ad-hoc retrieval. The results they report on the MS MARCO set are low compared to the leaderboard results. Our Longformer ranker is similar to [19], but instead of implementing the ranker as a one-versus-all classifier, we train it in a pair-wise setting, and we evaluate it for case law retrieval.

3. Methods

3.1. Summarization

We experiment with three approaches to creating shorter query documents: (1) term extraction, (2) noun phrase or entity extraction, and (3) abstractive summarization.

Summarization through term extraction
We adopted Kullback-Leibler divergence for Informativeness (KLI), similar to Locke et al. [3], in the implementation of Verberne et al. [21]. For each term t in a query document, we computed the KLI score:

KLI(t) = P(t|D) × log( P(t|D) / P(t|C) )     (1)

where P(t|D) is the probability of t in the query document D and P(t|C) is the probability of t in a background language model. We use all candidate documents as the background collection to compute P(t|C). We only consider unigrams as terms and we lowercase them. (We tried to extract n-gram phrases (2 < n <= 5) with gamma = 0.8 but obtained the best results with unigrams.) We then selected the top 10% of the total number of terms in the document, ranked by KLI score, as the query.
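To make the term-selection step concrete, the following is a minimal Python sketch of the KLI scoring of Equation (1). The whitespace tokenization, the add-one smoothing of the background model, and the function name are illustrative assumptions; they are not the implementation of Verberne et al. [21] used in the paper.

```python
from collections import Counter
import math

def kli_query_terms(query_doc, candidate_docs, top_fraction=0.10):
    """Select query terms by KLI(t) = P(t|D) * log(P(t|D) / P(t|C))."""
    doc_counts = Counter(query_doc.lower().split())            # term counts in the query document D
    bg_counts = Counter()                                      # background collection C = all candidate documents
    for doc in candidate_docs:
        bg_counts.update(doc.lower().split())

    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab_size = len(bg_counts) + 1

    scores = {}
    for term, freq in doc_counts.items():
        p_d = freq / doc_total                                 # P(t|D)
        p_c = (bg_counts[term] + 1) / (bg_total + vocab_size)  # add-one smoothed P(t|C)
        scores[term] = p_d * math.log(p_d / p_c)

    k = max(1, int(top_fraction * len(scores)))                # keep the top 10% of the document's terms
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```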
Entity and phrase extraction
Inspired by Shao et al. [4], we used NLTK (https://www.nltk.org/api/nltk.chunk.html?#nltk.chunk.util.tree2conlltags and https://www.nltk.org/api/nltk.chunk.html#module-nltk.chunk.named_entity) to extract noun phrases and named entities from the content of query documents and candidate documents, and used the extracted strings as the document representations (see Section 4.4 for the combinations we experimented with).

Abstractive summarization
We experimented with the pre-trained LED model (https://huggingface.co/allenai/led-large-16384-arxiv) of Beltagy et al. [10], which can process documents of up to 16k tokens as input. We also fine-tuned LED on the COLIEE'18 dataset, in which more than 80% of documents have summaries (in COLIEE'19 and COLIEE'20, the large majority – 82% – of the candidate cases do not have a summary). After removing duplicates, 6,257 unique documents are left for which a summary is available. For evaluation purposes, we trained the model in a k-fold cross-validation setting (k = 10) for one epoch per fold, each with batch size 1. We kept the other hyperparameters (optimizer, dropout, weight decay) identical to [10] and set the global attention on the first (<s>) token. We only summarized query documents with LED for the lexical rankers, since these have no limitation on input length. On the other hand, since Transformer-based neural models are limited in input length for the candidate document content, we also experiment with summarizing candidate documents besides query documents for the best Transformer-based neural model in our experiments.

3.2. Ranking models

As introduced in Section 3.2.1, we rank the 200 candidate documents for each query document in COLIEE'20 with multiple retrieval models. We optimise the hyperparameters of each method on a validation set (see Section 4.3). In the following, we introduce each ranker that we use for legal case retrieval.

3.2.1. Lexical rankers

BM25
We indexed the COLIEE'20 collection with Elasticsearch. The collection has 200 candidate documents for each query that need to be ranked. We used BM25 with the default parameter values k1 = 1.2 and b = 0.75, as well as with optimized hyperparameter values.

Language Modelling
We used the built-in similarity functions of Elasticsearch for the implementation of Language Modelling (LM) with two different smoothing methods: Dirichlet smoothing and Jelinek-Mercer (JM) smoothing. We only report the results for JM smoothing, since we obtained similar results with these two smoothing methods. We also optimised the hyperparameter value (lambda) for Language Modelling with Jelinek-Mercer smoothing (LM JM).
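As an illustration of the kind of index configuration behind these lexical rankers, the sketch below defines a BM25 similarity and an LM Jelinek-Mercer similarity in Elasticsearch. The index and field names and the elasticsearch-py 7.x-style client call are assumptions; the parameter values b = 0.9, k1 = 2.8 and lambda = 0.1 are the optimised values reported in Section 4.3, and the indexing pipeline actually used in the paper may differ.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="coliee2020",                                  # assumed index name
    body={
        "settings": {
            "index": {
                "similarity": {
                    "bm25_optimised": {"type": "BM25", "k1": 2.8, "b": 0.9},
                    "lm_jm": {"type": "LMJelinekMercer", "lambda": 0.1},
                }
            }
        },
        "mappings": {
            "properties": {
                # the same text is indexed twice so that both similarities can be queried
                "text_bm25": {"type": "text", "similarity": "bm25_optimised"},
                "text_lmjm": {"type": "text", "similarity": "lm_jm"},
            }
        },
    },
)
```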
3.2.2. Neural rankers

Deep Relevance Matching Model (DRMM)
As the DRMM architecture [18] is based on a local interaction input matrix, only a limited query length is possible. We took the top 10% of query terms sorted by the KLI score for each query document. Then, we selected the average resulting number of terms, 70, as the maximum query length. Thus, we used the top-70 query terms as the query in DRMM. For calculating the cosine similarity in the local interaction matrix, we trained word2vec on the texts of the query candidates, as suggested by the DRMM authors. Furthermore, we optimize the network configuration of DRMM to find the best combination of layers and neurons for legal case retrieval on COLIEE (see details in Section 4.3).

Vanilla BERT ranker
For Vanilla BERT, we fine-tuned a pre-trained BERT model (BERT-Base, Uncased) with a linear combination layer stacked atop the classifier [CLS] token on the COLIEE dataset, in a pairwise cross-entropy loss setting using the Adam optimizer. We used the implementation of MacAvaney et al. [11] (CEDR). We represent the query as sentence A and the document as sentence B in the BERT input:

"[CLS] query document [SEP] candidate document [SEP]"

We truncate the query and candidate document text since the BERT tokenizer is limited to 512 tokens.

Vanilla Legal BERT ranker
For Vanilla Legal BERT, we used LEGAL-BERT [22], which was pre-trained on legal data.

Vanilla Longformer ranker
Since the input length is limited in Vanilla BERT, we implemented the Vanilla Longformer in CEDR [11] as a ranker which can receive 4096 tokens instead of 512, and therefore has a better chance of working effectively than Vanilla BERT. In Longformer, the [CLS] and [SEP] tokens are replaced by the tokens <s> and </s>, respectively. As suggested in the Longformer paper [10], we calculate the loss based on the <s> token, with the addition of global attention to the <s> token. Inspired by Sekulic et al. [19], we feed the query document as sequence A and the candidate document as sequence B to the tokenizer, which yields the following input to the model:

"<s> query document </s> candidate document </s>"

Our code is integrated with Vanilla BERT in CEDR [11] and is available for future work (https://anonymous.4open.science/r/vanilla_longformer-D552/README.md).
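The pairwise input construction for both Transformer rankers can be sketched with Hugging Face tokenizers as below. The checkpoint names and the 'longest_first' truncation strategy are assumptions for illustration; the paper's CEDR-based implementation handles truncation and batching differently.

```python
import torch
from transformers import BertTokenizerFast, LongformerTokenizerFast

bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
long_tok = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

query_summary = "summary of the query case ..."
candidate_doc = "full text of the candidate case ..."

# BERT: [CLS] query document [SEP] candidate document [SEP], truncated to 512 tokens
bert_inputs = bert_tok(query_summary, candidate_doc,
                       truncation="longest_first", max_length=512, return_tensors="pt")

# Longformer: <s> query </s></s> candidate </s> (RoBERTa-style pair encoding), truncated to 4096 tokens
long_inputs = long_tok(query_summary, candidate_doc,
                       truncation="longest_first", max_length=4096, return_tensors="pt")

# global attention on the first (<s>) token, following the Longformer paper
global_attention_mask = torch.zeros_like(long_inputs["input_ids"])
global_attention_mask[:, 0] = 1
long_inputs["global_attention_mask"] = global_attention_mask
```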
3.2.3. Ensemble models

For combining the advantages of neural rankers and lexical rankers in one integrated system, we train ensemble models that take the scores of multiple rankers as features. We experiment with four different classifiers for this purpose: SVM with a linear kernel, SVM with an RBF kernel, Naive Bayes, and Multi-Layer Perceptron (MLP). We experiment with different combinations of rankers' scores to find the combination of rankers that yields the most effective ranking.

4. Experiments and results

4.1. Data

For our experimental evaluation, we work with data from the COLIEE competitions in 2018, 2020, and 2021. The COLIEE'18 data contains human-written summaries of the case documents, which we use for training and evaluating our summarization models (Section 4.2). For our retrieval experiments (Section 4.3), we use data from COLIEE'20 and '21 (https://sites.ualberta.ca/~rabelo/COLIEE2020/ and https://sites.ualberta.ca/~rabelo/COLIEE2021/). The Federal Court of Canada provided case law documents with metadata for Task 1. The metadata contains references to the noticed cases, which are the gold relevance labels for the query document. In COLIEE'20, there is a pool of 200 candidates for each query document and the competitors should re-rank a limited number of documents per query. The pool of candidates includes the noticed cases and non-relevant candidates, which are selected randomly. In contrast, in COLIEE'21, the whole collection should be considered per query, without a pool of candidates. This difference makes Task 1 in COLIEE'21 more difficult than in COLIEE'20.

We use the COLIEE'20 data to evaluate all ranking models and ensembles described in Section 3. In the COLIEE'20 data, there are 520 query documents in the train set and 130 in the test set, with 104,000 candidate documents in the train set and 26,000 in the test set. The average length of the documents in the test set is 3,232 words, with outliers up to 10,827 words. After we have found the best-performing rankers and ensemble, we evaluate those on the COLIEE'21 data and compare the results to the best results reported in the competition. In the COLIEE'21 data, there are 650 query documents in the train set and 250 in the test set, with 4,415 documents as candidate documents for both the train set and the test set. The average length of the candidate documents is 1,274 words, with outliers up to 76,818 words.

Table 1
Summarization results in terms of ROUGE for COLIEE'18. Summary length is 10% of the original text.

Model                           ROUGE-1 (Pre/Rec/F1)     ROUGE-2 (Pre/Rec/F1)     ROUGE-SU6 (Pre/Rec/F1)
JNLP [5]                        0.482 / 0.409 / 0.405    0.186 / 0.152 / 0.152    0.258 / 0.199 / 0.167
Pre-trained LED on arXiv [10]   0.634 / 0.164 / 0.260    0.299 / 0.072 / 0.116    0.332 / 0.078 / 0.127
Fine-tuned LED                  0.620 / 0.304 / 0.408    0.295 / 0.138 / 0.188    0.319 / 0.147 / 0.201

4.2. Summarization experiments

We set the local attention window size to 512 tokens. To limit memory use, we use gradient checkpointing and set the input size in training to 8,192 tokens, which covers more than 86% of COLIEE'18 documents completely (the longer documents are truncated at 8,192 tokens). We set the maximum length for generating a summary of an unseen document to 10% of the length of the original text.

We compare our fine-tuned LED to the pre-trained LED [10] and to the summarizer of Tran et al. [5] (JNLP) on the COLIEE'18 data. Table 1 shows that our fine-tuned summarizer outperforms the baseline in terms of F-measure for ROUGE-1, ROUGE-2 and ROUGE-SU6 on the COLIEE'18 dataset. The pre-trained LED summarizer obtains the highest precision-ROUGE scores, and the JNLP baseline the highest recall-ROUGE scores.
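A minimal sketch of summary generation with the pre-trained LED checkpoint is shown below. The 8,192-token input limit, the global attention on the first token, and the summary length of roughly 10% of the input follow the description above; the beam size is an assumption, and the fine-tuning loop (10-fold cross-validation, batch size 1) is omitted.

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tok = AutoTokenizer.from_pretrained("allenai/led-large-16384-arxiv")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv")
model.gradient_checkpointing_enable()            # limits memory use during fine-tuning

def summarize(case_text):
    inputs = tok(case_text, truncation=True, max_length=8192, return_tensors="pt")
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1              # global attention on the first token
    target_len = max(32, inputs["input_ids"].shape[1] // 10)   # ~10% of the input length
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        max_length=target_len,
        num_beams=4,                             # assumed decoding setting
    )
    return tok.decode(summary_ids[0], skip_special_tokens=True)
```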
4.3. Retrieval experiments

As the validation set for optimizing the rankers, we use a held-out subset (10% of the training set). We optimize the rankers for each document representation. In the text below, we refer to noun phrase representations as Q NP and C NP for the query/candidate documents respectively, and to entity representations as Q Entities and C Entities. We use Precision@k, Recall@k and F-measure@k as evaluation metrics, following the COLIEE evaluation mechanism (see https://sites.ualberta.ca/~rabelo/COLIEE2020/). We select the best cut-off k for each method based on the validation set and use that cut-off on the test set.

BM25
We found k = 6 as the optimal cut-off for ranking with the whole text, and k = 4 for summarized input. For the optimization we searched the following grid: b in {0, 0.1, 0.2, ..., 1} and k1 in {0, 0.1, 0.2, ..., 3}. For BM25 with KLI the best parameters were b = 0.9 and k1 = 2.8. (We found b = 0.6, 0.8, 0.8 and k1 = 1.8, 1.4, 1.4 as the best parameters for BM25 with Q NP C NP, Q words C NP, and Q NP C words, respectively.)

LM
We found k = 6 as the optimal cut-off for all variants of LM JM. For the optimization, we searched the following values: lambda in {0, 0.1, 0.2, ..., 1}. (We found lambda = 0.1 as the best parameter for LM JM with KLI, and lambda = 0.1, 0.6, 0.4 as the best parameters for LM JM with Q entities D entities, Q words D entities, and Q entities D words, respectively.)

DRMM
We optimize the word2vec model and also tune the network configuration (e.g., the numbers of layers and hidden nodes) on the validation set. k = 6 is the optimal cut-off value. We trained six word2vec and fastText models for three configurations from the literature [18, 23, 24]. We also used the word2vec model pre-trained on Google News and the fastText model pre-trained on Wikipedia. We found that the word2vec model trained according to the DRMM configuration gave the best results. For the network configuration, we use a four-layer architecture throughout all experiments, i.e., one histogram input layer (30 nodes), two hidden layers in the feed-forward matching network (128 nodes for both layers), and one output layer (1 node) with the term gating network for the final matching score. We set the maximum query length to 70 tokens, as explained in Section 3.2.2.

Vanilla BERT ranker
We truncate the documents such that the concatenated query document (truncated at 100 words), candidate document, and the separator tokens do not exceed 512 tokens. We re-rank the top-30 BM25 results. Since R@30 is about 95% for BM25, our ranker can achieve up to 95% recall while still having a reasonable runtime. k = 6 was the optimal cut-off for the arXiv-LED summarizer queries, and k = 4 for the fine-tuned LED summarizer queries. We train each model for 100 epochs, each with 32 batches of 16 training pairs, with an initial learning rate of 3e-5, followed by a power-3 polynomial decay.

Vanilla Longformer ranker
The local window size is set to 512. We fine-tune the pre-trained Longformer using pairwise hinge loss. Positive and negative training documents are selected from the relevance judgments. We truncate the documents such that the sequence of the concatenated query document (summary), candidate document, and the separator tokens does not exceed 4,096 tokens. We again re-rank the top-30 BM25 results. The optimal cut-off for the fine-tuned LED summarizer queries was again k = 4; the cut-off for the arXiv-LED summarizer queries was selected in the same way on the validation set. We use the same training configuration as for Vanilla BERT.

Ensemble classifier
We used the Scikit-learn [25] library for training the ensemble classifiers and kept the hyperparameters at their default values. We used the classifier prediction (relevant/non-relevant) to obtain the returned set of documents that we evaluate. Note that the cut-off parameter k is not needed in this approach, because the final ranking only includes documents that are predicted relevant by the classifier.
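A minimal sketch of this score-combination ensemble with scikit-learn defaults is shown below. The feature values (one BM25 score and one Vanilla BERT score per query-candidate pair) and the labels are placeholders; the extraction of these scores from the trained rankers is not reproduced here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# each row holds the scores of the two rankers for one query-candidate pair (placeholder values)
X_train = np.array([[12.3, 0.91], [4.1, 0.12], [9.8, 0.75], [2.2, 0.05], [7.5, 0.60], [1.1, 0.02]])
y_train = np.array([1, 0, 1, 0, 1, 0])            # 1 = noticed (relevant) case
X_test = np.array([[11.0, 0.88], [3.0, 0.10]])

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM linear": SVC(kernel="linear"),
    "SVM RBF": SVC(kernel="rbf"),
    "MLP": MLPClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # the returned set is simply the candidates predicted relevant; no cut-off k is needed
    retrieved = np.flatnonzero(clf.predict(X_test) == 1)
    print(name, "->", retrieved)
```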
Table 2
Lexical retrieval results for the ranking of 200 candidate documents in COLIEE'20. Q refers to query content and C refers to candidate document content. SummaryQ means that the summary of the query document is used as the query. NP refers to the extracted noun phrases using NLTK. Q/C NP means that the extracted noun phrases from the query and candidate document are used as the query and candidate document content in the ranker's input.

Method            Extractor/Summarizer        P%      R%      F1%
BM25              original text               47.69   67.48   56.06
BM25 optimised    original text               57.31   59.15   58.21
BM25              KLI (1-gram)                58.85   62.28   60.51
BM25 optimised    KLI (1-gram)                67.00   61.17   63.95
BM25              summaryQ arXiv-LED          48.65   69.03   57.07
BM25 optimised    summaryQ arXiv-LED          54.20   64.91   59.07
BM25              summaryQ fine-tuned LED     49.72   68.59   57.65
BM25 optimised    summaryQ fine-tuned LED     55.71   63.18   59.21
BM25 optimised    Q entities & C words        62.05   50.62   55.75
BM25 optimised    Q words & C entities        48.77   58.89   53.35
BM25 optimised    Q NP & C NP                 55.77   58.05   56.88
LM JM             original text               46.54   62.25   53.26
LM JM             KLI (1-gram)                58.00   61.85   59.86
LM JM optimised   KLI (1-gram)                66.24   60.28   63.11
LM JM             arXiv-LED                   49.20   68.55   57.28
LM JM             fine-tuned LED              49.68   68.44   57.57
LM JM optimised   Q NP & C words              61.94   50.30   55.51
LM JM optimised   Q words & C NP              47.65   57.94   52.29
LM JM optimised   Q NP & C NP                 55.19   57.67   56.40

Table 3
Neural retrieval results for the ranking of 200 candidate documents in COLIEE'20. Q refers to query content and C refers to candidate document content. SummaryQ/SummaryQC means that the summary of the query document / of both the query and candidate document (generated by the fine-tuned LED) is used as the content in the ranker's input.

Method                          Extractor/Summarizer   P%      R%      F1%
DRMM                            original text          27.05   37.16   31.31
DRMM                            KLI (1-gram)           47.95   67.43   56.05
Vanilla BERT                    original text          36.92   50.16   42.53
Vanilla BERT summaryQ           arXiv-LED              40.26   57.02   47.19
Vanilla BERT summaryQ           fine-tuned LED         49.23   62.22   54.96
Vanilla BERT summaryQC          fine-tuned LED         29.10   45.50   35.49
Vanilla Legal BERT summaryQ     fine-tuned LED         46.77   61.54   53.14
Vanilla Legal BERT summaryQC    fine-tuned LED         40.26   58.34   47.64
Longformer                      original text          30.38   39.00   34.15
Longformer summaryQ             arXiv-LED              35.10   43.40   38.81
Longformer summaryQ             fine-tuned LED         39.04   44.20   41.46

Table 4
Results of ensemble classifiers and comparison with the Cyber team, the best team on COLIEE'20 [7]. Precision and recall are not reported for the COLIEE'20 best result.

Method          Features                          P%      R%      F1%
Naive Bayes     Best BM25 & best Vanilla BERT     72.95   83.66   76.02
SVM linear      Best BM25 & best Vanilla BERT     71.74   83.77   74.61
SVM RBF         Best BM25 & best Vanilla BERT     70.11   83.24   72.37
MLP             Best BM25 & best Vanilla BERT     70.68   83.46   73.20
Cyber (first team in COLIEE'20)                   -       -       67.74

Table 5
Retrieval results for the ranking of documents of COLIEE'21. Precision and recall are not reported for the COLIEE'21 best result. k is the optimal cut-off value for each method on this data.

Method            Extractor/Summarizer        P%      R%      F1%
BM25              original text (k = 7)       7.77    19.59   11.13
BM25              KLI (1-gram) (k = 6)        9.83    19.80   13.13
BM25 optimised    KLI (1-gram) (k = 4)        17.00   25.36   20.35
Vanilla BERT      fine-tuned LED (k = 7)      2.11    5.46    3.04
TLIR (first team in COLIEE'21)                -       -       19.17

4.4. Retrieval results

COLIEE'20 results
The ranking results for the lexical and neural ranking models are shown in Table 2 and Table 3, respectively. The best lexical ranker is BM25, but LM JM with KLI is very close. The best neural ranker in terms of recall and F1 is DRMM. The best single ranker overall is BM25, with an F1 score of 63.95%. The highest precision is obtained by BM25 with KLI queries, and the highest recall by BM25 with arXiv-LED summaries. This indicates that lexical matching is important for this dataset. Table 2 also shows the (mostly positive) effect of optimizing the parameters of BM25, an effort that is not taken by most of the COLIEE participants.

The results of the ensemble models are shown in Table 4. With our ensemble models we improve over the best benchmark result (the Cyber team) by a large margin. The best ensemble model in terms of F1 on the validation set is a combination of the best BM25 ranker (BM25 optimised + KLI) and the second-best neural ranker, Vanilla BERT (Vanilla BERT + SummaryQ, fine-tuned LED). This indicates that BERT can add more to the combination with BM25 than the best neural model DRMM can.

COLIEE'21 results
For evaluating the generalizability of our results, we evaluated the best methods on the COLIEE'21 data. As explained in Section 4.1, the COLIEE'21 task is more difficult than the COLIEE'20 task because it requires retrieval from a full document collection instead of re-ranking 200 documents. We optimised the BM25 parameters and the cut-off value k on the COLIEE'21 validation set. As shown in Table 5, we beat the state-of-the-art result (the TLIR team) with the optimized BM25 ranker. The poor result of Vanilla BERT on COLIEE'21 shows that neural retrieval faces more challenges when ranking the whole collection than when re-ranking the top-200. We also applied ensemble classifiers combining BM25 and Vanilla BERT, but they could not improve the effectiveness; we suppose that this is caused by the lower performance of Vanilla BERT on COLIEE'21.

5. Discussion

The effect of summarization
The results show that summarizing the query document improves all rankers. This holds for the KLI term extraction, noun phrase extraction, and the LED summarizer. This shows the importance of summarization for making the query documents shorter. Our results show that the best summarizing methods for neural ranking and lexical ranking are LED (fine-tuned) and keyword extraction (KLI), respectively. However, summarizing the candidate documents does not improve the neural ranking performance. Another observation is that fine-tuning the summarizer improves the ranking in terms of F-measure for all rankers. We see the largest effect from summarization for Vanilla BERT.

Analysis of unexpected results
One unexpected result is that our Longformer ranker does not outperform the Vanilla BERT ranker. We speculate that this is because Longformer receives more tokens as input during learning, which makes it more difficult to estimate the relevance between document and query, while the size of the training set is relatively small: in COLIEE'20, we have only 2,680 relevant labels, and Longformer could not converge during training – it did not find the optimal loss after 100 epochs. For future work, we will further pre-train Longformer on legal documents. This requires a GPU with 32 GB of RAM, which is not easily accessible.

Our results also show that DRMM could not beat BM25 for case law retrieval. Some prior work has also indicated that for some datasets DRMM is close to BM25 in quality and that sometimes BM25 works better than DRMM [26, 27, 28], especially when BM25 is properly optimized.

The third unexpected result is that the best Vanilla Legal BERT (summaryQ) does not outperform the best Vanilla BERT model. We suppose this can be related to the similarity in language between the COLIEE cases and part of the data that BERT Base was trained on (Wikipedia and thousands of books), because the cases in COLIEE contain stories of applicants' lives, and these make up the larger part of each document.

Analysis of classifier weights
As suggested in [29], we interpret the importance of the Vanilla BERT feature in the ensemble classifiers based on the coefficient values in the fitted linear SVM. The coefficient for Vanilla BERT (0.56) is higher than the coefficient for BM25 (0.43).
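A sketch of how such coefficients can be read from a fitted linear-kernel SVM is shown below. The placeholder data and the implicit assumption that the two score features are on comparable scales are ours, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC

# placeholder feature matrix: columns are [bm25_score, vanilla_bert_score]
X = np.array([[12.3, 0.91], [4.1, 0.12], [9.8, 0.75], [2.2, 0.05], [7.5, 0.60], [1.1, 0.02]])
y = np.array([1, 0, 1, 0, 1, 0])

svm = SVC(kernel="linear").fit(X, y)
bm25_coef, bert_coef = svm.coef_[0]               # one weight per score feature
print(f"BM25 coefficient: {bm25_coef:.2f}, Vanilla BERT coefficient: {bert_coef:.2f}")
```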
Analysis of BM25 hyperparameters
The optimal value for b in the literature is between 0.3 and 0.9 [30, 31, 32], and we found b = 0.9 for the optimised BM25 with KLI. There are documents in COLIEE that, because of their length, contain multiple topics, and it has been suggested before that documents that include a variety of topics benefit from a larger b, so that topics unrelated to a user's search are penalized (https://www.elastic.co/blog/practical-bm25). The normal range for k1 is between 0 and 3, and for long documents that contain diverse information, k1 should tend toward larger values [32]. We found k1 = 2.8 as the optimal value for BM25 with KLI, which makes sense given the long documents in COLIEE.

Using noun phrases or named entities
We experimented with noun phrases and named entities as document representations, inspired by Shao et al. [4]. Our experiments show that the effectiveness of using named entities instead of the original content is much lower than that of using noun phrases: the F1 score for BM25 optimised + Q entities C entities is 22.71%, while the F1 for BM25 optimised + Q NP C NP is 56.88%. Although the use of named entities was suggested in prior work, they do not play an important role in case law documents, at least not so much that using them as document representations leads to effective retrieval.

Future work
Some prior work has obtained good results with passage-level analysis for long document ranking [24, 33]. We think it is a promising direction for future work to combine these passage-level methods with document-level methods to design a more effective legal case retrieval system. One challenge is that this approach is computationally expensive, since each paragraph of the query document needs to be compared with each paragraph of the candidate documents; we have query documents with up to 1,139 paragraphs in COLIEE. For future work we will focus on combining lexical document retrieval with efficient paragraph-level retrieval. We will also evaluate whether we can use COGLTX [5] as a pre-processing step to recognize the important sentences in the query and candidate cases without the limitation on length. Inspired by [33], another direction is to work on neural models that perform full ranking instead of re-ranking for legal case retrieval, because in COLIEE'21 there are 4,415 candidates per query (the whole collection).

6. Conclusions

In this paper, we addressed the challenge of long documents in case law retrieval. We experimented with the Longformer-Encoder-Decoder (LED) for abstractive summarization of case law documents in the COLIEE benchmark data. The fine-tuned LED outperforms the 2019 baseline in terms of F-measure for all three ROUGE metrics.

Second, we implemented a pairwise Longformer ranker for long documents and compared it to four other ranking models on the COLIEE'20 benchmark data: BM25, LM, DRMM, and Vanilla BERT. We found, however, that BM25 outperforms all neural rankers on this task and that the Longformer ranker is outperformed by BERT.

Third, we evaluated the merits of query document summarization for the BM25, BERT and Longformer rankers. We found that summarizing the query document improves the quality of each of the rankers compared to using the original document. For BM25, we also compared the LED summary to statistical query term extraction (KLI), and we found that summarization gives a higher recall, but term extraction gives a higher precision.

Fourth, we showed the effectiveness of combining an optimised BM25 ranker and a BERT ranker, outperforming the state of the art on two benchmark sets. We conclude that retrieval for long query documents in legal case retrieval can be helped by optimising lexical models, by automatic summarization, and by a combination of both.

References

[1] S. A. Lastres, Rebooting legal research in a digital age, 2015.
[2] J. Rabelo, M.-Y. Kim, R. Goebel, M. Yoshioka, Y. Kano, K. Satoh, COLIEE 2020: Methods for legal document retrieval and entailment, 2020. URL: https://sites.ualberta.ca/~rabelo/COLIEE2021/COLIEE_2020_summary.pdf.
[3] D. Locke, G. Zuccon, H. Scells, Automatic query generation from legal texts for case law retrieval, in: Asia Information Retrieval Symposium, Springer, 2017, pp. 181–193.
[4] Y. Shao, B. Liu, J. Mao, Y. Liu, M. Zhang, S. Ma, THUIR@COLIEE-2020: Leveraging semantic understanding and exact matching for legal case retrieval and entailment, arXiv preprint arXiv:2012.13102 (2020).
[5] V. Tran, M. Le Nguyen, S. Tojo, K. Satoh, Encoded summarization: summarizing documents into continuous vector space for legal case retrieval, Artificial Intelligence and Law 28 (2020) 441–467.
[6] J. Rossi, E. Kanoulas, Legal information retrieval with generalized language models, in: Proceedings of the 6th Competition on Legal Information Extraction/Entailment (COLIEE), 2019.
[7] H. Westermann, J. Šavelka, K. Benyekhlef, Paragraph similarity scoring and fine-tuned BERT for legal information retrieval and entailment, in: COLIEE 2020, 2020.
[8] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019).
[9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[10] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).
[11] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, CEDR: Contextualized embeddings for document ranking, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1101–1104.
[12] T. Tomokiyo, M. Hurst, A language model approach to keyphrase extraction, in: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003, pp. 33–40.
[13] J. Rabelo, M.-Y. Kim, R. Goebel, M. Yoshioka, Y. Kano, K. Satoh, A summary of the COLIEE 2019 competition, in: JSAI International Symposium on Artificial Intelligence, Springer, 2019, pp. 34–49.
[14] V. Tran, M. L. Nguyen, K. Satoh, Building legal case retrieval systems with lexical matching and summarization using a pre-trained phrase scoring model, in: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, 2019, pp. 275–282.
[15] Y. Shao, J. Mao, Y. Liu, W. Ma, K. Satoh, M. Zhang, S. Ma, BERT-PLI: Modeling paragraph-level interactions for legal case retrieval, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), 2020.
[16] A. Kanapala, S. Pal, R. Pamula, Text summarization from legal documents: a survey, Artificial Intelligence Review 51 (2019) 371–402.
[17] N. Van de Luijtgaarden, Automatic Summarization of Legal Text, Master's thesis, Utrecht University, 2019. URL: https://dspace.library.uu.nl/handle/1874/384802.
[18] J. Guo, Y. Fan, Q. Ai, W. B. Croft, A deep relevance matching model for ad-hoc retrieval, in: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016, pp. 55–64.
[19] I. Sekulić, A. Soleimani, M. Aliannejadi, F. Crestani, Longformer for MS MARCO document re-ranking task, arXiv preprint arXiv:2009.09392 (2020).
[20] S. Hofstätter, H. Zamani, B. Mitra, N. Craswell, A. Hanbury, Local self-attention over long text for efficient document retrieval, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2021–2024.
[21] S. Verberne, M. Sappelli, D. Hiemstra, W. Kraaij, Evaluation and analysis of term scoring methods for term extraction, Information Retrieval Journal 19 (2016) 510–545.
[22] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, arXiv preprint arXiv:2010.02559 (2020).
[23] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[24] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[26] J. Frej, P. Mulhem, D. Schwab, J.-P. Chevallet, Learning term discrimination, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1993–1996.
[27] I. Chios, S. Verberne, Helping results assessment by adding explainable elements to the deep relevance matching model, 2020. URL: https://ears2020.github.io/accept_papers/2.pdf.
[28] J. Frej, D. Schwab, J.-P. Chevallet, MLWIKIR: A Python toolkit for building large-scale Wikipedia-based information retrieval datasets in Chinese, English, French, Italian, Japanese, Spanish and more.
[29] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[30] M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, C. Burges, Optimisation methods for ranking functions with multiple parameters, in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 585–593.
[31] A. Trotman, A. Puurula, B. Burgess, Improvements to BM25 and language models examined, in: Proceedings of the 2014 Australasian Document Computing Symposium, 2014, pp. 58–65.
[32] A. Lipani, M. Lupu, A. Hanbury, A. Aizawa, Verboseness fission for BM25 document length normalization, in: Proceedings of the 2015 International Conference on the Theory of Information Retrieval, 2015, pp. 385–388.
[33] H. Zamani, M. Dehghani, W. B. Croft, E. Learned-Miller, J. Kamps, From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 497–506.