Combining lexical and neural retrieval with longformer-based summarization for effective case law retrieval

Arian Askari (1), Suzan Verberne (1)
(1) Leiden Institute of Advanced Computer Science, Leiden University
a.askari@liacs.leidenuniv.nl (A. Askari); s.verberne@liacs.leidenuniv.nl (S. Verberne)

Abstract
In this paper, we combine lexical and neural ranking models for case law retrieval. In this task, the query is a full case document, and the candidate documents are prior cases that are potentially relevant to the current case. Most documents are longer than 1024 tokens, which makes retrieval and classification with Transformer-based models problematic. We create shorter query documents with different methods: term extraction, noun phrase extraction, entity extraction, and automatic summarization using the Longformer-Encoder-Decoder (LED). We then combine the summaries with five different ranking models: a BM25 ranker, statistical language modelling, the Deep Relevance Matching Model (DRMM), a Vanilla BERT ranker, and a Longformer ranker. We optimised all models and combined the best lexical ranker with neural retrieval models using different ensemble classifiers. We evaluate our methods on the retrieval benchmarks from COLIEE'20 and COLIEE'21, and we beat state-of-the-art models for case law retrieval on both benchmark sets. Our experiments show the importance of tuning lexical retrieval methods, summarizing query documents, and combining lexical and neural models into one ranker for effective case law retrieval. In addition, training and optimizing our rankers is much faster than for passage-level retrieval models (a few hours compared to several days of training).

Keywords: Legal Information Retrieval, Query Summarization

1. Introduction

In countries with common law systems, finding supporting precedents for a new case is vital for a lawyer to fulfill their responsibilities to the court. However, with the large amount of digital legal records – the number of filings in the U.S. district courts for total cases and criminal defendants was 544,460 in 2020 (https://www.uscourts.gov/statistics-reports/judicial-business-2020) – it takes a significant amount of time for legal professionals to scan for specific cases and retrieve the relevant sections manually. Studies have shown that attorneys spend approximately 15 hours per week seeking case law [1].

This workload necessitates information retrieval (IR) systems specifically designed for the legal domain. The Competition on Legal Information Extraction/Entailment (COLIEE) is a workshop that has been organized since 2014 as a series of evaluation competitions related to case law [2]. COLIEE defines four tasks. In this paper, we address legal case retrieval (Task 1). One of the challenges of case law retrieval in COLIEE'20 is that the input is a long case document with a median length of 2,815 words instead of a keyword query. Four approaches to this problem exist. The first is the use of unsupervised keyword extraction methods to create short queries from the long query document [3]. A variant is the unsupervised extraction of phrases or entities as query terms [4]. The second approach, proposed by Tran et al. [5], is to train a supervised phrase scoring model for n-gram phrases to select the phrases that are semantically closest to an expert-written summary. The third approach, proposed by Rossi and Kanoulas [6], is to use document summarization methods for creating shorter query documents. The fourth approach, successfully employed by Shao et al. [4] and Westermann et al. [7], is to analyze the documents on the level of individual paragraphs and then aggregate the paragraph scores into a document ranking.
In this paper, we use automatic summarization for creating query documents. We experiment with term extraction, noun phrase extraction, and supervised text summarizers. As opposed to prior work, we approach the task as an abstractive summarization problem. The current state of the art in abstractive summarization is the use of Transformer models [8, 9]. However, the input of the available pre-trained models of these architectures is limited to 1024 tokens, and the majority of case law documents in our collection is longer than that. Beltagy et al. [10] proposed the Longformer-Encoder-Decoder (LED), which is a Transformer variant that supports much longer inputs. In this paper, we evaluate the effectiveness of LED for case law retrieval.

We evaluate summarization with multiple ranking models: probabilistic lexical ranking (BM25), statistical language modelling, the Deep Relevance Matching Model (DRMM), and Transformer-based architectures including the Vanilla BERT model from Contextualized Embeddings for Document Ranking (CEDR) [11]. Although we summarize the query documents, the lengths of the documents in the retrieval collection still cause problems for Transformer-based ranking architectures, which are limited to 512 tokens as the input length. To solve that problem, we implemented a Vanilla Longformer ranker [10], with 4096 tokens as the input length, in a pair-wise ranking setting similar to Vanilla BERT. After we have optimized each individual ranker, we experiment with ensemble methods that combine lexical rankers with neural rankers.

Our contributions are four-fold: (1) we deliver a fine-tuned Longformer-Encoder-Decoder (LED) for abstractive summarization of legal documents; (2) we deliver a pairwise Longformer ranker for long documents; (3) we show that summarizing query documents with LED improves all ranking models; (4) we show that the combination of a lexical ranker and a Vanilla BERT ranker in a simple ensemble classifier outperforms all baselines on COLIEE'20, and that an optimized BM25 ranker with keyword queries beats the state-of-the-art models on COLIEE'21.
2. Related Work

2.1. Case law retrieval

Locke et al. [3] investigate query generation from legal decisions using unsupervised keyword extraction models. They find that the best performing model is Kullback-Leibler divergence for informativeness (KLI) [12] and that the automatically generated queries were more effective than the average Boolean queries from experts.

The majority of the work addressing case law retrieval takes place in the context of COLIEE, the Competition on Legal Information Extraction and Entailment [13, 2]. Rossi and Kanoulas [6] proposed a pairwise ranking model based on BERT. They apply automatic summarization using TextRank to make the input document length suitable for use in the BERT-based ranker. The most successful team in COLIEE 2019 is Tran et al. [14, 5] ('JNLP'). They train a phrase scoring model that extracts n-gram phrases (n >= 5) to summarize the documents. The weights are calculated based on the phrase scoring framework that was trained on COLIEE'18 summaries. The authors achieved the state-of-the-art result on COLIEE'19, thereby showing the importance of query document summarization.

In COLIEE'20, the two best-performing teams use paragraph-level analyses to cope with the challenge of long documents. Shao et al. [4] ('TLIR') participate with their method BERT-PLI [15], which models paragraph-level interactions. They combine BERT-PLI with lexical matching features using a word-entity duet model. The features are different lexical rankers (BM25, probabilistic language modelling) – without optimization – on the full document content, and entities extracted by NLTK. They reach competitive results. The method by Westermann et al. [7] ('cyber') selects the top-30 candidate documents using a paragraph similarity score based on the Universal Sentence Encoder, and then applies an SVM model to the TF-IDF representations of the query document and the candidate documents.

We use the supervised phrase-based summarization of Tran et al. [14, 5] as comparison for our summarization task, and the highest F-score obtained in COLIEE'20 by Westermann et al. [7] as comparison for our retrieval task.

2.2. Summarization of long documents

Kanapala et al. [16] and Van de Luijtgaarden [17] give an extensive overview of research on legal document summarization up to 2019. Here we focus on recent abstractive summarization models.

Pre-trained encoder-decoder Transformer models (e.g. BART [8] and T5 [9]) have achieved strong results in abstractive summarization tasks. However, pre-trained models of these architectures are limited to texts that are shorter than 1024 tokens. Legal documents are commonly longer than that; 72% of the query documents in COLIEE'20 are. Recently, Beltagy et al. [10] proposed the Longformer-Encoder-Decoder (LED), which is a Transformer variant that supports sequence-to-sequence tasks for longer documents (up to 16k tokens). The authors show that LED outperforms the state-of-the-art models on the arXiv dataset. In this paper, we evaluate the effectiveness of LED for case law retrieval.

2.3. Transformer-based document ranking

The typical approach in neural IR is two-step retrieval, where a first set of documents is retrieved using a traditional ranker (e.g. BM25) and those documents are re-ranked by a neural model that is trained on the relevance assessments [18, 11, 19].

MacAvaney et al. [11] propose a joint approach that integrates the classification vector of BERT into existing neural models like the Deep Relevance Matching Model (DRMM), resulting in Contextualized Embeddings for Document Ranking (CEDR). They fine-tune a pre-trained BERT model with a linear combination layer stacked atop the [CLS] token as the Vanilla BERT ranker, with pairwise cross-entropy loss and the Adam optimizer. The authors demonstrate that Vanilla BERT and CEDR outperform the state-of-the-art baselines in ad-hoc ranking.
However, local-interaction neural ranking architectures like DRMM are not scalable to long documents, or they need heavy interaction between word pairs in the query and document. Therefore, we will compare our method to the Vanilla BERT ranker.

More recent work has addressed the challenges of ranking for long documents. Hofstaetter et al. [20] propose a local attention Transformer model that uses a moving window over the document terms; each term attends only to other terms in the same window. They obtain results significantly better than other state-of-the-art models on the TREC Deep Learning track. Sekulic et al. [19] take a full-document approach by training a Longformer model for ranking in ad-hoc retrieval. The results they report on the MS MARCO set are low compared to the leaderboard results. Our Longformer ranker is similar to [19], but instead of implementing the ranker as a one-versus-all classifier, we train it in a pair-wise setting, and we evaluate it for case law retrieval.

3. Methods

3.1. Summarization

We experiment with three approaches to creating shorter query documents: (1) term extraction, (2) noun phrase or entity extraction, and (3) abstractive summarization.

Summarization through term extraction
We adopted Kullback-Leibler divergence for Informativeness (KLI), similar to Locke et al. [3], in the implementation of Verberne et al. [21]. For each term t in a query document, we computed the KLI score:

KLI(t) = P(t|D) × log( P(t|D) / P(t|C) )     (1)

where P(t|D) is the probability of t in the query document D and P(t|C) is the probability of t in a background language model. We use all candidate documents as the background collection to compute P(t|C). We only consider unigrams as terms and we lowercase them. (We tried to extract n-gram phrases (2 < n <= 5) with gamma = 0.8 but obtained the best results with unigrams.) We then selected the top 10% of the total number of terms in the document, ranked by KLI score, as the query.
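To make the term-selection step concrete, the following is a minimal Python sketch of the KLI scoring of Equation (1). The whitespace tokenization, the add-one smoothing of the background model, and the function name are illustrative assumptions; they are not the implementation of Verberne et al. [21] used in the paper.

```python
from collections import Counter
import math

def kli_query_terms(query_doc, candidate_docs, top_fraction=0.10):
    """Select query terms by KLI(t) = P(t|D) * log(P(t|D) / P(t|C))."""
    doc_counts = Counter(query_doc.lower().split())            # term counts in the query document D
    bg_counts = Counter()                                      # background collection C = all candidate documents
    for doc in candidate_docs:
        bg_counts.update(doc.lower().split())

    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab_size = len(bg_counts) + 1

    scores = {}
    for term, freq in doc_counts.items():
        p_d = freq / doc_total                                 # P(t|D)
        p_c = (bg_counts[term] + 1) / (bg_total + vocab_size)  # add-one smoothed P(t|C)
        scores[term] = p_d * math.log(p_d / p_c)

    k = max(1, int(top_fraction * len(scores)))                # keep the top 10% of the document's terms
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```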
Entity and phrase extraction
Inspired by Shao et al. [4], we used NLTK (https://www.nltk.org/api/nltk.chunk.html?#nltk.chunk.util.tree2conlltags and https://www.nltk.org/api/nltk.chunk.html#module-nltk.chunk.named_entity) to extract noun phrases and named entities from the content of query documents and candidate documents, and used the extracted strings as the document representations (see Section 4.4 for the combinations we experimented with).

Abstractive summarization
We experimented with the pre-trained LED model (https://huggingface.co/allenai/led-large-16384-arxiv) of Beltagy et al. [10], which can process documents of up to 16k tokens as input. We also fine-tuned LED on the COLIEE'18 dataset, in which more than 80% of documents have summaries (in COLIEE'19 and COLIEE'20, the large majority – 82% – of the candidate cases do not have a summary). After removing duplicates, 6,257 unique documents are left for which a summary is available. For evaluation purposes, we trained the model in a k-fold cross-validation setting (k = 10) for one epoch per fold, each with batch size 1. We kept the other hyperparameters (optimizer, dropout, weight decay) identical to [10] and set the global attention on the first (<s>) token. We only summarized query documents with LED for the lexical rankers, since these have no limitation on input length. On the other hand, since Transformer-based neural models are limited in input length for the candidate document content, we also experiment with summarizing candidate documents besides query documents for the best Transformer-based neural model in our experiments.

3.2. Ranking models

As introduced in Section 3.2.1, we rank the 200 candidate documents for each query document in COLIEE'20 with multiple retrieval models. We optimise the hyperparameters of each method on a validation set (see Section 4.3). In the following, we introduce each ranker that we use for legal case retrieval.

3.2.1. Lexical rankers

BM25
We indexed the COLIEE'20 collection with Elasticsearch. The collection has 200 candidate documents for each query that need to be ranked. We used BM25 with the default parameter values k1 = 1.2 and b = 0.75, as well as with optimized hyperparameter values.

Language Modelling
We used the built-in similarity functions of Elasticsearch for the implementation of Language Modelling (LM) with two different smoothing methods: Dirichlet smoothing and Jelinek-Mercer (JM) smoothing. We only report the results for JM smoothing, since we obtained similar results with these two smoothing methods. We also optimised the hyperparameter value (lambda) for Language Modelling with Jelinek-Mercer smoothing (LM JM).
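As an illustration of the kind of index configuration behind these lexical rankers, the sketch below defines a BM25 similarity and an LM Jelinek-Mercer similarity in Elasticsearch. The index and field names and the elasticsearch-py 7.x-style client call are assumptions; the parameter values b = 0.9, k1 = 2.8 and lambda = 0.1 are the optimised values reported in Section 4.3, and the indexing pipeline actually used in the paper may differ.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="coliee2020",                                  # assumed index name
    body={
        "settings": {
            "index": {
                "similarity": {
                    "bm25_optimised": {"type": "BM25", "k1": 2.8, "b": 0.9},
                    "lm_jm": {"type": "LMJelinekMercer", "lambda": 0.1},
                }
            }
        },
        "mappings": {
            "properties": {
                # the same text is indexed twice so that both similarities can be queried
                "text_bm25": {"type": "text", "similarity": "bm25_optimised"},
                "text_lmjm": {"type": "text", "similarity": "lm_jm"},
            }
        },
    },
)
```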
3.2.2. Neural rankers

Deep Relevance Matching Model (DRMM)
As the DRMM architecture [18] is based on a local interaction input matrix, only a limited query length is possible. We took the top 10% of query terms sorted by the KLI score for each query document. Then, we selected the average resulting number of terms, 70, as the maximum query length. Thus, we used the top-70 query terms as the query in DRMM. For calculating the cosine similarity in the local interaction matrix, we trained word2vec on the texts of the query candidates, as suggested by the DRMM authors. Furthermore, we optimize the network configuration of DRMM to find the best combination of layers and neurons for legal case retrieval on COLIEE (see details in Section 4.3).

Vanilla BERT ranker
For Vanilla BERT, we fine-tuned a pre-trained BERT model (BERT-Base, Uncased) with a linear combination layer stacked atop the classifier [CLS] token on the COLIEE dataset, in a pairwise cross-entropy loss setting using the Adam optimizer. We used the implementation of MacAvaney et al. [11] (CEDR). We represent the query as sentence A and the document as sentence B in the BERT input:

"[CLS] query document [SEP] candidate document [SEP]"

We truncate the query and candidate document text since the BERT tokenizer is limited to 512 tokens.

Vanilla Legal BERT ranker
For Vanilla Legal BERT, we used LEGAL-BERT [22], which was pre-trained on legal data.

Vanilla Longformer ranker
Since the input length is limited in Vanilla BERT, we implemented the Vanilla Longformer in CEDR [11] as a ranker which can receive 4096 tokens instead of 512, and therefore has a better chance of working effectively than Vanilla BERT. In Longformer, the [CLS] and [SEP] tokens are replaced by the tokens <s> and </s>, respectively. As suggested in the Longformer paper [10], we calculate the loss based on the <s> token, with the addition of global attention to the <s> token. Inspired by Sekulic et al. [19], we feed the query document as sequence A and the candidate document as sequence B to the tokenizer, which yields the following input to the model:

"<s> query document </s> candidate document </s>"

Our code is integrated with Vanilla BERT in CEDR [11] and is available for future work (https://anonymous.4open.science/r/vanilla_longformer-D552/README.md).
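The pairwise input construction for both Transformer rankers can be sketched with Hugging Face tokenizers as below. The checkpoint names and the 'longest_first' truncation strategy are assumptions for illustration; the paper's CEDR-based implementation handles truncation and batching differently.

```python
import torch
from transformers import BertTokenizerFast, LongformerTokenizerFast

bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
long_tok = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

query_summary = "summary of the query case ..."
candidate_doc = "full text of the candidate case ..."

# BERT: [CLS] query document [SEP] candidate document [SEP], truncated to 512 tokens
bert_inputs = bert_tok(query_summary, candidate_doc,
                       truncation="longest_first", max_length=512, return_tensors="pt")

# Longformer: <s> query </s></s> candidate </s> (RoBERTa-style pair encoding), truncated to 4096 tokens
long_inputs = long_tok(query_summary, candidate_doc,
                       truncation="longest_first", max_length=4096, return_tensors="pt")

# global attention on the first (<s>) token, following the Longformer paper
global_attention_mask = torch.zeros_like(long_inputs["input_ids"])
global_attention_mask[:, 0] = 1
long_inputs["global_attention_mask"] = global_attention_mask
```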
3.2.3. Ensemble models

For combining the advantages of neural rankers and lexical rankers in one integrated system, we train ensemble models that take the scores of multiple rankers as features. We experiment with four different classifiers for this purpose: SVM with a linear kernel, SVM with an RBF kernel, Naive Bayes, and Multi-Layer Perceptron (MLP). We experiment with different combinations of rankers' scores to find the combination of rankers that yields the most effective ranking.

4. Experiments and results

4.1. Data

For our experimental evaluation, we work with data from the COLIEE competitions in 2018, 2020, and 2021. The COLIEE'18 data contains human-written summaries of the case documents, which we use for training and evaluating our summarization models (Section 4.2). For our retrieval experiments (Section 4.3), we use data from COLIEE'20 and '21 (https://sites.ualberta.ca/~rabelo/COLIEE2020/ and https://sites.ualberta.ca/~rabelo/COLIEE2021/). The Federal Court of Canada provided case law documents with metadata for Task 1. The metadata contains references to the noticed cases, which are the gold relevance labels for the query document. In COLIEE'20, there is a pool of 200 candidates for each query document and the competitors should re-rank a limited number of documents per query. The pool of candidates includes the noticed cases and non-relevant candidates, which are selected randomly. In contrast, in COLIEE'21, the whole collection should be considered per query, without a pool of candidates. This difference makes Task 1 in COLIEE'21 more difficult than in COLIEE'20.

We use the COLIEE'20 data to evaluate all ranking models and ensembles described in Section 3. In the COLIEE'20 data, there are 520 query documents in the train set and 130 in the test set, with 104,000 candidate documents in the train set and 26,000 in the test set. The average length of the documents in the test set is 3,232 words, with outliers up to 10,827 words. After we have found the best-performing rankers and ensemble, we evaluate those on the COLIEE'21 data and compare the results to the best results reported in the competition. In the COLIEE'21 data, there are 650 query documents in the train set and 250 in the test set, with 4,415 documents as candidate documents for both the train set and the test set. The average length of the candidate documents is 1,274 words, with outliers up to 76,818 words.

Table 1
Summarization results in terms of ROUGE for COLIEE'18. Summary length is 10% of the original text.

Model                           ROUGE-1 (Pre/Rec/F1)     ROUGE-2 (Pre/Rec/F1)     ROUGE-SU6 (Pre/Rec/F1)
JNLP [5]                        0.482 / 0.409 / 0.405    0.186 / 0.152 / 0.152    0.258 / 0.199 / 0.167
Pre-trained LED on arXiv [10]   0.634 / 0.164 / 0.260    0.299 / 0.072 / 0.116    0.332 / 0.078 / 0.127
Fine-tuned LED                  0.620 / 0.304 / 0.408    0.295 / 0.138 / 0.188    0.319 / 0.147 / 0.201

4.2. Summarization experiments

We set the local attention window size to 512 tokens. To limit memory use, we use gradient checkpointing and set the input size in training to 8,192 tokens, which covers more than 86% of COLIEE'18 documents completely (the longer documents are truncated at 8,192 tokens). We set the maximum length for generating a summary of an unseen document to 10% of the length of the original text.

We compare our fine-tuned LED to the pre-trained LED [10] and to the summarizer of Tran et al. [5] (JNLP) on the COLIEE'18 data. Table 1 shows that our fine-tuned summarizer outperforms the baseline in terms of F-measure for ROUGE-1, ROUGE-2 and ROUGE-SU6 on the COLIEE'18 dataset. The pre-trained LED summarizer obtains the highest precision-ROUGE scores, and the JNLP baseline the highest recall-ROUGE scores.
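A minimal sketch of summary generation with the pre-trained LED checkpoint is shown below. The 8,192-token input limit, the global attention on the first token, and the summary length of roughly 10% of the input follow the description above; the beam size is an assumption, and the fine-tuning loop (10-fold cross-validation, batch size 1) is omitted.

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tok = AutoTokenizer.from_pretrained("allenai/led-large-16384-arxiv")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv")
model.gradient_checkpointing_enable()            # limits memory use during fine-tuning

def summarize(case_text):
    inputs = tok(case_text, truncation=True, max_length=8192, return_tensors="pt")
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1              # global attention on the first token
    target_len = max(32, inputs["input_ids"].shape[1] // 10)   # ~10% of the input length
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        max_length=target_len,
        num_beams=4,                             # assumed decoding setting
    )
    return tok.decode(summary_ids[0], skip_special_tokens=True)
```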
4.3. Retrieval experiments

As the validation set for optimizing the rankers, we use a held-out subset (10% of the training set). We optimize the rankers for each document representation. In the text below, we refer to noun phrase representations as Q NP and C NP for the query/candidate documents respectively, and to entity representations as Q Entities and C Entities. We use Precision@k, Recall@k and F-measure@k as evaluation metrics, following the COLIEE evaluation mechanism (see https://sites.ualberta.ca/~rabelo/COLIEE2020/). We select the best cut-off k for each method based on the validation set and use that cut-off on the test set.

BM25
We found k = 6 as the optimal cut-off for ranking with the whole text, and k = 4 for summarized input. For the optimization we searched the following grid: b in {0, 0.1, 0.2, ..., 1} and k1 in {0, 0.1, 0.2, ..., 3}. For BM25 with KLI the best parameters were b = 0.9 and k1 = 2.8. (We found b = 0.6, 0.8, 0.8 and k1 = 1.8, 1.4, 1.4 as the best parameters for BM25 with Q NP C NP, Q words C NP, and Q NP C words, respectively.)

LM
We found k = 6 as the optimal cut-off for all variants of LM JM. For the optimization, we searched the following values: lambda in {0, 0.1, 0.2, ..., 1}. (We found lambda = 0.1 as the best parameter for LM JM with KLI, and lambda = 0.1, 0.6, 0.4 as the best parameters for LM JM with Q entities D entities, Q words D entities, and Q entities D words, respectively.)

DRMM
We optimize the word2vec model and also tune the network configuration (e.g., the numbers of layers and hidden nodes) on the validation set. k = 6 is the optimal cut-off value. We trained six word2vec and fastText models for three configurations from the literature [18, 23, 24]. We also used the word2vec model pre-trained on Google News and the fastText model pre-trained on Wikipedia. We found that the word2vec model trained according to the DRMM configuration gave the best results. For the network configuration, we use a four-layer architecture throughout all experiments, i.e., one histogram input layer (30 nodes), two hidden layers in the feed-forward matching network (128 nodes for both layers), and one output layer (1 node) with the term gating network for the final matching score. We set the maximum query length to 70 tokens, as explained in Section 3.2.2.

Vanilla BERT ranker
We truncate the documents such that the concatenated query document (truncated at 100 words), candidate document, and the separator tokens do not exceed 512 tokens. We re-rank the top-30 BM25 results. Since R@30 is about 95% for BM25, our ranker can achieve up to 95% recall while still having a reasonable runtime. k = 6 was the optimal cut-off for the arXiv-LED summarizer queries, and k = 4 for the fine-tuned LED summarizer queries. We train each model for 100 epochs, each with 32 batches of 16 training pairs, with an initial learning rate of 3e-5, followed by a power-3 polynomial decay.

Vanilla Longformer ranker
The local window size is set to 512. We fine-tune the pre-trained Longformer using pairwise hinge loss. Positive and negative training documents are selected from the relevance judgments. We truncate the documents such that the sequence of the concatenated query document (summary), candidate document, and the separator tokens does not exceed 4,096 tokens. We again re-rank the top-30 BM25 results. The optimal cut-off for the fine-tuned LED summarizer queries was again k = 4; the cut-off for the arXiv-LED summarizer queries was selected in the same way on the validation set. We use the same training configuration as for Vanilla BERT.

Ensemble classifier
We used the Scikit-learn [25] library for training the ensemble classifiers and kept the hyperparameters at their default values. We used the classifier prediction (relevant/non-relevant) to obtain the returned set of documents that we evaluate. Note that the cut-off parameter k is not needed in this approach, because the final ranking only includes documents that are predicted relevant by the classifier.
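A minimal sketch of this score-combination ensemble with scikit-learn defaults is shown below. The feature values (one BM25 score and one Vanilla BERT score per query-candidate pair) and the labels are placeholders; the extraction of these scores from the trained rankers is not reproduced here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# each row holds the scores of the two rankers for one query-candidate pair (placeholder values)
X_train = np.array([[12.3, 0.91], [4.1, 0.12], [9.8, 0.75], [2.2, 0.05], [7.5, 0.60], [1.1, 0.02]])
y_train = np.array([1, 0, 1, 0, 1, 0])            # 1 = noticed (relevant) case
X_test = np.array([[11.0, 0.88], [3.0, 0.10]])

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM linear": SVC(kernel="linear"),
    "SVM RBF": SVC(kernel="rbf"),
    "MLP": MLPClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # the returned set is simply the candidates predicted relevant; no cut-off k is needed
    retrieved = np.flatnonzero(clf.predict(X_test) == 1)
    print(name, "->", retrieved)
```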
Table 2
Lexical retrieval results for the ranking of 200 candidate documents in COLIEE'20. Q refers to query content and C refers to candidate document content. SummaryQ means that the summary of the query document is used as the query. NP refers to the extracted noun phrases using NLTK. Q/C NP means that the extracted noun phrases from the query and candidate document are used as the query and candidate document content in the ranker's input.

Method            Extractor/Summarizer        P%      R%      F1%
BM25              original text               47.69   67.48   56.06
BM25 optimised    original text               57.31   59.15   58.21
BM25              KLI (1-gram)                58.85   62.28   60.51
BM25 optimised    KLI (1-gram)                67.00   61.17   63.95
BM25              summaryQ arXiv-LED          48.65   69.03   57.07
BM25 optimised    summaryQ arXiv-LED          54.20   64.91   59.07
BM25              summaryQ fine-tuned LED     49.72   68.59   57.65
BM25 optimised    summaryQ fine-tuned LED     55.71   63.18   59.21
BM25 optimised    Q entities & C words        62.05   50.62   55.75
BM25 optimised    Q words & C entities        48.77   58.89   53.35
BM25 optimised    Q NP & C NP                 55.77   58.05   56.88
LM JM             original text               46.54   62.25   53.26
LM JM             KLI (1-gram)                58.00   61.85   59.86
LM JM optimised   KLI (1-gram)                66.24   60.28   63.11
LM JM             arXiv-LED                   49.20   68.55   57.28
LM JM             fine-tuned LED              49.68   68.44   57.57
LM JM optimised   Q NP & C words              61.94   50.30   55.51
LM JM optimised   Q words & C NP              47.65   57.94   52.29
LM JM optimised   Q NP & C NP                 55.19   57.67   56.40

Table 3
Neural retrieval results for the ranking of 200 candidate documents in COLIEE'20. Q refers to query content and C refers to candidate document content. SummaryQ/SummaryQC means that the summary of the query document / of both the query and candidate document (generated by the fine-tuned LED) is used as the content in the ranker's input.

Method                          Extractor/Summarizer   P%      R%      F1%
DRMM                            original text          27.05   37.16   31.31
DRMM                            KLI (1-gram)           47.95   67.43   56.05
Vanilla BERT                    original text          36.92   50.16   42.53
Vanilla BERT summaryQ           arXiv-LED              40.26   57.02   47.19
Vanilla BERT summaryQ           fine-tuned LED         49.23   62.22   54.96
Vanilla BERT summaryQC          fine-tuned LED         29.10   45.50   35.49
Vanilla Legal BERT summaryQ     fine-tuned LED         46.77   61.54   53.14
Vanilla Legal BERT summaryQC    fine-tuned LED         40.26   58.34   47.64
Longformer                      original text          30.38   39.00   34.15
Longformer summaryQ             arXiv-LED              35.10   43.40   38.81
Longformer summaryQ             fine-tuned LED         39.04   44.20   41.46

Table 4
Results of ensemble classifiers and comparison with the Cyber team, the best team on COLIEE'20 [7]. Precision and recall are not reported for the COLIEE'20 best result.

Method          Features                          P%      R%      F1%
Naive Bayes     Best BM25 & best Vanilla BERT     72.95   83.66   76.02
SVM linear      Best BM25 & best Vanilla BERT     71.74   83.77   74.61
SVM RBF         Best BM25 & best Vanilla BERT     70.11   83.24   72.37
MLP             Best BM25 & best Vanilla BERT     70.68   83.46   73.20
Cyber (first team in COLIEE'20)                   -       -       67.74

Table 5
Retrieval results for the ranking of documents of COLIEE'21. Precision and recall are not reported for the COLIEE'21 best result. k is the optimal cut-off value for each method on this data.

Method            Extractor/Summarizer        P%      R%      F1%
BM25              original text (k = 7)       7.77    19.59   11.13
BM25              KLI (1-gram) (k = 6)        9.83    19.80   13.13
BM25 optimised    KLI (1-gram) (k = 4)        17.00   25.36   20.35
Vanilla BERT      fine-tuned LED (k = 7)      2.11    5.46    3.04
TLIR (first team in COLIEE'21)                -       -       19.17

4.4. Retrieval results

COLIEE'20 results
The ranking results for the lexical and neural ranking models are shown in Table 2 and Table 3, respectively. The best lexical ranker is BM25, but LM JM with KLI is very close. The best neural ranker in terms of recall and F1 is DRMM. The best single ranker overall is BM25, with an F1 score of 63.95%. The highest precision is obtained by BM25 with KLI queries, and the highest recall by BM25 with arXiv-LED summaries. This indicates that lexical matching is important for this dataset. Table 2 also shows the (mostly positive) effect of optimizing the parameters of BM25, an effort that is not taken by most of the COLIEE participants.

The results of the ensemble models are shown in Table 4. With our ensemble models we improve over the best benchmark result (the Cyber team) by a large margin. The best ensemble model in terms of F1 on the validation set is a combination of the best BM25 ranker (BM25 optimised + KLI) and the second-best neural ranker, Vanilla BERT (Vanilla BERT + SummaryQ, fine-tuned LED). This indicates that BERT can add more to the combination with BM25 than the best neural model DRMM can.

COLIEE'21 results
For evaluating the generalizability of our results, we evaluated the best methods on the COLIEE'21 data. As explained in Section 4.1, the COLIEE'21 task is more difficult than the COLIEE'20 task because it requires retrieval from a full document collection instead of re-ranking 200 documents. We optimised the BM25 parameters and the cut-off value k on the COLIEE'21 validation set. As shown in Table 5, we beat the state-of-the-art result (the TLIR team) with the optimized BM25 ranker. The poor result of Vanilla BERT on COLIEE'21 shows that neural retrieval faces more challenges when ranking the whole collection than when re-ranking the top-200. We also applied ensemble classifiers combining BM25 and Vanilla BERT, but they could not improve the effectiveness; we suppose that this is caused by the lower performance of Vanilla BERT on COLIEE'21.

5. Discussion

The effect of summarization
The results show that summarizing the query document improves all rankers. This holds for the KLI term extraction, noun phrase extraction, and the LED summarizer. This shows the importance of summarization for making the query documents shorter. Our results show that the best summarizing methods for neural ranking and lexical ranking are LED (fine-tuned) and keyword extraction (KLI), respectively. However, summarizing the candidate documents does not improve the neural ranking performance. Another observation is that fine-tuning the summarizer improves the ranking in terms of F-measure for all rankers. We see the largest effect from summarization for Vanilla BERT.

Analysis of unexpected results
One unexpected result is that our Longformer ranker does not outperform the Vanilla BERT ranker. We speculate that this is because Longformer receives more tokens as input during learning, which makes it more difficult to estimate the relevance between document and query, while the size of the training set is relatively small: in COLIEE'20, we have only 2,680 relevant labels, and Longformer could not converge during training – it did not find the optimal loss after 100 epochs. For future work, we will further pre-train Longformer on legal documents. This requires a GPU with 32 GB of RAM, which is not easily accessible.

Our results also show that DRMM could not beat BM25 for case law retrieval. Some prior work has also indicated that for some datasets DRMM is close to BM25 in quality and that sometimes BM25 works better than DRMM [26, 27, 28], especially when BM25 is properly optimized.

The third unexpected result is that the best Vanilla Legal BERT (summaryQ) does not outperform the best Vanilla BERT model. We suppose this can be related to the similarity in language between the COLIEE cases and part of the data that BERT Base was trained on (Wikipedia and thousands of books), because the cases in COLIEE contain stories of applicants' lives, and these make up the larger part of each document.

Analysis of classifier weights
As suggested in [29], we interpret the importance of the Vanilla BERT feature in the ensemble classifiers based on the coefficient values in the fitted linear SVM. The coefficient for Vanilla BERT (0.56) is higher than the coefficient for BM25 (0.43).
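A sketch of how such coefficients can be read from a fitted linear-kernel SVM is shown below. The placeholder data and the implicit assumption that the two score features are on comparable scales are ours, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC

# placeholder feature matrix: columns are [bm25_score, vanilla_bert_score]
X = np.array([[12.3, 0.91], [4.1, 0.12], [9.8, 0.75], [2.2, 0.05], [7.5, 0.60], [1.1, 0.02]])
y = np.array([1, 0, 1, 0, 1, 0])

svm = SVC(kernel="linear").fit(X, y)
bm25_coef, bert_coef = svm.coef_[0]               # one weight per score feature
print(f"BM25 coefficient: {bm25_coef:.2f}, Vanilla BERT coefficient: {bert_coef:.2f}")
```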
Analysis of BM25 hyperparameters
The optimal value for b in the literature is between 0.3 and 0.9 [30, 31, 32], and we found b = 0.9 for the optimised BM25 with KLI. There are documents in COLIEE that, because of their length, contain multiple topics, and it has been suggested before that documents that include a variety of topics benefit from a larger b, so that topics unrelated to a user's search are penalized (https://www.elastic.co/blog/practical-bm25). The normal range for k1 is between 0 and 3, and for long documents that contain diverse information, k1 should tend toward larger values [32]. We found k1 = 2.8 as the optimal value for BM25 with KLI, which makes sense given the long documents in COLIEE.

Using noun phrases or named entities
We experimented with noun phrases and named entities as document representations, inspired by Shao et al. [4]. Our experiments show that the effectiveness of using named entities instead of the original content is much lower than that of using noun phrases: the F1 score for BM25 optimised + Q entities C entities is 22.71%, while the F1 for BM25 optimised + Q NP C NP is 56.88%. Although the use of named entities was suggested in prior work, they do not play an important role in case law documents, at least not so much that using them as document representations leads to effective retrieval.

Future work
Some prior work has obtained good results with passage-level analysis for long document ranking [24, 33]. We think it is a promising direction for future work to combine these passage-level methods with document-level methods to design a more effective legal case retrieval system. One challenge is that this approach is computationally expensive, since each paragraph of the query document needs to be compared with each paragraph of the candidate documents; we have query documents with up to 1,139 paragraphs in COLIEE. For future work we will focus on combining lexical document retrieval with efficient paragraph-level retrieval. We will also evaluate whether we can use COGLTX [5] as a pre-processing step to recognize the important sentences in the query and candidate cases without the limitation on length. Inspired by [33], another direction is to work on neural models that perform full ranking instead of re-ranking for legal case retrieval, because in COLIEE'21 there are 4,415 candidates per query (the whole collection).

6. Conclusions

In this paper, we addressed the challenge of long documents in case law retrieval. We experimented with the Longformer-Encoder-Decoder (LED) for abstractive summarization of case law documents in the COLIEE benchmark data. The fine-tuned LED outperforms the 2019 baseline in terms of F-measure for all three ROUGE metrics.

Second, we implemented a pairwise Longformer ranker for long documents and compared it to four other ranking models on the COLIEE'20 benchmark data: BM25, LM, DRMM, and Vanilla BERT. We found, however, that BM25 outperforms all neural rankers on this task and that the Longformer ranker is outperformed by BERT.

Third, we evaluated the merits of query document summarization for the BM25, BERT and Longformer rankers. We found that summarizing the query document improves the quality of each of the rankers compared to using the original document. For BM25, we also compared the LED summary to statistical query term extraction (KLI), and we found that summarization gives a higher recall, but term extraction gives a higher precision.

Fourth, we showed the effectiveness of combining an optimised BM25 ranker and a BERT ranker, outperforming the state of the art on two benchmark sets. We conclude that retrieval for long query documents in legal case retrieval can be helped by optimising lexical models, by automatic summarization, and by a combination of both.

References

[1] S. A. Lastres, Rebooting legal research in a digital age, 2015.
[2] J. Rabelo, M.-Y. Kim, R. Goebel, M. Yoshioka, Y. Kano, K. Satoh, COLIEE 2020: Methods for legal document retrieval and entailment, 2020. URL: https://sites.ualberta.ca/~rabelo/COLIEE2021/COLIEE_2020_summary.pdf.
[3] D. Locke, G. Zuccon, H. Scells, Automatic query generation from legal texts for case law retrieval, in: Asia Information Retrieval Symposium, Springer, 2017, pp. 181–193.
[4] Y. Shao, B. Liu, J. Mao, Y. Liu, M. Zhang, S. Ma, THUIR@COLIEE-2020: Leveraging semantic understanding and exact matching for legal case retrieval and entailment, arXiv preprint arXiv:2012.13102 (2020).
[5] V. Tran, M. Le Nguyen, S. Tojo, K. Satoh, Encoded summarization: summarizing documents into continuous vector space for legal case retrieval, Artificial Intelligence and Law 28 (2020) 441–467.
[6] J. Rossi, E. Kanoulas, Legal information retrieval with generalized language models, in: Proceedings of the 6th Competition on Legal Information Extraction/Entailment (COLIEE), 2019.
[7] H. Westermann, J. Šavelka, K. Benyekhlef, Paragraph similarity scoring and fine-tuned BERT for legal information retrieval and entailment, in: COLIEE 2020, 2020.
[8] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019).
[9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[10] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).
[11] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, CEDR: Contextualized embeddings for document ranking, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1101–1104.
[12] T. Tomokiyo, M. Hurst, A language model approach to keyphrase extraction, in: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003, pp. 33–40.
[13] J. Rabelo, M.-Y. Kim, R. Goebel, M. Yoshioka, Y. Kano, K. Satoh, A summary of the COLIEE 2019 competition, in: JSAI International Symposium on Artificial Intelligence, Springer, 2019, pp. 34–49.
[14] V. Tran, M. L. Nguyen, K. Satoh, Building legal case retrieval systems with lexical matching and summarization using a pre-trained phrase scoring model, in: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, 2019, pp. 275–282.
[15] Y. Shao, J. Mao, Y. Liu, W. Ma, K. Satoh, M. Zhang, S. Ma, BERT-PLI: Modeling paragraph-level interactions for legal case retrieval, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), 2020.
[16] A. Kanapala, S. Pal, R. Pamula, Text summarization from legal documents: a survey, Artificial Intelligence Review 51 (2019) 371–402.
[17] N. Van de Luijtgaarden, Automatic Summarization of Legal Text, Master's thesis, Utrecht University, 2019. URL: https://dspace.library.uu.nl/handle/1874/384802.
[18] J. Guo, Y. Fan, Q. Ai, W. B. Croft, A deep relevance matching model for ad-hoc retrieval, in: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016, pp. 55–64.
[19] I. Sekulić, A. Soleimani, M. Aliannejadi, F. Crestani, Longformer for MS MARCO document re-ranking task, arXiv preprint arXiv:2009.09392 (2020).
[20] S. Hofstätter, H. Zamani, B. Mitra, N. Craswell, A. Hanbury, Local self-attention over long text for efficient document retrieval, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2021–2024.
[21] S. Verberne, M. Sappelli, D. Hiemstra, W. Kraaij, Evaluation and analysis of term scoring methods for term extraction, Information Retrieval Journal 19 (2016) 510–545.
[22] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, arXiv preprint arXiv:2010.02559 (2020).
[23] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[24] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[26] J. Frej, P. Mulhem, D. Schwab, J.-P. Chevallet, Learning term discrimination, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1993–1996.
[27] I. Chios, S. Verberne, Helping results assessment by adding explainable elements to the deep relevance matching model, 2020. URL: https://ears2020.github.io/accept_papers/2.pdf.
[28] J. Frej, D. Schwab, J.-P. Chevallet, MLWIKIR: A Python toolkit for building large-scale Wikipedia-based information retrieval datasets in Chinese, English, French, Italian, Japanese, Spanish and more.
[29] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[30] M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, C. Burges, Optimisation methods for ranking functions with multiple parameters, in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 585–593.
[31] A. Trotman, A. Puurula, B. Burgess, Improvements to BM25 and language models examined, in: Proceedings of the 2014 Australasian Document Computing Symposium, 2014, pp. 58–65.
[32] A. Lipani, M. Lupu, A. Hanbury, A. Aizawa, Verboseness fission for BM25 document length normalization, in: Proceedings of the 2015 International Conference on the Theory of Information Retrieval, 2015, pp. 385–388.
[33] H. Zamani, M. Dehghani, W. B. Croft, E. Learned-Miller, J. Kamps, From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 497–506.