
Automatic Rhetorical Roles Classification for Legal Documents using LEGAL-TransformerOverBERT

Gabriele Marino (gabriele.marino@santannapisa.it), Daniele Licari (daniele.licari@santannapisa.it), Praveen Bushipaka (praveen.bushipaka@santannapisa.it), Giovanni Comandé (giovanni.comande@santannapisa.it)

Scuola Superiore Sant'Anna, P.zza dei Martiri della Libertà, Pisa, 56100, Italy

Keywords: Rhetorical Roles Classification, LEGAL-BERT, Hierarchical Transformers, LEGAL-ToBERT

Automatic identification of rhetorical roles can help in many downstream applications of legal document analysis, such as legal decision summarization and legal search. This is usually a complex task, even for humans, due to its inherent subjectivity and to the difficulty of capturing sentence context in very long legal documents. We propose a novel approach, based on Hierarchical Transformers, which overcomes these problems and achieves promising results on two different datasets of Italian and English legal judgments. Specifically, we introduce LEGAL-TransformerOverBERT (LEGAL-ToBERT), a model based on the stacking of a transformer encoder over a legal-domain-specific BERT model, and show that our approach significantly improves the baselines set by the stand-alone LEGAL-BERT models by capturing the relationships between different sentences of the same document. We make our models available and ready to use for downstream applications of rhetorical roles classification in the legal context, for both the Italian and the English language.

1. Introduction

Proceedings of the Sixth Workshop on Automated Semantic Analysis of

In this work we introduce a model that we named LEGAL-TransformerOverBERT (LEGAL-ToBERT). Our approach is based on the stacking of a transformer encoder on top of a legal-domain-specific BERT model, creating a hierarchical architecture able to capture the discursive relationships between sentences, allowing accurate classification of rhetorical roles. We also propose a novel positional encoding strategy for the upper-layer transformer of ToBERT, based on the sinusoidal encoding of the relative position of a sentence in the document, and show that this is preferable when dealing with rhetorical roles classification (RRC) in the legal context.

As a proof of the effectiveness of our approach, we tested our model using two different datasets. The first one is a new yet confidential Italian-language dataset that we built specifically for this task and named ITA-RhetRoles; the second one is the English-language BUILD benchmark dataset [5]. We used respectively Italian-LEGAL-BERT [9] and LEGAL-BERT [12] as building blocks for LEGAL-ToBERT. We then compared the results of LEGAL-ToBERT with those of the stand-alone Italian-LEGAL-BERT and LEGAL-BERT, and found that LEGAL-ToBERT allows for significantly better performances on both datasets, improving the baseline MCC respectively by 21% and 30%.

© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0

We make all our code and models publicly available and ready to use for downstream applications of legal rhetorical roles classification (RRC) in our Rhetorical Roles Classification GitHub repository (footnote 1).

2. Related Work

In spite of the increasing research in applications of Artificial Intelligence to the legal domain, only limited works have focused on RRC. One of the earliest works with this aim can be traced back to Hachey et al. [13], in which manually annotated sentences were used to train traditional Machine Learning algorithms such as Naive Bayes and SVM. Moens et al. [14] used Multinomial Naive Bayes classifiers and Maximum Entropy models to address the problem of argument detection in legal texts, as a particular case of RRC. Saravanan et al. [15] employed Conditional Random Fields (CRF) to automate the RRC of legal documents and used the predicted rhetorical roles to rank each sentence and enable a subsequent extractive summarization task. More recently, a work by Ghosh et al. [8] used Hierarchical BiLSTM classifiers with the addition of a CRF to improve the stand-alone CRF baseline for RRC of Indian legal judgments. Starting from the results of this work, Malik et al. [16] proposed a Multi Task Learning (MTL) framework based on the same Hierarchical BiLSTM with CRF model to significantly improve the classification scores. Another noteworthy work by Walker et al. [6] investigated the use of ML and rule-based approaches for RRC tasks, and interestingly found that both approaches can lead to very promising results with a small dataset of manually labeled sentences.

With the advent of deep learning and transformer models [17], neural methods have been applied to RRC, significantly improving the results with respect to previous works. Bhattacharya et al. [18] experimented on cross-jurisdictional legal document datasets with various models including Hierarchical BiLSTM and GRU with the addition of a CRF and with the integration of an attention mechanism. They compared these models with LEGAL-BERT [12], a legal-domain-specific pre-trained transformer, which outperformed the other traditional machine learning algorithms, suggesting further investigation of transformer applications to RRC in the legal domain.

Some other works have shown how hierarchical transformer architectures can be employed to improve the performances of standard transformers when dealing with long texts [19, 20, 21].

Our experiments address RRC using a hierarchical transformer architecture based on legal-domain-specific BERT models. To the best of our knowledge, this is the first attempt combining these two colliding worlds and using them to build a refined model for RRC in the legal domain. Our models are available and ready to use for both the Italian and the English language. This is the first time that a fine-tuned model is made available for RRC of legal documents for the Italian language: it is our sincere hope that this will enable many downstream applications, helping to speed up the work of Italian jurists.

3. Methodology

3.1. Rhetorical Roles Datasets

We used two different datasets to compare the performances of our hierarchical model with those of vanilla BERT models. The first one is a novel dataset that we developed for this work and that we named ITA-RhetRoles, the second one is the BUILD benchmark dataset [5]. Table 1 shows an overview of the two datasets in terms of number of documents and total sentences; both datasets are described in more detail in the following sections.

Split | ITA-RhetRoles #Docs | ITA-RhetRoles #Sents | BUILD #Docs | BUILD #Sents
Train / Valid. (interleaved) | 1104495 | 698,6,02102 | 22241 | 235,2,73542
Test | 294 | 18,288 | 30 | 2,879

Table 1: Number of documents and total number of sentences for ITA-RhetRoles and BUILD datasets.

3.1.1. ITA-RhetRoles Dataset

ITA-RhetRoles is a dataset of civil law Italian legal cases. This dataset has been kept private as it was built under a confidentiality agreement between Scuola Superiore Sant'Anna and some Italian courts. ITA-RhetRoles consists of approximately 1,500 Italian legal documents, split into train, validation, and test set using the year and the subject of the case as stratification keys. Figure 1 shows the dataset distribution in terms of document length: the longest document of the dataset consists of 248 sentences. The labelled rhetorical roles are the 5 most common sections of an Italian civil judgment: "Introduction" (INT), "Conclusions of the parties" (CP), "Summary of the appealed judgment" (SAJ), "Legal reasons" (LR), and "Decisional content" (DC). These labels were extracted using regular expressions to identify the different sections in the collected documents. Handmade validation was then

Footnote 1: https://github.com/GM862001/RhetoricalRolesClassification

3.1.2. BUILD Dataset

The BUILD dataset is a corpus of legal judgment documents from the Supreme Court of India, High Courts in different Indian states and some district-level courts. It consists of a publicly released train and validation set (footnote 2) and a private test set. We used the public validation set as test set and split the original train set into a train and validation set. Figure 2 shows the dataset distribution in terms of document length: the longest document of the dataset consists of 386 sentences. The labelled rhetorical roles are 13: "Preamble" (PRE), "Facts" (FAC), "Ruling by Lower Court" (RLC), "Issues" (ISSUE), "Argument by petitioner" (ARGP), "Argument by respondent" (ARGR), "Analysis" (ANA), "Statute" (STA), "Precedent relied" (PRER), "Precedent not relied" (PRENR), "Ratio of the decision" (RAT), "Ruling by Present Court" (RPC), "None of the others" (NONE).

Footnote 2: https://github.com/Legal-NLP-EkStep/rhetorical-role-baseline

3.2. TransformerOverBERT (ToBERT)

TransformerOverBERT (ToBERT) has a hierarchical architecture, shown in Figure 3, based on the stacking of the following components: a BERT token-level encoder, a sentence-level positional encoder, a sentence-level encoder, and a prediction layer. The processing of a legal case starts with splitting the raw text of the document into sentences and tokenizing them. Each sentence is fed to the BERT token-level encoder and the pooled output for that sentence, i.e. the hidden representation of the [CLS] token output by BERT, is extracted.

The pooled outputs are gathered and fed to the positional layer to create a position-dependent encoding of each sentence in the document. These are then input into the sentence-level encoder. The output representations of this layer are finally fed to the prediction layer for rhetorical roles classification. Each of these components is described in the following sections.

3.2.1. Data Preprocessing

Before being input to ToBERT, each document is split into sentences. Each sentence is tokenized using the token-level BERT tokenizer and then padded or truncated to a certain number of tokens. Documents are also padded with null sentences up to the length of the longest document of the train set (e.g., 386 for BUILD dataset and 284 for ITA-RhetRoles dataset), so as to obtain a batch of input documents $\mathcal{I} \in \mathbb{R}^{N \times S \times T}$, where $N$ is the number of documents, $S$ is the number of sentences for each document, and $T$ is the size of the token embeddings.
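The padding and truncation scheme described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the padding id `PAD_TOKEN_ID = 0`, the helper names, and the sentence mask returned alongside the padded document are all hypothetical, and the token ids in the example are arbitrary.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 0  # assumption: id used to pad token sequences and null sentences


def pad_or_truncate(token_ids: List[int], max_tokens: int) -> List[int]:
    """Truncate a sentence to max_tokens ids, or right-pad it with PAD_TOKEN_ID."""
    clipped = token_ids[:max_tokens]
    return clipped + [PAD_TOKEN_ID] * (max_tokens - len(clipped))


def pad_document(
    sentences: List[List[int]], max_sentences: int, max_tokens: int
) -> Tuple[List[List[int]], List[int]]:
    """Return a [max_sentences, max_tokens] document and a 0/1 sentence mask."""
    kept = sentences[:max_sentences]
    padded = [pad_or_truncate(s, max_tokens) for s in kept]
    mask = [1] * len(padded)  # 1 marks a real sentence
    while len(padded) < max_sentences:
        padded.append([PAD_TOKEN_ID] * max_tokens)  # null sentence
        mask.append(0)  # 0 marks a padding sentence
    return padded, mask


# Toy document: two tokenized sentences of different lengths.
doc = [[101, 7, 8, 102], [101, 9, 102]]
padded_doc, sentence_mask = pad_document(doc, max_sentences=4, max_tokens=3)
```

The mask is what later allows padding sentences to be ignored, for instance when computing the loss or the relative position of a sentence.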

3.2.2. BERT Token-Level Encoder

Bidirectional Encoder Representations from Transformers (BERT) is a neural model based on the transformer architecture [17]. It uses self-attention, residual connections, and layer normalization to achieve state-of-the-art results in many different tasks, with the addition of a task-specific output layer as the only modification to the model architecture [22].

BERT-like models are usually pre-trained via self-supervised methods on large unlabelled corpora and then fine-tuned for the specific task in a supervised fashion. Our approach is not different, in that we leverage two different pre-trained BERT models: Italian-LEGAL-BERT [9] and LEGAL-BERT [12]. Both these models are pre-trained on huge legal datasets consisting of Italian and English cases respectively: our training process aimed only to fine-tune them for our RRC use case.

In ToBERT, BERT is used as a token-level encoder. Specifically, it is used to obtain the hidden representation of the [CLS] token of each batch sequence. It means that it is fed with batches of sentences in $\mathbb{R}^{N \times S \times T}$ and produces as output a set of document representations in $\mathbb{R}^{S \times H}$, where $H$ is the hidden size of the specific BERT model used (e.g., 768 for LEGAL-BERT).

3.2.3. Sentence-Level Positional Encoder

A specific positional encoder is used to add a piece of information to the representation of each sentence about its position in the document.

In this work, we focus on Sinusoidal Positional Embeddings. Let us define the input document length (i.e. the number of sentences in the document) as $S$ and the embedding dimension as $H$ (e.g., 768 for LEGAL-BERT). For the $t$-th sentence representation $x_t \in \mathbb{R}^{H}$ of a document (with $0 \le t < S$), the output of a sinusoidal positional encoder is:

$$x'_t = x_t + p_t,$$

where the $i$-th component of the embedding vector $p_t \in \mathbb{R}^{H}$ (for $0 \le i < H$) is given by:

$$p_t^{(i)} = \begin{cases} \sin(a_t \, \omega_i), & \text{if } i \text{ is even} \\ \cos(b_t \, \omega_i), & \text{if } i \text{ is odd} \end{cases} \qquad \omega_i = \frac{1}{10000^{2 \lfloor i/2 \rfloor / H}},$$

where $a_t$ and $b_t$ are weights that depend on the embedding strategy.

We tried two different approaches. The first one, which we named Absolute Positional Embedding, is the same used in the original Transformer architecture [17], and uses the weights $a_t = b_t = t$. The second one is a novel embedding strategy that we named Relative Positional Embedding. This one takes into account the relative position of a sentence in the document and uses the weights $a_t = b_t = 1000 \, t / L$, where $L$ is the length of the document to which the sentence belongs (footnote 3). Basically, instead of encoding the absolute position of a sentence, this embedding strategy encodes the relative position of that sentence with respect to the length of the document in thousandths (‰), using standard Sinusoidal Positional Embeddings.

The idea behind this approach is that legal documents often have a repetitive rhetorical structure (introductory sentences always come first, followed by sentences summarizing the final decision, and so on). Through a preliminary exploratory data analysis we found that there is in fact a correlation between the rhetorical role of a sentence and its relative position in the document. This dependency might rely on the specific language and legal field of the document, but including such a piece of information in the positional encoding of a sentence might add valuable hints for its correct rhetorical role classification.
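The difference between the two embedding strategies can be made concrete with a short sketch. It assumes the standard sinusoidal frequencies of the original Transformer and a relative weight of 1000 times the sentence index divided by the document length, without rounding; the function names are hypothetical and the code is an illustration rather than the released implementation.

```python
import math
from typing import List


def position_weight(t: int, doc_len: int, strategy: str) -> float:
    """Weight fed to the sinusoids: the index itself, or its position in thousandths."""
    if strategy == "absolute":
        return float(t)
    if strategy == "relative":
        return 1000.0 * t / doc_len  # assumption: no rounding is applied
    raise ValueError(f"unknown strategy: {strategy}")


def sinusoidal_embedding(weight: float, dim: int) -> List[float]:
    """Sine on even components, cosine on odd ones, with paired frequencies."""
    emb = []
    for i in range(dim):
        freq = 1.0 / (10000.0 ** (2 * (i // 2) / dim))
        angle = weight * freq
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def encode_document(sentence_vectors: List[List[float]], strategy: str = "relative") -> List[List[float]]:
    """Add a positional embedding to every (non-padding) sentence vector."""
    doc_len = len(sentence_vectors)
    encoded = []
    for t, vec in enumerate(sentence_vectors):
        pos = sinusoidal_embedding(position_weight(t, doc_len, strategy), len(vec))
        encoded.append([x + p for x, p in zip(vec, pos)])
    return encoded


# A sentence halfway through a 10-sentence document and one halfway through a
# 100-sentence document receive the same relative embedding.
short_doc = sinusoidal_embedding(position_weight(5, 10, "relative"), 8)
long_doc = sinusoidal_embedding(position_weight(50, 100, "relative"), 8)
```

With the absolute strategy the same two sentences would be encoded with weights 5 and 50, so the model could not directly see that both sit at the midpoint of their documents.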

3.2.4. Sentence-Level Encoder

The sentence-level encoder is a transformer model [17] with the same configuration as the transformer encoders of the token-level BERT encoder (768 hidden dimensions, 12 attention heads, GELU activation function, and so on), but with only 2 stacked encoder layers. It is used to process the batch of document representations in $\mathbb{R}^{N \times S \times H}$ output by the positional encoder. The output produced by this component has the same shape as its input and is a batch of document representations that takes into account the relationships between the sentences of each document. The advantage of using a transformer encoder over recurrent architectures like LSTMs is that of better capturing long-distance relationships between sentences, thanks to the multi-head attention mechanism. This algorithm involves four main steps:

1. Input: a document representation $D \in \mathbb{R}^{S \times H}$, where $S$ is the number of sentences of $D$ and $H$ is the model's hidden size.
2. Linear transformations: the attention function is applied in parallel using $h = 12$ attention heads. For each attention head $i$, $D$ is projected into three different spaces: the key space $K_i \in \mathbb{R}^{S \times d_k}$, the query space $Q_i \in \mathbb{R}^{S \times d_k}$, and the value space $V_i \in \mathbb{R}^{S \times d_v}$. These projections are computed using learned weight matrices $W_i^K \in \mathbb{R}^{H \times d_k}$, $W_i^Q \in \mathbb{R}^{H \times d_k}$, and $W_i^V \in \mathbb{R}^{H \times d_v}$, where $d_k = d_v = 64$ (i.e. $H / h$).
3. Scaled dot-product attention and softmax: for each attention head $i$, the attention scores are computed by taking the dot product of query $Q_i$ and key $K_i$, scaling by the square root of the key dimension $d_k$, and then applying a softmax function to normalize the scores. Finally, the normalized scores are multiplied by the value matrix $V_i$ to obtain the attention output matrix $A_i$:

$$A_i = \mathrm{softmax}\left( \frac{Q_i K_i^{\top}}{\sqrt{d_k}} \right) V_i$$

4. Output: after computing the output for each attention head, these are concatenated along their last dimension. Finally, a linear projection is applied using a learned weight matrix $W^O \in \mathbb{R}^{h d_v \times H}$ to obtain the final output $M$ of the Multi-Head Attention layer:

$$M = \mathrm{concat}(A_1, A_2, \ldots, A_h) \, W^O$$

This sentence-level multi-head attention mechanism allows the model to capture different types of relationships between sentences by learning separate attention patterns for each head. Stacking multiple encoder layers, in turn, allows the model to learn increasingly abstract representations of the input sequence.

Similar to the transformer encoders used in the BERT architecture, this model includes a dropout layer as a regularization technique to prevent overfitting. Dropout randomly shuts down some of the neurons in the network during training, sampling from a Bernoulli distribution with some probability $p$ (which is equal to 0.1 in the case of BERT), forcing the remaining neurons to learn more robust features that are not dependent on the presence of other units.

Footnote 3: We do not take into account the padding sentences here.
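The four attention steps of the sentence-level encoder can be illustrated with a dependency-free sketch. It uses plain Python lists instead of tensors, omits biases, masking, dropout, and the feed-forward sublayer, and works on a toy document with 2 sentences, hidden size 2, and 2 heads; all weights are hand-picked for the example and every helper name is hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(d: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Steps 2 and 3: project into Q, K, V, then apply scaled dot-product attention."""
    q, k, v = matmul(d, w_q), matmul(d, w_k), matmul(d, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def multi_head_attention(
    d: Matrix, heads: List[Tuple[Matrix, Matrix, Matrix]], w_o: Matrix
) -> Matrix:
    """Step 4: concatenate the head outputs and apply the output projection."""
    outputs = [attention_head(d, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    concat = [sum((o[i] for o in outputs), []) for i in range(len(d))]
    return matmul(concat, w_o)


# Step 1: a toy document with S = 2 sentences and hidden size H = 2.
document = [[1.0, 0.0], [0.0, 1.0]]
zero_query = [[0.0], [0.0]]  # zero queries give uniform attention weights
head_1 = (zero_query, [[1.0], [1.0]], [[1.0], [0.0]])
head_2 = (zero_query, [[1.0], [1.0]], [[0.0], [1.0]])
identity = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head_attention(document, [head_1, head_2], identity)
```

Because the queries are zero, every sentence attends equally to both sentences, so each head returns the average of its value vectors; this makes the output easy to check by hand.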

3.2.5. Prediction Layer

The prediction layer input is the batch of document representations $D' \in \mathbb{R}^{N \times S \times H}$ as output by the sentence-level encoder. This is fed to a linear layer with $C$ output units, $C$ being the number of labels (i.e. rhetorical roles), and then goes through a dropout layer for regularization purposes. The final output $O \in \mathbb{R}^{N \times S \times C}$ is the rhetorical roles classification logits. If labels are provided (e.g. during training) this layer computes and returns the cross entropy loss between the logits and the labels, filtering out the inactive tokens (i.e. the padding ones).

4. Experiments

Our experiments aimed to provide a baseline for both ITA-RhetRoles and BUILD datasets using legal-domain-specific BERT models and improve them using LEGAL-ToBERT. When evaluating our models we considered the following metrics: accuracy, Matthews Correlation Coefficient (MCC), micro and macro precision, micro and macro recall, micro and macro F1.

4.1. Models

We used Italian-LEGAL-BERT [9] and LEGAL-BERT [12] (the baseline models) to provide a baseline respectively for ITA-RhetRoles and BUILD datasets. Specifically, each of them was chosen as the encoder of an AutoModelForSequenceClassification from the HuggingFace Transformers Python package [23]. We coupled each model with the relative AutoTokenizer, and we applied truncation and padding using $T = 64$ as the max sentence length. As described in section 3.2.1, we also padded the documents with null sentences up to the length of the longest document for each dataset (386 sentences for BUILD dataset, 284 sentences for ITA-RhetRoles dataset).

After having set a baseline for both datasets, we used the very same BERT models as the token-level encoders of ToBERT, and used ToBERT itself as the encoder of an AutoModelForTokenClassification, keeping the same tokenizers and the same truncation max sequence length. As sentence-level encoder we used 2 stacked encoder layers from the PyTorch transformer model.

4.2. Training and Hyperparameters

Fine-Tuning. We trained all our models using a PyTorch linear scheduler based on the AdamW optimizer, leveraging the Gradient Scaler from the CUDA Automatic Mixed Precision package. When training the baseline models we set the batch size to 128, while we used one-document batches to train ToBERT. In both cases, we accumulated gradients every 3 steps. We set the maximum number of epochs to 20, while using early stopping with 2 patience steps. All other relevant hyperparameters were fine-tuned.

We used the Optuna Python package for hyperparameter fine-tuning [24]. This is an automated and efficient optimization framework offering a versatile define-by-run API for the hyperparameter space.

When training our baseline models we considered the following hyperparameter space:

• Learning rate ∈ [5e-6, 5e-4];
• Weight decay ∈ [1e-3, 1e-1].

To these hyperparameters, we added the following ones when training ToBERT:

• Sentence-level positional embedding strategy (S_lv_pos_emb): either absolute or relative;
• Sentence-level encoder dropout (S_lv_enc_dropout) ∈ [0.1, 0.7];
• Sentence-level encoder feed-forward network size (S_lv_enc_FFN_size) ∈ {50, 51, ..., 1000}.
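To make the search space above concrete, the sketch below draws candidate configurations from it with the Python standard library. It is a plain random-sampling stand-in and does not use Optuna or any model-based search; the log-uniform scale for learning rate and weight decay is an assumption, and the function names and the fixed seed are hypothetical.

```python
import math
import random
from typing import Dict, Union

Params = Dict[str, Union[float, int, str]]


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform in [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_trial(rng: random.Random, tobert: bool) -> Params:
    """Sample one configuration; the ToBERT-only parameters are added on request."""
    params: Params = {
        "learning_rate": sample_log_uniform(rng, 5e-6, 5e-4),  # assumption: log scale
        "weight_decay": sample_log_uniform(rng, 1e-3, 1e-1),  # assumption: log scale
    }
    if tobert:
        params["S_lv_pos_emb"] = rng.choice(["absolute", "relative"])
        params["S_lv_enc_dropout"] = rng.uniform(0.1, 0.7)
        params["S_lv_enc_FFN_size"] = rng.randint(50, 1000)  # inclusive bounds
    return params


rng = random.Random(0)  # fixed seed so the sketch is reproducible
trials = [sample_trial(rng, tobert=True) for _ in range(32)]
baseline_trial = sample_trial(rng, tobert=False)
```

In a real search each sampled configuration would be used to train a model and the validation loss would decide which configuration is kept.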

We used the TPE (Tree-structured Parzen Estimator) algorithm proposed by Bergstra et al. [25] for hyperparameter optimization. This method has been shown to outperform many competing ones, including random search and grid search, in terms of efficiency and effectiveness. By fitting two separate Gaussian Mixture Models (GMMs) to the best and worst objective values, TPE estimates the density of the promising and unpromising regions separately, and guides the search accordingly. On each trial, TPE samples a new set of candidate hyperparameters by maximizing the ratio $l(x)/g(x)$, where $l(x)$ is the density estimate of "good" hyperparameter combinations and $g(x)$ is the density estimate of "bad" hyperparameter combinations. The candidate hyperparameters with the highest ratio are then evaluated using the objective function, and the process is repeated.

For each dataset, we performed 32 search trials minimizing the validation loss, and picked the best model for final testing. Table 2 shows the best hyperparameter combination for LEGAL-BERT and LEGAL-ToBERT when trained on ITA-RhetRoles and BUILD datasets. It is interesting to notice that in both cases the relative embedding strategy was preferred to the absolute one when training LEGAL-ToBERT. This suggests the usefulness of including relative position information in the positional embeddings of the sentences, to leverage the correlation between this feature and their rhetorical role, due to the repetitive rhetorical structure of a legal document as a whole.

Dataset | Model | Parameter | Value
ITA-RhetRoles | LEGAL-BERT | Learning rate | 6.49e-05
ITA-RhetRoles | LEGAL-BERT | Weight decay | 5.35e-02
ITA-RhetRoles | LEGAL-ToBERT | Learning rate | 8.32e-05
ITA-RhetRoles | LEGAL-ToBERT | Weight decay | 6.93e-02
ITA-RhetRoles | LEGAL-ToBERT | S_lv_pos_emb | relative
ITA-RhetRoles | LEGAL-ToBERT | S_lv_enc_dropout | 0.26
ITA-RhetRoles | LEGAL-ToBERT | S_lv_enc_FFN_size | 167
BUILD | LEGAL-BERT | Learning rate | 7.03e-05
BUILD | LEGAL-BERT | Weight decay | 9.16e-02
BUILD | LEGAL-ToBERT | Learning rate | 7.54e-05
BUILD | LEGAL-ToBERT | Weight decay | 8.36e-02
BUILD | LEGAL-ToBERT | S_lv_pos_emb | relative
BUILD | LEGAL-ToBERT | S_lv_enc_dropout | 0.13
BUILD | LEGAL-ToBERT | S_lv_enc_FFN_size | 968

Table 2: Best hyperparameter combination for LEGAL-BERT and LEGAL-ToBERT on ITA-RhetRoles and BUILD datasets.

5. Results

We evaluated our approach for legal RRC on both the ITA-RhetRoles and the BUILD dataset. Our analysis aims to compare the results of LEGAL-ToBERT with the baselines provided by vanilla stand-alone LEGAL-BERT models, both in overall terms and with respect to each considered rhetorical role.

5.1. ITA-RhetRoles

Table 3 lists the results of the best models selected by the hyperparameter fine-tuning process on the ITA-RhetRoles test dataset. LEGAL-ToBERT achieves an almost perfect score in each considered metric (all of them remain above 97%), significantly outperforming LEGAL-BERT. In particular, LEGAL-ToBERT achieves a macro F1 score of 0.98 and an MCC of 0.972, improving the baselines set by LEGAL-BERT by 12% and 21% respectively.

Metric | LEGAL-BERT | LEGAL-ToBERT
Accuracy | 0.872 | 0.982
MCC | 0.806 | 0.972
F1 (Micro) | 0.872 | 0.982
F1 (Macro) | 0.878 | 0.980
P (Micro) | 0.872 | 0.982
P (Macro) | 0.871 | 0.979
R (Micro) | 0.872 | 0.982
R (Macro) | 0.889 | 0.980

Table 3: Test results for ITA-RhetRoles dataset.

We also analyzed the performance of our method on each rhetorical role separately. Table 4 shows the precision, recall, and macro F1 score for each rhetorical role. In terms of macro F1 score, LEGAL-ToBERT achieves better performances for each rhetorical role, apart from introductory sentences, for which the performances of the two models are comparable. Specifically, the improvement in terms of macro F1 scores ranges from 3% (DC - decisional sentences) to 16% (SAJ - sentences summarizing the appealed judgment).

[Table 4: precision, recall, and macro F1 score for each rhetorical role (INT, CP, SAJ, LR, DC) on ITA-RhetRoles.]

5.2. BUILD

Table 5 lists the results of the best models selected by the hyperparameter fine-tuning process on the BUILD test dataset. LEGAL-ToBERT significantly outperforms LEGAL-BERT in each considered metric. In particular, LEGAL-ToBERT achieves a macro F1 score of 0.57 and an MCC of 0.73, improving the baselines set by LEGAL-BERT by 22% and 30% respectively.

Metric | LEGAL-BERT | LEGAL-ToBERT
Accuracy | 0.656 | 0.785
MCC | 0.559 | 0.727
F1 (Micro) | 0.656 | 0.785
F1 (Macro) | 0.472 | 0.574
P (Micro) | 0.656 | 0.785
P (Macro) | 0.532 | 0.623
R (Micro) | 0.656 | 0.785
R (Macro) | 0.457 | 0.564

Table 5: Test results for BUILD dataset.

We also analyzed the performance of our method on each rhetorical role separately. Table 6 shows the precision, recall, and macro F1 score for each rhetorical role. In terms of macro F1 score, LEGAL-ToBERT outperforms LEGAL-BERT in almost every rhetorical role, apart from sentences asserting the petitioner arguments (ARGP), for which, surprisingly, LEGAL-BERT performs 12% better. The two models perform equally well on sentences about not relied precedents (PRENR), sentences presenting the issue of the debate (ISSUE) and statute sentences (STA). The improvement achieved by LEGAL-ToBERT for all other rhetorical roles ranges from 4% (RPC - ruling sentences by the present court) to 40% (ARGR - sentences asserting the respondent argument).

5.3. Discussion

Our experiments show that approaches to legal RRC based on LEGAL-ToBERT greatly improve the baselines set by vanilla stand-alone LEGAL-BERT models, in two different languages and legal contexts.

The huge improvement in performances is attributable to the capability of ToBERT models to deal effectively with long documents, by considering and leveraging the relationships between the different sentences of the same legal judgement. Other than this, the relative positional encoding strategy that we applied in the upper layer of our hierarchical transformer allows our approach to take into account the correlation between the rhetorical roles of the individual sentences and their relative position in the document, which provides further hints for the correct classification of a sentence, leveraging the repetitive rhetorical structure of legal documents.

LEGAL-ToBERT results are particularly surprising in the case of ITA-RhetRoles dataset. This is most reasonably due to the higher amount of data this is composed of, which allows a complex model like ToBERT to reach and exploit its maximum potential, and to the ease of this task, given the repetitiveness of the structure of the documents used. The results achieved on the BUILD dataset are much worse in absolute terms, due to the greater difficulty of the task (many more labels, much less data), but the relative improvement introduced by ToBERT on the baseline is comparable, if not even better, with respect to that achieved on ITA-RhetRoles (MCC improves by 30% in the case of BUILD and by 21% in that of ITA-RhetRoles).

Such promising results invite the use of this model architecture to automate RRC in related applications, giving high hopes of achieving relevant outcomes in many different legal document analysis tasks.
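The MCC improvements quoted in this section are consistent with a relative (not percentage-point) change over the baseline. The short check below recomputes them from the MCC values reported for the two test sets; the helper name is hypothetical.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change over the baseline, expressed as a percentage."""
    return 100.0 * (improved - baseline) / baseline


# MCC of LEGAL-BERT vs. LEGAL-ToBERT on the two test sets.
ita_mcc_gain = relative_improvement(0.806, 0.972)
build_mcc_gain = relative_improvement(0.559, 0.727)
print(f"ITA-RhetRoles: {ita_mcc_gain:.1f}%  BUILD: {build_mcc_gain:.1f}%")
```

Read this way, the gain on BUILD is larger in relative terms even though the absolute MCC on BUILD is much lower, which is the comparison the discussion relies on.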

5.4. Limitations

While ToBERT models have shown impressive performance on legal RRC benchmarks, we want to highlight some of their main limitations.

ToBERT models are computationally expensive.

ToBERT models rely on a huge number of parameters, which makes training and fine-tuning much more computationally expensive than other competitive approaches, including CRFs and stand-alone BERT models. This can become a serious limit in terms of scalability and practicality of use in certain applications. For instance, dealing with very long documents (e.g., thousands of sentences) or with documents with very long sentences (e.g., many hundreds of tokens) could become infeasible without very powerful computational resources, both in terms of time and space complexity.

ToBERT models require high availability of annotated data. When running experiments on very small datasets (less than 100 documents), we did not find any advantage in using ToBERT compared to vanilla BERT.

These and other experimental results suggest that the effectiveness of automated legal RRC using supervised NLP models is highly affected by the size and complexity of the dataset and the quality of the annotations. The need for such approaches to have big and high-quality datasets is very restricting, as the availability of such datasets in the legal context is particularly limited for privacy and discretionality reasons.

ToBERT models do not generalize well to documents longer than those seen during training. For architectural reasons, ToBERT models are unable to effectively manage documents longer than those seen during training.

ToBERT models may lack interpretability. A hierarchical use of transformer-based models introduces a further layer of complexity which makes it even more challenging to interpret model decisions, leading to difficulties in identifying and diagnosing errors or biases in model predictions.

LEGAL-ToBERT models suffer from limited multilingual support. LEGAL-ToBERT models rely on pre-trained language-specific LEGAL-BERT models, which makes it difficult to apply this approach to multilingual or cross-language tasks. Deploying such models is not easy as it requires the fine-tuning of a BERT model using a huge amount of legal documents in the considered language. Still, our hope is that the availability of legal domain-specific pre-trained models will quickly improve with time, breaking new ground in many different languages.

6. Conclusion and Future Work

In this work we introduced LEGAL-TransformerOverBERT (LEGAL-ToBERT), a novel approach to legal rhetorical roles classification that leverages the power of Hierarchical Transformers and legal-domain-specific BERT models. We also proposed a novel embedding strategy for the top layer encoder of LEGAL-ToBERT, based on the sinusoidal encoding of the document sentences using their relative position in the document instead of the absolute one. Our results provide evidence that this approach allows for a robust and effective framework able to classify efficiently the rhetorical roles of the sentences of long legal documents by taking into account the relationships between them.

We tested the effectiveness of LEGAL-ToBERT on two different datasets. The first one is ITA-RhetRoles, a novel yet confidential dataset, consisting of approximately 1,500 documents from the Italian Civil Court Corpus; the second one is the BUILD benchmark dataset, composed of a couple of hundred documents from a varied set of Indian courts. This allowed us to diversify our experiments in terms of both language and topic. We showed that LEGAL-ToBERT significantly outperforms vanilla stand-alone LEGAL-BERT models on both ITA-RhetRoles and BUILD datasets, improving the macro F1 score by 12% and 22% and the MCC by 21% and 30% respectively.

Future research should aim to extend and improve the proposed approach to other domains and languages. It is also important to address the problem of building robust frameworks in the absence of large datasets, which is most often the case when dealing with the legal domain. On the other hand, we hope that the constant progress in legal NLP will incentivize the collection and the release of increasingly large datasets. Finally, our models are publicly available and ready to use, and we ourselves plan to leverage them to enable and improve many downstream applications such as summarization and argument mining of legal documents.

Acknowledgments

This work is part of the Italian nationwide "Giustizia Agile" (Agile Justice) project (footnote 4), funded by the Italian Ministry of Justice.

Footnote 4: More information is available at https://www.unitus.it/it/unitus/mappatura-della-ricerca/articolo/giustizia-agile.

References

ference on Semantic Computing (ICSC), IEEE, 2020, pp. 464–467.

[8] S. Ghosh, A. Wyner, Identification of rhetorical roles of sentences in indian legal judgments, in: Legal Knowledge and Information Systems: JURIX 2019: The Thirty-second Annual Conference, volume 322, IOS Press, 2019, p. 3.

[9] D. Licari, G. Comandè, ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law, in: D. Symeonidou, R. Yu, D. Ceolin, M. Poveda-Villalón, D. Audrito, L. D. Caro, F. Grasso, R. Nai, E. Sulis, F. J. Ekaputra, O. Kutz, N. Troquard (Eds.), Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, volume 3256 of CEUR Workshop Proceedings, CEUR, Bozen-Bolzano, Italy, 2022. URL: https://ceur-ws.org/Vol-3256/#km4law3, ISSN: 1613-0073.

[10] P. Henderson, M. S. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, D. E. Ho, Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset, 2022. arXiv:2207.00220.

[11] A. Chriqui, I. Yahav, I. Bar-Siman-Tov, Legal hebert: A bert-based nlp model for hebrew legal, judicial and legislative texts, SSRN preprint:4147127 (2022).

[12] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.

[13] B. Hachey, C. Grover, A rhetorical status classifier for legal text summarisation, in: Text Summarization Branches Out, 2004, pp. 35–42.

[14] M.-F. Moens, E. Boiy, R. M. Palau, C. Reed, Automatic detection of arguments in legal texts, in: Proceedings of the 11th International Conference on Artificial Intelligence and Law, ICAIL '07, Association for Computing Machinery, New York, NY, USA, 2007, p. 225–230. URL: https://doi.org/10.1145/1276318.1276362. doi:10.1145/1276318.1276362.

mation processing systems 30 (2017).

[18] P. Bhattacharya, S. Paul, K. Ghosh, S. Ghosh, A. Wyner, Deeprhole: deep learning for rhetorical role labeling of sentences in legal case documents, Artificial Intelligence and Law (2021) 1–38.

[19] J. Lu, M. Henchion, I. Bacher, B. M. Namee, A sentence-level hierarchical bert model for document classification with limited labelled data, in: Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, October 11–13, 2021, Proceedings 24, Springer, 2021, pp. 231–241.

[20] I. Chalkidis, X. Dai, M. Fergadiotis, P. Malakasiotis, D. Elliott, An exploration of hierarchical attention transformers for efficient long document classification, arXiv preprint arXiv:2210.05529 (2022).

[21] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical transformers for long document classification, in: 2019 IEEE automatic speech recognition and understanding workshop (ASRU), IEEE, 2019, pp. 838–844.

[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[23] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.

[24] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

[25] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, K. Weinberger (Eds.), Advances in

[15] M. Saravanan, B.
Ravindran, Identification of Neural Information Processing Systems, volrhetorical roles for segmentation and summariza- ume 24, Curran Associates, Inc., 2011. URL: https: tion of a legal judgment, Artificial Intelligence and //proceedings.neurips.cc/paper_files/paper/2011/ Law 18 (2010) 45–76. file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf. [16] V. Malik, R. Sanjay, S. K. Guha, A. Hazarika,

S. Nigam, A. Bhattacharya, A. Modi, Semantic segmentation of legal documents via rhetorical roles, 2022. arXiv:2112.01836. [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,

L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural infor

[1]

Farzindar , G. Lapalme, Letsum, an automatic legal text summarizing system , JURIX ( 2004 ) 11 - 18 .

[2] I. Nejadgholi ,

Bougueng ,

Witherspoon , A semisupervised training method for semantic search of legal facts in canadian immigration cases ., in: JURIX, 2017 , pp. 125 - 134 .

[3]

Savelka ,

K. D.

Ashley , Segmenting us court decisions into functional and issue specific parts ., in: JURIX , 2018 , pp. 111 - 120 .

[4]

Hachey ,

Grover , Extractive summarisation of legal texts, Artificial Intelligence and Law 14 ( 2006 ) 305 - 345 .

[5]

Kalamkar ,

Tiwari ,

Agarwal ,

Karn ,

Gupta ,

Raghavan ,

Modi , Corpus for automatic structuring of legal documents , in: Proceedings of the Thirteenth Language Resources and Evaluation Conference , European Language Resources Association, Marseille, France, 2022 , pp. 4420 - 4429 . URL: https://aclanthology.org/ 2022 . lrec- 1 . 470 .

[6]

V. R.

Walker ,

Pillaipakkamnatt ,

A. M.

Davidson ,

Linares ,

D. J.

Pesce , Automatic classification of rhetorical roles for sentences: Comparing rulebased scripts with machine learning ., ASAIL@ ICAIL 2385 ( 2019 ).

[7]

S. R.

Ahmad ,

Harris , I. Sahibzada , Understanding legal documents: classification of rhetorical role of sentences using deep learning and natural language processing , in: 2020 IEEE 14th International Con-