
Enhancing Pre-Trained Language Models with Sentence Position Embeddings for Rhetorical Roles Recognition in Legal Opinions

Anas Belfathi

anas.belfathi@edu.univ-paris13.fr 0

Nicolas Hernandez

nicolas.hernandez@univ-nantes.fr 0

Laura Monceaux

laura.monceaux@univ-nantes.fr 0

0 Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France

The legal domain is a vast and complex field that involves a considerable amount of text analysis, including laws, legal arguments, and legal opinions. Legal practitioners must analyze these texts to understand legal cases, research legal precedents, and prepare legal documents. Legal opinions keep growing in size, and their complexity and diversity make it increasingly challenging to develop a model that accurately predicts their rhetorical roles. In this paper, we propose a novel model architecture for automatically predicting rhetorical roles using pre-trained language models (PLMs) enhanced with knowledge of sentence position within a document. Based on an annotated corpus from the LegalEval@SemEval2023 competition, we demonstrate that our approach requires fewer parameters, and therefore lower computational cost, than complex architectures employing a hierarchical model in a global context, while still achieving competitive performance. Moreover, we show that adding more attention to a hierarchical model based only on BERT in the local context, along with incorporating sentence position information, enhances the results.

Keywords: Pre-trained language models, Discourse structure modeling, Legal Opinions, Sentence Positional Embeddings, Rhetorical Role, Sequence Labelling

1. Introduction

Pre-trained language models, such as BERT [1] and GPT-3 [2], have shown significant improvements in performance across various Natural Language Processing (NLP) tasks. However, applying these models to specific domains such as legal documents raises unique challenges. Legal documents are often lengthy and lack explicit structure, which requires identifying their coherent parts, known as Rhetorical Roles (RRs), for tasks such as summarization, information extraction, and legal reasoning [3, 4, 5].

In this research work, we are interested in the task of rhetorical role prediction in legal judgments. In that context, examples of RRs are: PREAMBLE (meta-data related to the legal judgment document), FACTS (chronology of events that led to filing the case), RPC (final decisions ruled by the present court), etc. In particular, we work with the dataset provided by the organizers of the SemEval 2023 LegalEval competition for the rhetorical role prediction task¹. For this task, Hierarchical Sequential Labelling Network (HSLN) [6, 4, 7, 8] and Pre-trained Language Models (PLM) which are able to handle long [...] Simple text statistics on the data show that a text contains on average 4346.07 sub-words² (±2151.08) and therefore exceeds the maximum input length any Pre-trained Language Model can handle. In addition, the best current system does not exceed an F1-score of 87%, which justifies an interest in the task [12].

¹ https://sites.google.com/view/legaleval

Proceedings of the Sixth Workshop on Automated Semantic Analysis of [...], CEUR Workshop Proceedings, http://ceur-ws.org, ISSN 1613-0073

Our main contributions are the following:

• we enhance the pre-trained language model BERT with sentence position information at input;
• we study the sentence position information under various representations (absolute, normalized and K-quantile);
• we consider two architectures to contextualize the sentence representations: 1) a single BERT encoder and 2) a hierarchical model made of a BERT encoder layer to encode sentences and a shallow encoder (with two Transformer layers) to contextualize a sentence with its surrounding sentences;
• we evaluate these various models in the context of a rhetorical role sequence labelling task for legal judgments.

Our related work section (Section 2) covers various topics, from the fusion of discourse information in language models to some rhetorical role prediction system architectures, including a presentation of position embeddings. Then, in Section 3, we detail the models we propose. In Section 4, we present the methodology we use for our fine-tuning evaluation. Finally, we present our results and discuss our future work in the last sections. Our code will be publicly available on MASKED_URL.

² Computed with the BERT tokenizer.

2. Related work

Injecting discourse information in language models. Legal texts share linguistic characteristics specific to the legal domain (and often to a legal sub-domain). They have legal jargon, long sentences, unusual word order and long length [13, 14, 8, 15]. These characteristics prevent us from taking full advantage of the state-of-the-art language models trained on the general domain and even expose their limitations, since most of them cannot handle a text that goes beyond their maximum input length.

Transformers [16] suffer from a quadratic computational and memory complexity with respect to the sequence length. This led most of the SOTA models (e.g. BERT [1], RoBERTa [17], LegalBERT [14]) to adopt 512 as their maximum sequence length.

Pre-training or retraining with discourse-based objectives can result in sentence and text representations that are better suited to NLP tasks at the discourse level [18, 19, 20], but this kind of approach does not directly address the limitation of the input length.

In terms of neural architecture adaptation, Hierarchical Attention Networks (HANs) have been proposed to model a sequence of sentences by stacking two layers of encoders: one to capture the word sequences and another (taking the former as input) to capture the sentence sequence. This architecture has been shown to perform significantly better than single-layer encoders for text classification [21], text segmentation [22], recommendation [20] and sequence labelling tasks [8]. To avoid a large number of padded words in the first layer, [20] proposed to concatenate as many natural sentences as the input block can fit. Although it extends the scope of a single transformer encoder, the HAN architecture does not solve the complexity problems and has to process a fixed length of sentence sequence and truncate documents that are too long. Recent architectures such as Longformer [9], BigBird [10] or Ernie [11] succeeded in extending the maximum sequence length to 4096 while reducing the Transformer's complexity³ by introducing sparsity into attention layers (i.e. by allowing each token position to attend to a subset of token positions with respect to some sparse patterns). To improve such models, [15] suggested to consider the actual logical structure of documents.

Apart from these more complex architectures, [22] showed that, for a text segmentation task, looking at the local context around each candidate break (by taking the end of the previous sentences and the beginning of the following ones as input) is sufficient to obtain performance comparable to the HAN architectures.

Rhetorical Role Prediction. The analysis of the rhetorical text⁴ structure includes several tasks [24, 25, 26, 8, 12]: 1) text segmentation into rhetorical text units, 2) rhetorical role identification of each text unit, 3) structure prediction, which links the text units together, and 4) relation labelling to name the connections. We focus here on the task of rhetorical role identification, which has been considered in the literature as a sequence labelling task taking the sentence as the minimal text unit [12].

As reported by [12], the Hierarchical Sequential Labelling Network (HSLN) remains the most effective architecture for this task (at least for the LegalEval dataset) [6, 4, 7, 8]. The model first encodes sentence representations (e.g. by using sent2vec [27] or any PLM like BERT or LegalBERT). Then it contextualizes the sentence representations (e.g. through a BiLSTM layer) and eventually predicts the label sequence with a sequence labelling layer (like a CRF). On the LegalEval dataset, the best performance was obtained by participants who used domain adaptation techniques such as a pre-trained language model trained on legal text, or augmented datasets. The baseline model was based on the HSLN architecture and had a performance of 79% F1 score. The proposed methods show an improvement over the baseline without ever outperforming it by more than seven points. [28] showed that a single BERT can be sufficient to capture contextual dependencies without the need for hierarchical contextual encoding or a CRF sequence labeler. The approach uses BERT to encode a concatenation of sentences (fixed at 10 sentences) and uses an MLP over each encoded sentence separator token to predict the corresponding sentence label. Despite its low complexity, the approach is limited by the length of the input sequence and requires tiling the whole text to obtain the whole label sequence. [8] did not confirm the effectiveness of the method on the LegalEval dataset.

Position embeddings. By nature, the Transformer architecture is not sensitive to the order of input tokens. To make the model position-aware, the position information of the input words is typically added as an additional embedding to the input token embeddings [16]. While absolute sinusoidal position encodings were utilized in the vanilla Transformer, some works showed that learned position embeddings can provide more flexibility in adapting to different tasks through back-propagation [1, 29], instead of using hand-crafted position representations [30, 31]. Multiple works also explored different kinds of token position information (absolute, relative) and ways to include it in Transformers (e.g. in the input or the attention matrix) [30, 32]. Very few were interested in sentence position information. [20] indicate that they fuse sentence block representations with sentence block position embeddings, but without mentioning precisely the nature of the position (relative, absolute...) or how the position table is built.

³ See [23] for a survey of techniques to address Transformer's limitations.
⁴ We do not make a distinction here between the intra- and inter-sentential levels.

BERT Input Representation. BERT is one of the models that utilize three types of learned embedding layers (See Figure 1): Token Embeddings, Segment Embeddings, and Position Embeddings [1]:

• Token Embeddings: This layer is responsible for converting each word in the input text into a fixed-dimensional vector representation. In the BERT Base model, each word is represented as a 768-dimensional vector.
• Segment Embeddings: This layer has the task of distinguishing between the inputs in a given pair by assigning one of two vector representations to each token in the input.
• Position Embeddings: BERT takes into account the sequential nature of input sequences by learning a vector representation for each token position in the input. The Position Embeddings layer is a lookup table of size (512, 768), where each row represents the vector representation of a word at a specific position in the input sequence.

The representations from Token Embeddings, Segment Embeddings, and Position Embeddings are summed element-wise to produce a single representation with shape (1, n, 768), where n is the length of the input sequence. This combined representation captures contextual information of tokens in the input text. Although BERT has shown effective results for many tasks such as question answering and sentiment analysis of tweets, it performs poorly on lengthy documents. Moreover, it does not incorporate any information about the position of sentences within a document, which can be crucial for identifying the RR of a particular sentence. To address this limitation, we propose the addition of sentence position embeddings to the BERT embedding, which aims to enhance the performance of RR prediction in legal opinions.

3. Fusing sentence position embeddings at input

In this section, we focus on the approach that we have developed for injecting discursive knowledge through input embeddings, without pre-training. Input embeddings are a popular way in natural language processing (NLP) to represent text data in a high-dimensional vector space. By incorporating discursive knowledge into these embeddings, we aim to improve the performance of NLP models without the need for pre-training.

3.1. BERT-SentPos: BERT sentence encoder enhanced with sentence position embeddings at input

In the analysis we have conducted on legal documents, we have observed that each rhetorical role has a specific position within the document (See Figure 2). For example, the preamble role is found at the beginning, the analysis role in the middle of the document, and the RPC role at the end.

Figure 2: Average sentence position variation by label on the dataset.

To improve our model's performance, we decided to incorporate additional information that indicates the position of each sentence in the document. We achieved this by adding an extra embedding layer to the BERT embedding, which helped us capture the sequential nature of sentence positions in vector space. Various Position Embeddings (PEs) have been proposed in Transformer-based architectures [16] to "capture the sequential nature of positions in vector space." These PEs range from fixed ad-hoc schemes to fully learnable ones. BERT, in particular, uses fully learnable PEs. In the interest of simplicity, we decided to reuse the learned PEs to represent the sentence Position Embeddings. By doing so, we were able to create a more accurate and effective model for analyzing legal documents.
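The fusion described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: plain Python lists stand in for tensors, the tables hold random values rather than trained weights, and the names `make_table`, `embed`, `HIDDEN` and `MAX_POSITIONS` are hypothetical. Two further assumptions are made loudly here: the sentence position vector is added to every token of the input, and a sentence position beyond the 512 rows of the reused table is clamped to the last row (the paper only notes this overflow as a limitation).

```python
import random

HIDDEN = 8            # 768 in BERT Base; kept small for the sketch
MAX_POSITIONS = 512   # number of rows in BERT's learned position table

random.seed(0)


def make_table(rows, dim):
    """Stand-in for a learned embedding table (random values, not trained)."""
    return [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


token_table = make_table(100, HIDDEN)        # toy vocabulary of 100 ids
segment_table = make_table(2, HIDDEN)        # two segments, as in BERT
position_table = make_table(MAX_POSITIONS, HIDDEN)


def embed(token_ids, segment_ids, sentence_position):
    """Sum token, segment, token-position and sentence-position vectors.

    The sentence position reuses the token position table, mirroring the
    choice to reuse the learned PEs for sentence positions.
    """
    # Assumption: positions beyond the table are clamped to its last row.
    sent_row = position_table[min(sentence_position, MAX_POSITIONS - 1)]
    output = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p + sp
            for t, s, p, sp in zip(
                token_table[tok], segment_table[seg], position_table[pos], sent_row
            )
        ]
        output.append(vec)
    return output  # shape (n, HIDDEN), i.e. one vector per input token


if __name__ == "__main__":
    vectors = embed([5, 6, 7], [0, 0, 1], sentence_position=12)
    print(len(vectors), len(vectors[0]))
```

In a real model the same sum would be computed on tensors inside the embedding layer, and the clamping rule would have to be replaced by whatever policy the actual system uses for very long documents.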

In this research, we examined several ways of encoding the position of a sentence within a legal document. Specifically, we explored three representations of sentence position: absolute position, normalized position, and k-quantile position.
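As a rough preview of these three representations, the sketch below computes each position value for one sentence. The function names are hypothetical and indices are assumed to be 1-based. The normalized formula follows the worked example given later in this section (index multiplied by the length of the largest document, divided by the length of the current document); rounding to the nearest integer and assigning a sentence to a quantile by ceiling division are assumptions of this sketch, not details confirmed by the paper.

```python
def absolute_position(index):
    """1-based rank of the sentence inside its own document."""
    return index


def normalized_position(index, doc_len, max_doc_len):
    """Rescale a 1-based index onto the longest document of the corpus.

    Assumption: the result is rounded to the nearest integer.
    """
    return round(index * max_doc_len / doc_len)


def quantile_position(index, doc_len, k):
    """1-based number of the document part (out of k) holding the sentence.

    Assumption: integer ceiling of index * k / doc_len.
    """
    return (index * k + doc_len - 1) // doc_len


if __name__ == "__main__":
    doc_len, max_doc_len, k = 50, 100, 8
    for index in (1, 25, 50):
        print(
            index,
            absolute_position(index),
            normalized_position(index, doc_len, max_doc_len),
            quantile_position(index, doc_len, k),
        )
```

With these definitions, `normalized_position(25, 50, 100)` gives 50, which matches the Document A and Document B example of this section. Any of the three values can then serve as the row index into the position embedding table.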

Absolute position refers to the location of a sentence within a particular document in relation to the other sentences of the same document. For instance, we may have a document that includes:
· Sentence 1: "The court hereby orders the defendant to appear for a hearing on Monday."
· Sentence 2: "The defendant shall provide all relevant documents to the plaintiff's attorney by Friday."
In this case, Sentence 1 has the absolute position of the first sentence in the document, while Sentence 2 has the absolute position of the second sentence.

Normalized position refers to the process of converting the position of a sentence in one document to a corresponding position in the largest document of the corpus, i.e. the one with the maximum number of sentences. This is done to ensure consistent normalization across different documents, by aligning sentence positions with respect to a common reference point. For example, let's consider Document A, which consists of 50 sentences, and Document B, the largest document in the corpus, containing 100 sentences. If we take Sentence 25 from Document A, we can calculate its normalized position as (25 multiplied by 100 divided by 50) = 50. This aligns Sentence 25 with the same relative position in Document B.

K-quantile position. The analysis of legal documents often involves addressing variations in document length and unique writing styles of judges. This technique involves dividing the documents into k parts or quantiles to help control the absolute position of each rhetorical role. For example, consider a Document X divided into 4 quantiles:
· Quantile 1: "The court presents the background of the case."
· Quantile 2: "The court discusses relevant legal precedents."
· Quantile 3: "The court analyzes the evidence presented by both parties."
· Quantile 4: "The court renders its final decision and issues the judgment."
Dividing the opinion into quantiles, each containing a specific role or aspect, helps in addressing variations in document length and unique writing styles of judges.

Figure 3: BERT-SentPos: A BERT Architecture with Sentence Position Embeddings fused at the BERT input layer using the Learnable Position Embedding of Tokens.

Figure 3 illustrates our proposed architecture, which utilizes the BERT encoder. We input pairs of consecutive sentences, denoted as sentence i and sentence i + 1, along with the sentence position of the predicted sentence i, to enable our model to learn contextualized representations that consider the relationships between neighboring sentences. To capture positional information, we added the sentence position embeddings to the BERT embeddings. These embeddings provide the model with the position of each sentence within the document. The combined representation is passed through encoders and feed-forward layers to predict the rhetorical role of sentence i. In the end we obtained three models: BERT enhanced with Absolute Sentence Position Embeddings (BERT-AbsPos), BERT enhanced with 8-Quantile Sentence Position (BERT-8-QuantilePos), and BERT enhanced with Normalized Sentence Position (BERT-NorPos).

3.2. HiBERT: Hierarchical variant to contextualize sentence representations

In this section, we present HiBERT, a hierarchical variant of BERT-SentPos described in Section 3.1 (See Figure 4). The model is based on the hierarchical model of [14, 33]. The model aims to label a sentence according to the sentences that precede and follow it within a certain range. Each sentence representation is enhanced with sentence positional information before being encoded by a BERT encoder. The process generates a top-level representation, denoted as [CLS], for each sentence. These sentence representations are then fed into a 2-layered transformer encoder to contextualize a sentence sequence. This encoder follows the same specifications as BERT, including hidden units and the number of attention heads. Eventually we utilize a dense layer to predict the label of the sentence in focus in the current sentence sequence.

We have set a window of surrounding sentences to a maximum of ±7 sentences. This range size corresponds to the average number of consecutive sentences with the same RR label in our train dataset. Four variations of HiBERT were experimented: HiBERT-AbsPos, which uses Absolute Sentence Positions with the maximum window size; HiBERT-NorPos, which uses Normalized Sentence Positions with the maximum window size; HiBERT-AbsPosHalf, which uses Absolute Sentence Positions with half of the maximum window size; and HiBERT-NorPosHalf, which uses Normalized Sentence Positions with half of the maximum window size. Overall, the hierarchical model with more attention is an effective solution for processing legal documents that consist of thousands of words. By using a hierarchical approach and taking into account the context of the text, the model is able to effectively process lengthy documents and make accurate classifications based on the document's content.

4. Experimental methodology

We evaluate our contributions in the fine-tuning phase of pre-trained models.

4.1. The LegalEval Dataset

We utilized the data supplied by Sub-task A "Rhetorical Roles Prediction" of the SemEval 2023 Task 6 "LegalEval Understanding Legal Texts" challenge⁵. The dataset comprises Indian legal data extracted from court judgments and includes 13 different RRs, with the details and definitions for each RR provided in the article by Kalamkar et al. [8]. The average number of sentences per document is 117.31.

To prepare the dataset, we kept the same LegalEval split (train and validation data). We used 90% of the train data to train the model and the remaining 10% to validate it. Furthermore, we used the original validation data to evaluate the performance of the trained models. The statistics for splitting the corpus are shown in Table 1.

4.2. Hyperparameters

For the fine-tuning setup, hyperparameters are determined through experimentation and analysis. The batch size is set to 8, taking into consideration the available computational resources and model performance. The learning rate is set to 2e-5, which is a commonly used value for fine-tuning NLP models. The epoch number is chosen from {1, 2, 3}, with the final epoch number selected based on a balance between training time and model performance for each model.

4.3. Performance measures

The performance of the NLP models for the rhetorical roles task is assessed using Weighted-Precision, Weighted-Recall, Accuracy, Weighted-F1 and Macro-F1 scores based on the hidden test set. The weighted F1 score considers both precision and recall, and it is calculated by taking into account the class-wise F1 scores weighted by the number of samples in each class. The Macro F1 score provides an overall assessment of model performance by calculating the F1 score for each class independently and then taking the average.

5. Results

Table 2 (only one row of scores survives, without its model label): 0.79 0.81 0.81 0.79 0.57

In this section, we present the experimental results of our study on injecting different types of sentence position embeddings in the BERT model (See Table 2). As mentioned earlier, we used the weighted F1 score as our primary performance metric. As a baseline, we report the score obtained by the BERT-HSLN⁶ model [12], which achieved a performance of 79%. Our first experiment was conducted using BERT with three types of sentence position embeddings: Absolute position (BERT-AbsPos), Normalized position (BERT-NorPos), and K-quantile position (BERT-8-QuantilePos). Our results revealed that plain BERT performs poorly (65%) compared to the proposed models with any of the three types of position. For the K-quantile position, we experimented with different numbers of parts and found that the best division based on performance is with 8 parts. However, we found that BERT with normalized position achieved a better score of 75%. We attribute this improvement to our efforts in controlling the variation in length across different legal documents.

To further enhance the performance of the recognition of RRs, we experimented with a hierarchical model that combines BERT with more attention and a window size equal to the average number of consecutive sentences with the same RR label, while also injecting sentence position information (HiBERT-AbsPos and HiBERT-NorPos). Unfortunately, we did not observe any significant improvement of the results. Subsequently, we halved the window size of sentences taken into consideration (HiBERT-AbsPosHalf and HiBERT-NorPosHalf). This led to an improvement in the results, particularly with Absolute position, which reached 79%. We attribute this success to the fact that absolute position captures the specific position of a sentence within a document, providing crucial contextual information about the predicted sentence based on its surrounding sentences. This improvement with a smaller context window can be explained by the fact that the semantics of a sentence are usually more dependent on local context than on knowing all sentences in a paragraph, for example.

Overall, our experiments demonstrated that using different types of sentence position information can significantly improve the performance of BERT on legal document classification tasks. Additionally, a hierarchical model that combines BERT with absolute position information and a window size of 4 sentences (⌈7/2⌉) can further enhance the performance of our proposed models. Furthermore, our approach also achieved low computational time, making it efficient and practical for real-world applications.

⁶ This work is part of OpenNyAI https://opennyai.org/ mission, which is funded by the EkStep Foundation https://ekstep.org/.

6. Conclusion and Future Work

The results of our study indicate that the inclusion of absolute, normalized, and k-quantile positional embeddings can significantly improve the performance of both BERT base and its hierarchical variant for the Rhetorical Roles Prediction Task. However, there is room for improvement. It should be noted, for example, that one potential limitation of this approach is that the number of sentences in a given document may exceed the dimensions of the learnable embedding matrix. The framework proposed by [30] can help us determine the interest of learning dedicated sentence position embeddings or using sinusoidal PEs. In addition, in the context of token position embeddings, [32] showed that encoding position in the attention matrix per head yields better performance than adding position embeddings to the input. A similar experiment could be conducted on the sentence position embeddings. Looking ahead, our future work aims to develop a new architecture with a greater number of layers that incorporates additional features such as metadata pertaining to each legal opinion.

Furthermore, we plan to explore the potential benefits of pre-training instead of fine-tuning, and of fine-tuning with LegalBERT [14], a BERT-based model pre-trained on legal text. Our motivation for this direction is based on the fact that current state-of-the-art models may not be sufficient to achieve optimal performance for legal NLP tasks. By incorporating these improvements, we hope to develop a more robust architecture that achieves optimal performance for the Rhetorical Roles Prediction Task and other legal NLP tasks.

Acknowledgments

This research was funded, in whole or in part, by l'Agence Nationale de la Recherche (ANR), project ANR-22-CE38-0004.

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.

[3] M. Saravanan, B. Ravindran, S. Raman, Automatic identification of rhetorical roles using conditional random fields for legal document summarization, in: Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I, 2008.

[4] P. Bhattacharya, S. Paul, K. Ghosh, S. Ghosh, A. Z. Wyner, Identification of rhetorical roles of sentences in indian legal judgments, in: The Thirty-second International Conference on Legal Knowledge and Information Systems (JURIX), volume 322, 2019, p. 3.

[5] V. Malik, R. Sanjay, S. K. Guha, A. Hazarika, S. Nigam, A. Bhattacharya, A. Modi, Semantic segmentation of legal documents via rhetorical roles, in: Proceedings of the Natural Legal Language Processing Workshop 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 153–171. URL: https://aclanthology.org/2022.nllp-1.13.

[6] D. Jin, P. Szolovits, Hierarchical neural networks for sequential sentence classification in medical scientific abstracts, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3100–3109. URL: https://aclanthology.org/D18-1349. doi:10.18653/v1/D18-1349.

[7] A. Brack, A. Hoppe, P. Buschermöhle, R. Ewerth, Cross-domain multi-task learning for sequential sentence classification in research papers, in: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, JCDL '22, Association for Computing Machinery, New York, NY, USA, 2022. URL: https://doi.org/10.1145/3529372.3530922. doi:10.1145/3529372.3530922.

[8] P. Kalamkar, A. Tiwari, A. Agarwal, S. Karn, S. Gupta, V. Raghavan, A. Modi, Corpus for automatic structuring of legal documents, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 4420–4429. URL: https://aclanthology.org/2022.lrec-1.470.

[9] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, 2020. arXiv:2004.05150.

[10] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.

[11] X. Ouyang, S. Wang, C. Pang, Y. Sun, H. Tian, H. Wu, H. Wang, ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 27–38. URL: https://aclanthology.org/2021.emnlp-main.3. doi:10.18653/v1/2021.emnlp-main.3.

[12] A. Modi, P. Kalamkar, S. Karn, A. Tiwari, A. Joshi, S. K. Tanikella, S. K. Guha, S. Malhan, V. Raghavan, Semeval 2023 task 6: Legaleval - understanding legal texts, 2023. arXiv:2304.09548.

[13] H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, M. Sun, How does NLP benefit legal system: A summary of legal artificial intelligence, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5218–5230. URL: https://aclanthology.org/2020.acl-main.466. doi:10.18653/v1/2020.acl-main.466.

[14] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.

[15] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, N. Aletras, LexGLUE: A benchmark dataset for legal language understanding in English, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4310–4330. URL: https://aclanthology.org/2022.acl-long.297. doi:10.18653/v1/2022.acl-long.297.

[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019. arXiv:1907.11692.

[18] Y. Jernite, S. R. Bowman, D. Sontag, Discourse-based objectives for fast unsupervised sentence representation learning, 2017. arXiv:1705.00557.

[19] D. Iter, K. Guu, L. Lansing, D. Jurafsky, Pretraining with contrastive sentence objectives improves discourse performance of language models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4859–4870. URL: https://aclanthology.org/2020.acl-main.439. doi:10.18653/v1/2020.acl-main.439.

[20] L. Yang, M. Zhang, C. Li, M. Bendersky, M. Najork, Beyond 512 tokens: Siamese multi-depth transformer-based hierarchical encoder for long-form document matching, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1725–1734. URL: https://doi.org/10.1145/3340531.3411908. doi:10.1145/3340531.3411908.

[21] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 1480–1489. URL: https://aclanthology.org/

[23] [...] Comput. Surv. (2023). URL: https://doi.org/10.1145/3586074. doi:10.1145/3586074, just accepted.

[24] M. Lippi, P. Torroni, Argumentation mining: State of the art and emerging trends, ACM Trans. Internet Technol. 16 (2016). URL: https://doi.org/10.1145/2850417. doi:10.1145/2850417.

[25] H. Yamada, S. Teufel, T. Tokunaga, Building a corpus of legal argumentation in japanese judgement documents: towards structure-based summarisation, Artificial Intelligence and Law 27 (2019) 141–170. URL: https://doi.org/10.1007/s10506-019-09242-3. doi:10.1007/s10506-019-09242-3.

[26] J. W. G. Putra, S. Teufel, T. Tokunaga, Parsing argumentative structure in English-as-foreign-language essays, in: Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, Online, 2021, pp. 97–109. URL: https://aclanthology.org/2021.bea-1.10.

[27] M. Pagliardini, P. Gupta, M. Jaggi, Unsupervised learning of sentence embeddings using compositional n-gram features, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 528–540. URL: https://aclanthology.org/N18-1049. doi:10.18653/v1/N18-1049.

[28] A. Cohan, I. Beltagy, D. King, B. Dalvi, D. Weld, Pretrained language models for sequential sentence classification, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3693–3699. URL: https://aclanthology.org/D19-1383. doi:10.18653/v1/D19-1383.

[29] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, in: International conference on machine learning, PMLR, 2017, pp. 1243–1252.

N16-1174. doi:1 0 . 1 8 6 5 3 / v 1 / N 1 6 - 1 1 7 4 . [30] B. Wang, L. Shan, C. Lioma, X. Jiang, H. Yang, Q. Liu, [22] M. Lukasik, B. Dadachev, K. Papineni, G. Simões, J. Simonsen, On position embeddings in bert, 2021, Text segmentation by cross segment attention, pp. 1–21. 9th International Conference on Learning in: Proceedings of the 2020 Conference on Em- Representations - ICLR 2021 ; Conference date: 03pirical Methods in Natural Language Processing 05-2021 Through 07-05-2021. (EMNLP), Association for Computational Linguis- [31] T. Lin, Y. Wang, X. Liu, X. Qiu, A survey tics, Online, 2020, pp. 4707–4716. URL: https: of transformers, AI Open 3 (2022) 111–132. //aclanthology.org/2020.emnlp-main.380. doi:1 0 . URL: https://www.sciencedirect.com/science/ 1 8 6 5 3 / v 1 / 2 0 2 0 . e m n l p - m a i n . 3 8 0 . article/pii/S2666651022000146. doi:h t t p s : [23] Q. Fournier, G. M. Caron, D. Aloise, A practical / / d o i . o r g / 1 0 . 1 0 1 6 / j . a i o p e n . 2 0 2 2 . 1 0 . 0 0 1 . survey on faster and lighter transformers, ACM [32] P.-C. Chen, H. Tsai, S. Bhojanapalli, H. W. Chung, Y.-W. Chang, C.-S. Ferng, A simple and efective positional encoding for transformers, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 2974–2988.

URL: https://aclanthology.org/2021.emnlp-main.

236. doi:10.18653/v1/2021.emnlp- main.236. [33] I. Chalkidis, M. Fergadiotis, D. Tsarapatsanis,

N. Aletras, I. Androutsopoulos, P. Malakasiotis, Paragraph-level rationale extraction through regularization: A case study on European court of human rights cases, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 226–241.

URL: https://aclanthology.org/2021.naacl-main.22. doi:10.18653/v1/2021.naacl- main.22.