<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Pre-Trained Language Models with Sentence Position Embeddings for Rhetorical Roles Recognition in Legal Opinions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anas Belfathi</string-name>
          <email>anas.belfathi@edu.univ-paris13.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Hernandez</string-name>
          <email>nicolas.hernandez@univ-nantes.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Monceaux</string-name>
          <email>laura.monceaux@univ-nantes.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nantes Université, École Centrale Nantes</institution>
          ,
          <addr-line>CNRS, LS2N, UMR 6004, F-44000 Nantes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The legal domain is a vast and complex field that involves a considerable amount of text analysis, including laws, legal arguments, and legal opinions. Legal practitioners must analyze these texts to understand legal cases, research legal precedents, and prepare legal documents. Legal opinions keep growing in length, and their complexity and diversity make it increasingly challenging to develop a model that accurately predicts their rhetorical roles. In this paper, we propose a novel model architecture for automatically predicting rhetorical roles using pre-trained language models (PLMs) enhanced with information about sentence position within a document. Based on an annotated corpus from the LegalEval@SemEval2023 competition, we demonstrate that our approach requires fewer parameters, and therefore lower computational costs, than complex architectures employing a hierarchical model in a global context, while achieving comparable performance. Moreover, we show that adding more attention to a hierarchical model based only on BERT in the local context, along with incorporating sentence position information, enhances the results.</p>
      </abstract>
      <kwd-group>
        <kwd>Pre-trained language models</kwd>
        <kwd>Discourse structure modeling</kwd>
        <kwd>Legal Opinions</kwd>
        <kwd>Sentence Positional Embeddings</kwd>
        <kwd>Rhetorical Role</kwd>
        <kwd>Sequence Labelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Pre-trained language models, such as BERT [1] and GPT-3
[2], have shown significant improvements in
performance across various Natural Language Processing (NLP)
tasks. However, applying these models to
specific domains such as legal documents raises unique
challenges. Legal documents are often lengthy and lack
explicit structure, which requires identifying coherent
parts, known as Rhetorical Roles (RRs), for tasks such as
summarization, information extraction, and legal
reasoning [3, 4, 5].</p>
      <p>In this work, we are interested in the task of
rhetorical role prediction in legal judgments. In that
context, examples of RRs are PREAMBLE (metadata related
to the legal judgment document), FACTS (chronology of
events that led to filing the case) and RPC (final decisions
ruled by the present court). In particular, we work
with the dataset provided by the organizers of the
SemEval 2023 LegalEval competition for the rhetorical role
prediction task<sup>1</sup>. For this task, Hierarchical Sequential</p>
      <sec id="sec-1-1">
        <title>Labelling Network (HSLN) [6, 4, 7, 8] and Pre-trained</title>
      </sec>
      <sec id="sec-1-2">
        <title>Language Models (PLM) which are able to handle long</title>
        <p>Proceedings of the Sixth Workshop on Automated Semantic Analysis of [...]. CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.</p>
        <p><sup>1</sup> https://sites.google.com/view/legaleval.</p>
        <p>Simple text statistics on the data show that a text contains
on average 4346.07 sub-words<sup>2</sup> (±2151.08) and therefore
exceeds the maximum input length that any pre-trained
language model can handle. In addition, the best current
system does not exceed an F1-score of 87%, which justifies
the interest in the task [12].</p>
      </sec>
      <sec id="sec-1-3">
        <title>Our main contributions are the following:</title>
        <p>• we enhance the pre-trained language model BERT
with sentence position information at input;
• we study the sentence position information under
various representations (absolute, normalized and
K–quantile);
• we consider two architectures to contextualize the
sentence representations: 1) a single BERT encoder
and 2) a hierarchical model made of a BERT encoder
layer to encode sentences and a shallow encoder
(with two Transformer layers) to contextualize a
sentence with its surrounding sentences;
• we evaluate these various models in the context of
a rhetorical role sequence labelling task for legal
judgments.</p>
      </sec>
      <sec id="sec-1-4">
        <p>Our related work section (Section 2) covers various
topics, from the fusion of discourse information in
language models to rhetorical role prediction system
architectures, including a presentation of position embeddings.
Then, in Section 3, we detail the models we propose.
In Section 4, we present the methodology we use for our
fine-tuning evaluation. Finally, we present our results and
discuss our future work in the last sections.
Our code will be publicly available on MASKED_URL.</p>
        <p><sup>2</sup> Computed with the BERT tokenizer.</p>
      </sec>
      <sec id="sec-1-5">
        <title>2. Related work</title>
        <p>Injecting discourse information in language models.
Legal texts share linguistic characteristics specific to
the legal domain (and often to a legal sub-domain). They
have legal jargon, long sentences, unusual word order
and long length [13, 14, 8, 15]. These characteristics
prevent taking full advantage of the state-of-the-art
language models trained on the general domain, and even
expose their limitations, since most of them cannot handle
a text whose length goes beyond their maximum input
length.</p>
        <p>Transformers [16] suffer from a quadratic computational
and memory complexity with respect to the sequence
length. This led most of the SOTA models (e.g.
BERT [1], RoBERTa [17], LegalBERT [14]) to adopt 512
as their maximum sequence length.</p>
        <p>Pre-training or retraining with discourse-based objectives
can result in sentence and text representations that
are better adapted to NLP tasks at the discourse
level [18, 19, 20], but this kind of approach does
not directly address the limitation of the input length.</p>
        <p>In terms of neural architecture adaptation, Hierarchical
Attention Networks (HANs) have been proposed to
model a sequence of sentences by stacking two layers
of encoders: one to capture the word sequences and another
(taking the former as input) to capture the sentence
sequence. This architecture has been shown to perform
significantly better than single layer encoders for text
classification [21], text segmentation [22], recommendation
[20] and sequence labelling tasks [8]. To avoid a
large number of padded words in the first layer, [20]
proposed to concatenate as many natural sentences as
the input block can fit. Although it extends the scope of the
single transformer encoder, the HAN architecture does
not solve the complexity problems and has to process a
fixed-length sentence sequence and truncate documents that are
too long. Recent architectures such as Longformer
[9], BigBird [10] or ERNIE [11] succeeded in extending
the maximum sequence length to 4096 while reducing the
Transformer's complexity<sup>3</sup> by introducing sparsity into
attention layers (i.e. by allowing each token position to
attend to a subset of token positions with respect to some
sparse patterns). To improve such models, [15] suggested
considering the actual logical structure of documents.</p>
        <p><sup>3</sup> See [23] for a survey of techniques to address the Transformer's limitations.</p>
        <p>Apart from these more complex architectures, [22]
showed that, for a text segmentation task, looking at
the local context around each candidate break (by taking
the end of the previous sentences and the beginning of
the following ones as input) is sufficient to obtain performance
comparable to the HAN architectures.</p>
        <p>Rhetorical Role Prediction. The analysis of the
rhetorical text<sup>4</sup> structure includes several tasks [24, 25,
26, 8, 12]: 1) text segmentation into rhetorical text units,
2) rhetorical role identification of each text unit, 3)
structure prediction, which links the text units together, and
4) relation labelling to name the connections. We focus
here on the task of rhetorical role identification, which has
been considered in the literature as a sequence labelling
task taking the sentence as the minimal text unit [12].</p>
        <p><sup>4</sup> We do not make a distinction here between the intra- and inter-sentential levels.</p>
        <p>As reported by [12], the Hierarchical Sequential
Labelling Network (HSLN) remains the most efficient
architecture for this task (at least for the LegalEval dataset)
[6, 4, 7, 8]. The model first encodes sentence
representations (e.g. by using sent2vec [27] or any PLM like BERT
or LegalBERT). Then it contextualizes the sentence
representations (e.g. through a BiLSTM layer) and eventually
predicts the label sequence with a sequence labelling
layer (like a CRF). On the LegalEval dataset, the best
performances were obtained by participants who used
domain adaptation techniques, like a pre-trained language model
trained on legal text, or augmented datasets. The
baseline model was based on the HSLN architecture and had
a performance of 79% F1 score. The proposed methods
show an improvement over the baseline without ever
outperforming it by more than seven points. [28]
showed that a single BERT can be sufficient to capture
contextual dependencies without the need for
hierarchical contextual encoding or a CRF sequence labeler.
The approach uses BERT to encode a concatenation of
sentences (fixed at 10 sentences) and uses an MLP over each
encoded sentence separator token to predict the
corresponding sentence label. Despite its low complexity, the
approach is limited by the length of the input sequence
and requires tiling the whole text to obtain the whole
label sequence. [8] did not confirm the effectiveness of
the method on the LegalEval dataset.</p>
        <p>Position embeddings. By nature, the Transformer
architecture is not sensitive to the order of input tokens.
To make the model position-aware, the position
information of the input words is typically added as an
additional embedding to the input token embeddings [16].
While absolute sinusoidal position encodings were
utilized in the vanilla Transformer, some works showed that
learned position embeddings can provide more flexibility
in adapting to different tasks through back-propagation
[1, 29], instead of using hand-crafted position
representations [30, 31]. Multiple works also explored different
token position information (absolute, relative) and ways
to include it in Transformers (e.g. in the input or the
attention matrix) [30, 32]. Very few were interested in
sentence position information. [20] propose fusing
sentence block representations with sentence block position
embeddings, but without stating precisely the nature
of the position (relative, absolute...) or how the position
table is built.</p>
        <p>Figure 2: Average sentence position variation by label on the
dataset.</p>
        <p>BERT Input Representation. BERT is one of the
models that utilizes three types of learned embedding
layers (See Figure 1): Token Embeddings, Segment
Embeddings, and Position Embeddings [1]:</p>
        <p>• Token Embeddings: This layer is responsible for
converting each word in the input text into a
fixed-dimensional vector representation. In the BERT
Base model, each word is represented as a
768-dimensional vector.
• Segment Embeddings: This layer has the task of
distinguishing between the inputs in a given pair by
assigning one of two vector representations to each
token in the input.
• Position Embeddings: BERT takes into account
the sequential nature of input sequences by learning
a vector representation for each token position in the
input. The Position Embeddings layer is a lookup
table of size (512, 768), where each row represents the
vector representation of a word at a specific position
in the input sequence.</p>
        <p>The representations from Token Embeddings, Segment
Embeddings, and Position Embeddings are summed
element-wise to produce a single representation with
shape (1, n, 768), where n is the length of the input
sequence. This combined representation captures contextual
information of tokens in the input text. Although
BERT has shown effective results for many tasks such as
question answering and sentiment analysis of tweets, it
performs poorly on lengthy documents.
Moreover, it does not incorporate any information about
the position of sentences within a document, which can
be crucial for identifying the RR of a particular sentence.
To address this limitation, we propose the addition of
sentence position embeddings to the BERT embedding,
which aims to enhance the performance of RR prediction
in legal opinions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Fusing sentence position embeddings at input</title>
      <p>In this section, we focus on the approach that we have
developed for injecting discursive knowledge without
pre-training, through input embeddings. The use of
input embeddings has been a popular approach in natural
language processing (NLP) for representing text data in
a high-dimensional vector space. By incorporating
discursive knowledge into these embeddings, we aim to
improve the performance of NLP models without the
need for pre-training.</p>
      <sec id="sec-2-1">
        <title>3.1. BERT-SentPos: BERT sentence encoder enhanced with sentence position embeddings at input</title>
        <p>In the analysis we have conducted on legal documents,
we have observed that each rhetorical role has a specific
position within the document (See Figure 2). For example,
the preamble role is found at the beginning, the analysis
role in the middle of the document, and the RPC role
at the end. To improve our model's performance, we
decided to incorporate additional information that
indicates the position of each sentence in the document. We
achieved this by adding an extra embedding layer to the
BERT embedding, which helped us capture the
sequential nature of positions in vector space. Various Position
Embeddings (PEs) have been proposed in Transformer-based
architectures [16] to "capture the sequential nature
of positions in vector space." These PEs range from fixed
ad-hoc ways to fully learnable ones. BERT, in particular,
uses fully learnable PEs. In the interest of simplicity, we
decided to reuse the learned PEs to represent the
sentence position embeddings. By doing so, we were able to
create a more accurate and effective model for analyzing
legal documents.</p>
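        <p>The input fusion described above can be illustrated with a minimal sketch. It is a sketch under stated assumptions, not the authors' implementation: the toy hidden size, the random tables standing in for learned embeddings, the function names (make_table, normalized_position, quantile_position, fuse), the 1-based sentence positions, the quantile formula, the clamping of positions that exceed the table, and the choice to add the same sentence position vector to every token are all hypothetical. The sentence position index may be absolute, normalized or K-quantile, the three variants studied in this work.</p>
        <preformat>
```python
import random

HIDDEN = 8      # toy hidden size (BERT Base uses 768)
MAX_POS = 512   # number of rows in BERT's learned position table

random.seed(0)


def make_table(rows, dim):
    """Random stand-in for a learned embedding table (assumption)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def normalized_position(pos, doc_len, max_doc_len):
    """Rescale a 1-based position to the length of the longest document."""
    return round(pos * max_doc_len / doc_len)


def quantile_position(pos, doc_len, k):
    """1-based index of the k-quantile containing the sentence (assumed formula)."""
    return min(k, (pos - 1) * k // doc_len + 1)


def fuse(token_ids, segment_ids, sent_pos, tok_tab, seg_tab, pos_tab):
    """Sum token, segment, token-position and sentence-position vectors.

    The sentence position reuses the token position table, as the text
    describes; clamping to the last row is an assumption of this sketch.
    """
    row = min(sent_pos, len(pos_tab) - 1)
    fused = []
    for i, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors = (tok_tab[tok], seg_tab[seg], pos_tab[i], pos_tab[row])
        fused.append([sum(values) for values in zip(*vectors)])
    return fused


tok_tab = make_table(30, HIDDEN)
seg_tab = make_table(2, HIDDEN)
pos_tab = make_table(MAX_POS, HIDDEN)

# Sentence 25 of a 50-sentence document, longest document has 100 sentences.
sent_pos = normalized_position(25, 50, 100)   # 50, as in the worked example
embeddings = fuse([5, 6, 7], [0, 0, 1], sent_pos, tok_tab, seg_tab, pos_tab)
print(len(embeddings), len(embeddings[0]))    # one vector per input token
```
        </preformat>
        <p>Swapping normalized_position for the raw sentence index or for quantile_position changes only the row that is looked up; the rest of the fusion stays the same.</p>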
        <p>We explored three different ways of representing the
position of sentences within legal documents: absolute position,
normalized position, and K-quantile position.</p>
        <p>Absolute position refers to the location of a sentence
within a particular document in relation to the other
sentences in the same document. For instance, consider
a document that includes:
· Sentence 1: "The court hereby orders the defendant
to appear for a hearing on Monday."
· Sentence 2: "The defendant shall provide all relevant
documents to the plaintiff's attorney by Friday."
In this case, Sentence 1 is the first sentence in the
document and Sentence 2 is the second, so their absolute
positions are 1 and 2 respectively.</p>
        <p>Normalized position converts the position of a sentence
in one document to the corresponding position in the
largest document of the corpus, i.e. the one with the
maximum number of sentences. This ensures consistent
normalization across different documents, by aligning
sentence positions with respect to a common reference
point. For example, consider Document A, which consists
of 50 sentences, and Document B, the largest document in
the corpus, containing 100 sentences. The normalized
position of Sentence 25 from Document A is
25 × 100 / 50 = 50. This aligns Sentence 25 with
the same relative position in Document B.</p>
        <p>K-quantile position. The analysis of legal documents
often involves addressing variations in document length
and unique writing styles of judges. This technique
divides each document into k parts, or quantiles,
to help control the absolute position of each rhetorical
role. For example, consider a Document X divided into
4 quantiles:
· Quantile 1: "The court presents the background of
the case."
· Quantile 2: "The court discusses relevant legal
precedents."
· Quantile 3: "The court analyzes the evidence
presented by both parties."
· Quantile 4: "The court renders its final decision and
issues the judgment."
Dividing the opinion into quantiles, each containing
a specific role or aspect, helps in addressing variations
in document length and unique writing styles of judges.</p>
        <p>Figure 3: BERT-SentPos: A BERT Architecture with Sentence
Position Embeddings fused at the BERT input layer using the
Learnable Position Embedding of Tokens.</p>
        <p>Figure 3 illustrates our proposed architecture, which
utilizes the BERT encoder. We input pairs of
consecutive sentences, denoted as sentence i and sentence i + 1,
along with the sentence position of the
predicted sentence i, to enable our model to learn
contextualized representations that consider the relationships
between neighboring sentences. To capture positional
information, we add sentence position embeddings to the
BERT embeddings. These embeddings provide the
model with the position of each sentence within
the document. The combined representation is passed
through encoders and feed-forward layers to predict
the rhetorical role of sentence i. In the end we
obtained three models: BERT enhanced with Absolute
Sentence Position Embeddings (BERT-AbsPos), BERT
enhanced with 8-Quantile Sentence Position
(BERT-8-QuantilePos), and BERT enhanced with Normalized Sentence
Position (BERT-NorPos).</p>
      </sec>
      <sec>
        <title>3.2. HiBERT: Hierarchical variant to contextualize sentence representations</title>
        <p>In this section, we present HiBERT, a hierarchical variant
of BERT-SentPos described in Section 3.1 (See Figure 4).
The model is based on the hierarchical model of [14, 33].
It aims to label a sentence according to the
sentences that precede and follow it within a certain range. Each
sentence representation is enhanced with sentence
positional information before being encoded by a BERT
encoder. The process generates a top-level representation,
the [CLS] token representation, for each sentence. These sentence
representations are then fed into a 2-layer Transformer
encoder to contextualize a sentence sequence. This
encoder follows the same specifications as BERT, including
hidden units and the number of attention heads.
Finally, we use a dense layer to predict the label of the
sentence in focus in the current sentence sequence.</p>
        <p>We have set the window of surrounding sentences to
a maximum of ±7 sentences. This range size
corresponds to the average number of consecutive sentences
with the same RR label in our train dataset. We experimented
with four variations of HiBERT: HiBERT-AbsPos,
which uses Absolute Sentence Positions with the
maximum window size; HiBERT-NorPos, which uses
Normalized Sentence Positions with the maximum window
size; HiBERT-AbsPosHalf, which uses Absolute Sentence
Positions with half of the maximum window size; and
HiBERT-NorPosHalf, which uses Normalized Sentence
Positions with half of the maximum window size.
Overall, the hierarchical model with more attention
is an effective solution for processing legal documents
that consist of thousands of words. By using a
hierarchical approach and taking into account the context of
the text, the model is able to process lengthy
documents and make accurate classifications based on
the document's content.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental methodology</title>
      <p>We evaluate our contributions in the fine-tuning phase of pre-trained models.</p>
      <sec id="sec-2-2">
        <title>4.1. The LegalEval Dataset</title>
        <p>We utilized the data supplied by Sub-task A "Rhetorical
Roles Prediction" of the SemEval 2023 Task 6 "LegalEval -
Understanding Legal Texts" challenge<sup>5</sup>. The dataset
comprises Indian legal data extracted from court judgments
and includes 13 different RRs, with the details and
definitions for each RR provided in the article by Kalamkar et
al. [8]. The average number of sentences per document
is 117.31.</p>
        <p>To prepare the dataset, we kept the same LegalEval
split (train and validation data). We used 90% of the
train data to train and the remaining 10% to validate the
model. Furthermore, we used the original validation data
to evaluate the performance of the trained models. The
statistics for splitting the corpus are shown in Table 1.</p>
      </sec>
      <sec>
        <title>4.2. Hyperparameters</title>
        <p>For the fine-tuning setup, hyperparameters are
determined through experimentation and analysis. The batch
size is set to 8, taking into consideration the available
computational resources and model performance. The
learning rate is set to 2e-5, which is a commonly used
value for fine-tuning NLP models. The number of epochs
is chosen from {1, 2, 3}, with the final value
selected for each model based on a balance between
training time and model performance.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.3. Performance measures</title>
        <p>The performance of the NLP models for the rhetorical
roles task is assessed using Weighted-Precision,
Weighted-Recall, Accuracy, Weighted-F1
and Macro F1 scores based on the hidden test set.
The weighted F1 score considers both precision and recall,
and it is calculated by taking into account the class-wise
F1 scores weighted by the number of samples in each
class. The Macro F1 score provides an overall assessment
of model performance by calculating the F1 score for
each class independently and then taking the average.
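</p>
        <p>The two F1 variants described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the evaluation script used in the experiments: the function names (per_class_f1, macro_f1, weighted_f1) and the toy label sequences are hypothetical, and the macro average is taken here over every label seen in either the gold or the predicted sequence.</p>
        <preformat>
```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label seen in gold or pred."""
    scores = {}
    for label in set(gold) | set(pred):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the class-wise F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, pred):
    """Class-wise F1 scores weighted by the number of gold samples per class."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)


# Toy example with two rhetorical roles (hypothetical data).
gold = ["FACTS", "FACTS", "FACTS", "RPC"]
pred = ["FACTS", "FACTS", "RPC", "RPC"]
print(round(macro_f1(gold, pred), 3), round(weighted_f1(gold, pred), 3))
```
        </preformat>
        <p>Because the weighted score follows the class distribution, a model that neglects rare roles can keep a high Weighted-F1 while its Macro F1 drops, which is why both are reported.</p>
        <p>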
Table 2 (extraction residue: one unlabeled row, presumably in the order of the measures listed above): 0.79 0.81 0.81 0.79 0.57</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>In this section, we present the experimental results of
our study on injecting different types of sentence
position embeddings in the BERT model (See Table 2). As
mentioned earlier, we used the weighted F1 score as our
primary performance metric. As a baseline, we report the
score obtained by the BERT-HSLN<sup>6</sup> model [12], which
achieved a performance of 79%. Our first experiment
was conducted using BERT with three types of sentence
position embeddings: Absolute position (BERT-AbsPos),
Normalized position (BERT-NorPos), and K-quantile
position (BERT-8-QuantilePos). Our results show that
plain BERT performs poorly (65%) compared to
the proposed models with any of the three types of
position. For the K-quantile position, we experimented
with different numbers of parts and found that the best
division based on performance is with 8 parts. However,
we found that BERT with normalized position achieved
a better score of 75%. We attribute this improvement to
our efforts in controlling the variation in length across
different legal documents.</p>
      <p><sup>6</sup> This work is part of the OpenNyAI (https://opennyai.org/) mission, which
is funded by the EkStep Foundation (https://ekstep.org/).</p>
      <p>To further enhance the recognition
of RRs, we experimented with a hierarchical model
that combines BERT with more attention and a window
size equal to the average number of consecutive sentences with
the same RR label, while also injecting sentence
position information (HiBERT-AbsPos and HiBERT-NorPos).
Unfortunately, we did not observe any significant
improvement of the results. Subsequently, we halved the
window size of sentences taken into
consideration (HiBERT-AbsPosHalf and HiBERT-NorPosHalf).
This led to an improvement in the results, particularly
with absolute position, which reaches 79%. We attribute this success
to the fact that absolute position captures the specific
position of a sentence within a document, providing crucial
contextual information about the predicted sentence based
on its surrounding sentences. This improvement with
a smaller context window can be explained by the fact
that the semantics of a sentence are usually more
dependent on local context than on knowing all sentences in a
paragraph, for example.</p>
      <p>Overall, our experiments demonstrated that using
different types of sentence position information can
significantly improve the performance of BERT on legal
document classification tasks. Additionally, a
hierarchical model that combines BERT with absolute position
information and a window size of 4 sentences (⌈7/2⌉)
can further enhance the performance of our proposed
models. Furthermore, our approach achieved low
computational time, making it efficient and practical for
real-world applications.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and Future Work</title>
      <p>The results of our study indicate that the inclusion of
absolute, normalized, and K-quantile positional
embeddings can significantly improve the performance of both
BERT base and its hierarchical variant for the
Rhetorical Roles Prediction Task. However, there is room for
improvement. It should be noted, for example, that one
potential limitation of this approach is that the number
of sentences in a given document may exceed the
dimensions of the learnable embedding matrix. The framework
proposed by [30] can help us to determine the interest
of learning dedicated sentence position embeddings or
using sinusoidal PEs. In addition, in the context of
token position embeddings, [32] showed that encoding
position in the attention matrix per head results in superior
performance compared to adding position embeddings
to the input. A similar experiment could be conducted on
the sentence position embeddings. Looking ahead, our
future work aims to develop a new architecture with a
greater number of layers that incorporates additional
features such as metadata pertaining to each legal opinion.
Furthermore, we plan to explore the potential benefits of
pre-training instead of fine-tuning, and of fine-tuning with
LegalBERT [14], a BERT-based model pre-trained on
legal text. Our motivation for this direction is based on
the fact that current state-of-the-art models may not be
sufficient to achieve optimal performance for legal NLP
tasks. By incorporating these improvements, we hope to
develop a more robust architecture that achieves optimal
performance for the Rhetorical Roles Prediction Task and
other legal NLP tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <p>This research was funded, in whole or in part, by l’Agence
Nationale de la Recherche (ANR), project ANR-22-CE38-0004.</p>
      </sec>
      <sec id="sec-7-2">
        <p>[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[3] M. Saravanan, B. Ravindran, S. Raman, Automatic identification of rhetorical roles using conditional random fields for legal document summarization, in: Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I, 2008.
[4] P. Bhattacharya, S. Paul, K. Ghosh, S. Ghosh, A. Z. Wyner, Identification of rhetorical roles of sentences in Indian legal judgments, in: The Thirty-second International Conference on Legal Knowledge and Information Systems (JURIX), volume 322, 2019, p. 3.
[5] V. Malik, R. Sanjay, S. K. Guha, A. Hazarika, S. Nigam, A. Bhattacharya, A. Modi, Semantic segmentation of legal documents via rhetorical roles, in: Proceedings of the Natural Legal Language Processing Workshop 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 153–171. URL: https://aclanthology.org/2022.nllp-1.13.
[6] D. Jin, P. Szolovits, Hierarchical neural networks for sequential sentence classification in medical scientific abstracts, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3100–3109. URL: https://aclanthology.org/D18-1349. doi:10.18653/v1/D18-1349.
[7] A. Brack, A. Hoppe, P. Buschermöhle, R. Ewerth, Cross-domain multi-task learning for sequential sentence classification in research papers, in: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, JCDL ’22, Association for Computing Machinery, New York, NY, USA, 2022. URL: https://doi.org/10.1145/3529372.3530922. doi:10.1145/3529372.3530922.
[8] P. Kalamkar, A. Tiwari, A. Agarwal, S. Karn, S. Gupta, V. Raghavan, A. Modi, Corpus for automatic structuring of legal documents, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 4420–4429. URL: https://aclanthology.org/2022.lrec-1.470.
[9] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, 2020. arXiv:2004.05150.
[10] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.
[11] X. Ouyang, S. Wang, C. Pang, Y. Sun, H. Tian, H. Wu, H. Wang, ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 27–38. URL: https://aclanthology.org/2021.emnlp-main.3. doi:10.18653/v1/2021.emnlp-main.3.
[12] A. Modi, P. Kalamkar, S. Karn, A. Tiwari, A. Joshi, S. K. Tanikella, S. K. Guha, S. Malhan, V. Raghavan, SemEval 2023 task 6: LegalEval - understanding legal texts, 2023. arXiv:2304.09548.
[13] H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, M. Sun, How does NLP benefit legal system: A summary of legal artificial intelligence, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5218–5230. URL: https://aclanthology.org/2020.acl-main.466. doi:10.18653/v1/2020.acl-main.466.</p>
        <p>[14] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.
[15] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, N. Aletras, LexGLUE: A benchmark dataset for legal language understanding in English, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for
Computational Linguistics, Dublin, Ireland, 2022, Comput. Surv. (2023). URL: https://doi.org/10.1145/
pp. 4310–4330. URL: https://aclanthology.org/2022. 3586074. doi:1 0 . 1 1 4 5 / 3 5 8 6 0 7 4 , just Accepted.
acl-long.297. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 2 . a c l - l o n g . 2 9 7 . [24] M. Lippi, P. Torroni, Argumentation mining: State
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, of the art and emerging trends, ACM Trans.
InterL. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, net Technol. 16 (2016). URL: https://doi.org/10.1145/
Attention is all you need, in: I. Guyon, U. V. 2850417. doi:1 0 . 1 1 4 5 / 2 8 5 0 4 1 7 .</p>
        <p>Luxburg, S. Bengio, H. Wallach, R. Fergus, [25] H. Yamada, S. Teufel, T. Tokunaga,
BuildS. Vishwanathan, R. Garnett (Eds.), Advances ing a corpus of legal argumentation in
in Neural Information Processing Systems, vol- japanese judgement documents: towards
ume 30, Curran Associates, Inc., 2017. URL: https: structure-based summarisation, Artificial
In//proceedings.neurips.cc/paper_files/paper/2017/ telligence and Law 27 (2019) 141–170. URL:
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. https://doi.org/10.1007/s10506-019-09242-3.
[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, doi:1 0 . 1 0 0 7 / s 1 0 5 0 6 - 0 1 9 - 0 9 2 4 2 - 3 .</p>
        <p>O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, [26] J. W. G. Putra, S. Teufel, T. Tokunaga,
ParsRoberta: A robustly optimized bert pretraining ap- ing argumentative structure in
English-as-foreignproach, 2019. a r X i v : 1 9 0 7 . 1 1 6 9 2 . language essays, in: Proceedings of the 16th
Work[18] Y. Jernite, S. R. Bowman, D. Sontag, Discourse- shop on Innovative Use of NLP for Building
Edbased objectives for fast unsupervised sentence rep- ucational Applications, Association for
Computaresentation learning, 2017. a r X i v : 1 7 0 5 . 0 0 5 5 7 . tional Linguistics, Online, 2021, pp. 97–109. URL:
[19] D. Iter, K. Guu, L. Lansing, D. Jurafsky, Pretrain- https://aclanthology.org/2021.bea-1.10.
ing with contrastive sentence objectives improves [27] M. Pagliardini, P. Gupta, M. Jaggi, Unsupervised
discourse performance of language models, in: Pro- learning of sentence embeddings using
composiceedings of the 58th Annual Meeting of the As- tional n-gram features, in: Proceedings of the
sociation for Computational Linguistics, Associa- 2018 Conference of the North American Chapter
tion for Computational Linguistics, Online, 2020, of the Association for Computational Linguistics:
pp. 4859–4870. URL: https://aclanthology.org/2020. Human Language Technologies, Volume 1 (Long
acl-main.439. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 0 . a c l - m a i n . 4 3 9 . Papers), Association for Computational Linguistics,
[20] L. Yang, M. Zhang, C. Li, M. Bendersky, M. Na- New Orleans, Louisiana, 2018, pp. 528–540. URL:
jork, Beyond 512 tokens: Siamese multi-depth https://aclanthology.org/N18-1049. doi:1 0 . 1 8 6 5 3 /
transformer-based hierarchical encoder for long- v 1 / N 1 8 - 1 0 4 9 .
form document matching, in: Proceedings of the [28] A. Cohan, I. Beltagy, D. King, B. Dalvi, D. Weld,
Pre29th ACM International Conference on Information trained language models for sequential sentence
amp; Knowledge Management, CIKM ’20, Associa- classification, in: Proceedings of the 2019
Confertion for Computing Machinery, New York, NY, USA, ence on Empirical Methods in Natural Language
2020, pp. 1725–1734. URL: https://doi.org/10.1145/ Processing and the 9th International Joint
Con3340531.3411908. doi:1 0 . 1 1 4 5 / 3 3 4 0 5 3 1 . 3 4 1 1 9 0 8 . ference on Natural Language Processing
(EMNLP[21] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, IJCNLP), Association for Computational
LinguisE. Hovy, Hierarchical attention networks for tics, Hong Kong, China, 2019, pp. 3693–3699. URL:
document classification, in: Proceedings of the https://aclanthology.org/D19-1383. doi:1 0 . 1 8 6 5 3 /
2016 Conference of the North American Chapter v 1 / D 1 9 - 1 3 8 3 .
of the Association for Computational Linguistics: [29] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N.
Human Language Technologies, Association for Dauphin, Convolutional sequence to sequence
Computational Linguistics, San Diego, California, learning, in: International conference on machine
2016, pp. 1480–1489. URL: https://aclanthology.org/ learning, PMLR, 2017, pp. 1243–1252.</p>
        <p>N16-1174. doi:1 0 . 1 8 6 5 3 / v 1 / N 1 6 - 1 1 7 4 . [30] B. Wang, L. Shan, C. Lioma, X. Jiang, H. Yang, Q. Liu,
[22] M. Lukasik, B. Dadachev, K. Papineni, G. Simões, J. Simonsen, On position embeddings in bert, 2021,
Text segmentation by cross segment attention, pp. 1–21. 9th International Conference on Learning
in: Proceedings of the 2020 Conference on Em- Representations - ICLR 2021 ; Conference date:
03pirical Methods in Natural Language Processing 05-2021 Through 07-05-2021.
(EMNLP), Association for Computational Linguis- [31] T. Lin, Y. Wang, X. Liu, X. Qiu, A survey
tics, Online, 2020, pp. 4707–4716. URL: https: of transformers, AI Open 3 (2022) 111–132.
//aclanthology.org/2020.emnlp-main.380. doi:1 0 . URL: https://www.sciencedirect.com/science/
1 8 6 5 3 / v 1 / 2 0 2 0 . e m n l p - m a i n . 3 8 0 . article/pii/S2666651022000146. doi:h t t p s :
[23] Q. Fournier, G. M. Caron, D. Aloise, A practical / / d o i . o r g / 1 0 . 1 0 1 6 / j . a i o p e n . 2 0 2 2 . 1 0 . 0 0 1 .
survey on faster and lighter transformers, ACM [32] P.-C. Chen, H. Tsai, S. Bhojanapalli, H. W. Chung,
Y.-W. Chang, C.-S. Ferng, A simple and efective
positional encoding for transformers, in:
Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, Association
for Computational Linguistics, Online and Punta
Cana, Dominican Republic, 2021, pp. 2974–2988.</p>
        <p>URL: https://aclanthology.org/2021.emnlp-main.</p>
        <p>236. doi:10.18653/v1/2021.emnlp- main.236.
[33] I. Chalkidis, M. Fergadiotis, D. Tsarapatsanis,</p>
        <p>N. Aletras, I. Androutsopoulos, P. Malakasiotis,
Paragraph-level rationale extraction through
regularization: A case study on European court of
human rights cases, in: Proceedings of the 2021
Conference of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies, Association for
Computational Linguistics, Online, 2021, pp. 226–241.</p>
        <p>URL: https://aclanthology.org/2021.naacl-main.22.
doi:10.18653/v1/2021.naacl- main.22.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>