<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>G. Comandé);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automatic Rhetorical Roles Classification for Legal Documents using LEGAL-TransformerOverBERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriele Marino</string-name>
          <email>gabriele.marino@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Licari</string-name>
          <email>daniele.licari@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Praveen Bushipaka</string-name>
          <email>praveen.bushipaka@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Comandé</string-name>
          <email>giovanni.comande@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Networks (CNN)</institution>
          ,
          <addr-line>and Long-Short Term Memory, LSTM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rhetorical Roles Classification, LEGAL-BERT</institution>
          ,
          <addr-line>Hierarchical Transformers, LEGAL-ToBERT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Scuola Superiore Sant'Anna</institution>
          ,
          <addr-line>P.zza dei Martiri della Libertà, Pisa, 56100</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Automatic identification of rhetorical roles can help in many downstream applications of legal document analysis, such as the summarization of legal decisions and legal search. This is usually a complex task, even for humans, due to its inherent subjectivity and to the difficulty of capturing sentence context in very long legal documents. We propose a novel approach, based on Hierarchical Transformers, which overcomes these problems and achieves promising results on two different datasets of Italian and English legal judgments. Specifically, we introduce LEGAL-TransformerOverBERT (LEGAL-ToBERT), a model based on the stacking of a transformer encoder over a legal-domain-specific BERT model, and show that our approach significantly improves the baselines set by the stand-alone LEGAL-BERT models by capturing the relationships between different sentences of the same document. We make our models available and ready-to-use for downstream applications of rhetorical roles classification in the legal context, for both the Italian and English languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Proceedings of the Sixth Workshop on Automated Semantic Analysis of</p>
      <p>that we named LEGAL-TransformerOverBERT (LEGAL-ToBERT). Our approach is based on the stacking of a transformer encoder on top of a legal-domain-specific BERT model, creating a hierarchical architecture able to capture the discursive relationships between sentences, allowing accurate classification of rhetorical roles. We also propose a novel positional encoding strategy for the upper-layer transformer of ToBERT, based on the sinusoidal encoding of the relative position of a sentence in the document, and show that this is preferable when dealing with RRC in the legal context.</p>
      <p>As a proof of the effectiveness of our approach, we tested our model using two different datasets. The first one is a new yet confidential Italian-language dataset that we built specifically for this task and named ITA-RhetRoles; the second one is the English-language BUILD benchmark dataset [5].</p>
      <p>We used respectively Italian-LEGAL-BERT [9] and LEGAL-BERT [12] as building blocks for LEGAL-ToBERT. We then compared the results of LEGAL-ToBERT with those of the stand-alone Italian-LEGAL-BERT and LEGAL-BERT, and found that LEGAL-ToBERT allows for significantly better performances on both datasets, improving the baseline MCC respectively by 21% and 30%.</p>
      <p>We make all our code and models publicly available and ready-to-use for downstream applications of legal RRC on our Rhetorical Roles Classification GitHub repository<sup>1</sup>.</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>performances of standard transformers when dealing
with long texts [19, 20, 21].</p>
      <p>Our experiments address RRC using a hierarchical transformer architecture based on legal-domain-specific BERT models. To the best of our knowledge, this is the first attempt to combine these two worlds and use them to build a refined model for RRC in the legal domain. Our models are available and ready-to-use for both the Italian and English languages. This is the first time that a fine-tuned model is made available for RRC of legal documents in the Italian language: it is our sincere hope that this will enable many downstream applications, helping to speed up the work of Italian jurists.</p>
      <p>In spite of the increasing research in applications of Artificial Intelligence to the legal domain, only a few works have focused on RRC. One of the earliest works with this aim can be traced back to Hachey et al. [13], in which handmade annotated sentences were used to train traditional Machine Learning algorithms such as Naive Bayes and SVM. Moens et al. [14] used Multinomial Naive Bayes classifiers and Maximum Entropy models to address the problem of argument detection in legal texts, as a particular case of RRC. Saravanan et al. [15] employed Conditional Random Fields (CRF) to automatize the RRC of legal documents and used the predicted rhetorical roles to rank each sentence and enable a subsequent extractive summarization task. More recently, a work by Ghosh et al. [8] used Hierarchical BiLSTM classifiers with the addition of a CRF to improve the stand-alone CRF baseline for RRC of Indian legal judgments. Starting from the results of this work, Malik et al. [16] proposed a Multi Task Learning (MTL) framework based on the same Hierarchical BiLSTM with CRF model to significantly improve the classification scores. Another noteworthy work by Walker et al. [<xref ref-type="bibr" rid="ref6">6</xref>] investigated the use of ML and rule-based approaches for RRC tasks, and interestingly found that both approaches can lead to very promising results with a small dataset of manually labeled sentences.</p>
      <p>3. Methodology</p>
      <p>3.1. Rhetorical Roles Datasets</p>
      <p>We used two different datasets to compare the performances of our hierarchical model with those of vanilla BERT models. The first one is a novel dataset that we developed for this work and that we named ITA-RhetRoles, the second one is the BUILD benchmark dataset [<xref ref-type="bibr" rid="ref5">5</xref>]. Table 1 shows an overview of the two datasets in terms of number of documents and total sentences; both datasets are described in more detail in the following sections.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption><p>Number of documents and total number of sentences for ITA-RhetRoles and BUILD datasets.</p></caption>
        <table>
          <thead>
            <tr><th>Split</th><th>ITA-RhetRoles #Docs</th><th>ITA-RhetRoles #Sents</th><th>BUILD #Docs</th><th>BUILD #Sents</th></tr>
          </thead>
          <tbody>
            <tr><td>Train / Valid.</td><td>1104495</td><td>698,6,02102</td><td>22241</td><td>235,2,73542</td></tr>
            <tr><td>Test</td><td>294</td><td>18,288</td><td>30</td><td>2,879</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>With the advent of deep learning and transformer models [17], neural methods have been applied to RRC, significantly improving the results with respect to previous works. Bhattacharya et al. [18] experimented on cross-jurisdictional legal document datasets with various models including Hierarchical BiLSTM and GRU with the addition of a CRF and with the integration of an attention mechanism. They compared these models with LEGAL-BERT [12], a legal-domain-specific pre-trained transformer, which outperformed the other traditional machine learning algorithms, suggesting further investigation in the direction of transformer applications to RRC in the legal domain.</p>
      <p>Some other works have shown how hierarchical transformers architectures can be employed to improve the</p>
      <p><sup>1</sup> https://github.com/GM862001/RhetoricalRolesClassification</p>
      <p>3.1.1. ITA-RhetRoles Dataset</p>
      <p>ITA-RhetRoles is a dataset of civil law Italian legal cases. This dataset has been kept private as it was built under a confidentiality agreement between Scuola Superiore Sant’Anna and some Italian courts. ITA-RhetRoles consists of approximately 1,500 Italian legal documents, split into train, validation, and test set using the year and the subject of the case as stratification keys. Figure 1 shows the dataset distribution in terms of document length: the longest document of the dataset consists of 248 sentences. The labelled rhetorical roles are the 5 most common sections of an Italian civil judgment: "Introduction" (INT), "Conclusions of the parties" (CP), "Summary of the appealed judgment" (SAJ), "Legal reasons" (LR), and "Decisional content" (DC). These labels were extracted using regular expressions to identify the different sections in the collected documents. Handmade validation was then</p>
      <sec id="sec-2-1">
        <title>3.1.2. BUILD Dataset</title>
        <p>BUILD dataset is a corpus of legal judgment documents from the Supreme Court of India, High Courts in different Indian states and some district-level courts. It consists of a publicly released train and validation set<sup>2</sup> and a private test set. We used the public validation set as test set and split the original train set into a train and validation set. Figure 2 shows the dataset distribution in terms of document length: the longest document of the dataset consists of 386 sentences. The labelled rhetorical roles are 13: "Preamble" (PRE), "Facts" (FAC), "Ruling by Lower Court" (RLC), "Issues" (ISSUE), "Argument by petitioner" (ARGP), "Argument by respondent" (ARGR), "Analysis" (ANA), "Statute" (STA), "Precedent relied" (PRER), "Precedent not relied" (PRENR), "Ratio of the decision" (RAT), "Ruling by Present Court" (RPC), "None of the others" (NONE).</p>
        <p><sup>2</sup> https://github.com/Legal-NLP-EkStep/rhetorical-role-baseline</p>
        <p>3.2. TransformerOverBERT (ToBERT)</p>
        <p>TransformerOverBERT (ToBERT) has a hierarchical architecture, shown in Figure 3, based on the stacking of the following components: a BERT token-level encoder, a sentence-level positional encoder, a sentence-level encoder, and a prediction layer. The processing of a legal case starts with splitting the raw text of the document into sentences and tokenizing them. Each sentence is fed to the BERT token-level encoder and the pooled output for that sentence, i.e. the hidden representation of the [CLS] token output by BERT, is extracted.</p>
        <p>The pooled outputs are gathered and fed to the positional layer to create a position-dependent encoding of each sentence in the document. These are then input into the sentence-level encoder. The output representations of this layer are finally fed to the prediction layer for rhetorical roles classification. Each of these components is described in the following sections.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2.1. Data Preprocessing</title>
        <p>Before being input to ToBERT, each document is split into sentences. Each sentence is tokenized using the token-level BERT tokenizer and then padded or truncated to a fixed number $T$ of tokens. Documents are also padded with null sentences up to the length of the longest document of the train set (e.g., 386 for the BUILD dataset and 284 for the ITA-RhetRoles dataset), so as to have a batch of input documents $\mathcal{I} \in \mathbb{R}^{B \times S \times T \times E}$, where $B$ is the number of documents, $S$ is the number of sentences for each document, and $E$ is the size of the token embeddings.</p>
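        <p>The padding scheme described in this section can be sketched as follows. This is a minimal illustration and not the released implementation: it works on lists of token ids rather than on embeddings, the function name <monospace>pad_document</monospace>, the padding id 0, the sentence mask, and the truncation of over-long documents are all assumptions made for the example.</p>
        <preformat>
```python
def pad_document(sentences, max_sentences, max_tokens, pad_id=0):
    """Pad or truncate each sentence to max_tokens and the document to max_sentences."""
    padded = []
    # Assumption: documents longer than max_sentences are simply cut.
    for token_ids in sentences[:max_sentences]:
        clipped = list(token_ids[:max_tokens])
        clipped += [pad_id] * (max_tokens - len(clipped))
        padded.append(clipped)
    # The mask marks real sentences with 1 and null (padding) sentences with 0.
    sentence_mask = [1] * len(padded)
    while max_sentences > len(padded):
        padded.append([pad_id] * max_tokens)
        sentence_mask.append(0)
    return padded, sentence_mask


# Toy document: three "tokenized" sentences of different lengths (ids are made up).
doc = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
padded_doc, mask = pad_document(doc, max_sentences=5, max_tokens=6)
```
        </preformat>
        <p>Keeping a sentence mask next to the padded document is one simple way to let later components ignore the null sentences.</p>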
      </sec>
      <sec id="sec-2-3">
        <title>3.2.2. BERT Token-Level Encoder</title>
        <p>Bidirectional Encoder Representations from Transformers (BERT) is a neural model based on the transformer architecture [17]. It uses self-attention, residual connections, and layer normalization to achieve state-of-the-art results in many different tasks, with the addition of a task-specific output layer as the only modification to the model architecture [22].</p>
        <p>BERT-like models are usually pre-trained via self-supervised methods on large unlabelled corpora and then fine-tuned for the specific task in a supervised fashion. Our approach is not different, in that we leverage two different pre-trained BERT models: Italian-LEGAL-BERT [9] and LEGAL-BERT [12]. Both these models are pre-trained on huge legal datasets consisting of Italian and English cases respectively: our training process aimed only to fine-tune them for our RRC use case.</p>
        <p>In ToBERT, BERT is used as a token-level encoder. Specifically, it is used to obtain the hidden token representation [CLS] of each batch sequence. It means that it is fed with $B$ batches of sentences in $\mathbb{R}^{S \times T \times E}$ and produces as output a set of $B$ document representations in $\mathbb{R}^{S \times H}$, where $H$ is the hidden size of the specific BERT model used (e.g., 768 for LEGAL-BERT).</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.2.3. Sentence-Level Positional Encoder</title>
        <p>A specific positional encoder is used to add to the representation of each sentence a piece of information about its position in the document.</p>
        <p>In this work, we focus on Sinusoidal Positional Embeddings. Let us define the input document length (i.e. the number of sentences in the document) as $S$ and the embedding dimension as $H$ (e.g., 768 for LEGAL-BERT). For the $t$-th sentence representation $x_t \in \mathbb{R}^{H}$ of a document (with $0 \le t &lt; S$), the output of a sinusoidal positional encoder is:</p>
        <p>$$x'_t = x_t + p_t,$$</p>
        <p>where the $i$-th component of the embedding vector $p_t \in \mathbb{R}^{H}$ (for $0 \le i &lt; H$) is given by:</p>
        <p>$$p_t^{(i)} = \begin{cases} \sin(\alpha\,\omega_i\,t), &amp; \text{if } i \text{ is even} \\ \cos(\alpha\,\omega_i\,t), &amp; \text{if } i \text{ is odd} \end{cases}$$</p>
        <p>where $\omega_i = 1 / 10000^{2 \lfloor i/2 \rfloor / H}$ and $\alpha$ is a weight that depends on the embedding strategy.</p>
        <p>We tried two different approaches. The first one, which we named Absolute Positional Embedding, is the same used in the original Transformer architecture [17], and uses the weight $\alpha = 1$. The second one is a novel embedding strategy that we named Relative Positional Embedding. This one takes into account the relative position of a sentence in the document and uses the weight $\alpha = 1000 / L$, where $L$ is the length of the document to which the sentence belongs<sup>3</sup>. Basically, instead of encoding the absolute position of a sentence, this embedding strategy encodes the relative position of that sentence with respect to the length of the document in thousandths (‰), using standard Sinusoidal Positional Embeddings.</p>
        <p>The idea behind this approach is that legal documents often have a repetitive rhetorical structure (introductory sentences always come first, followed by sentences summarizing the final decision, and so on). Through a preliminary exploratory data analysis we found that there is in fact a correlation between the rhetorical role of a sentence and its relative position in the document. This dependency might rely on the specific language and legal field of the document, but including such a piece of information in the positional encoding of a sentence might add valuable hints for its correct rhetorical role classification.</p>
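        <p>The difference between the Absolute and the Relative Positional Embedding strategies can be illustrated with a short sketch. It is not the released implementation: the function names and the <monospace>scale</monospace> parameter are hypothetical, and the frequency exponent follows the original Transformer formulation, which is assumed here. The only change between the two strategies is that the relative one rescales the sentence position to thousandths of the unpadded document length before applying the standard sinusoidal encoding.</p>
        <preformat>
```python
import math


def sinusoidal_position(t, hidden_size, scale=1.0):
    """Sinusoidal embedding of position t, after rescaling the position by `scale`."""
    position = scale * t
    vector = []
    for i in range(hidden_size):
        # Frequency as in the original Transformer (assumed): shared by each sin/cos pair.
        omega = 1.0 / (10000 ** (2 * (i // 2) / hidden_size))
        angle = omega * position
        vector.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vector


def absolute_embedding(t, hidden_size):
    # The position is the plain sentence index.
    return sinusoidal_position(t, hidden_size, scale=1.0)


def relative_embedding(t, doc_length, hidden_size):
    # The position is expressed in thousandths of the (unpadded) document length.
    return sinusoidal_position(t, hidden_size, scale=1000.0 / doc_length)


# The middle sentence of a 10-sentence document and of a 100-sentence document
# get the same relative embedding, while their absolute embeddings differ.
short_mid = relative_embedding(5, 10, 8)
long_mid = relative_embedding(50, 100, 8)
```
        </preformat>
        <p>With this choice, sentences that occupy the same fraction of their documents are encoded alike regardless of document length, which is the property the relative strategy relies on.</p>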
      </sec>
      <sec id="sec-2-5">
        <title>3.2.4. Sentence-Level Encoder</title>
        <p>The sentence-level encoder is a transformer model [17] with the same configuration as the transformer encoders of the token-level BERT encoder (768 hidden dimensions, 12 attention heads, GELU activation function, and so on), but with only 2 stacked encoder layers. It is used to process the batch of document representations $X \in \mathbb{R}^{B \times S \times H}$ output by the positional encoder. The output produced by this component has the same shape as its input and is a batch of document representations that takes into account the relationships between the sentences of each document. The advantage of using a transformer encoder over recurrent architectures like LSTMs is that of better capturing long-distance relationships between sentences, thanks to the multi-head attention mechanism. This algorithm involves four main steps:</p>
        <p>1. Input: a document representation $D \in \mathbb{R}^{S \times H}$, where $S$ is the number of sentences of $D$ and $H$ is the model’s hidden size.</p>
        <p>2. Linear transformations: the attention function is applied in parallel using $n_h = 12$ attention heads. For each attention head $i$, $D$ is projected into three different spaces: the key space $K_i \in \mathbb{R}^{S \times d_k}$, the query space $Q_i \in \mathbb{R}^{S \times d_k}$, and the value space $V_i \in \mathbb{R}^{S \times d_v}$. These projections are computed using learned weight matrices $W_i^K \in \mathbb{R}^{H \times d_k}$, $W_i^Q \in \mathbb{R}^{H \times d_k}$, and $W_i^V \in \mathbb{R}^{H \times d_v}$, where $d_k = d_v = 64$ (i.e. $H / n_h$).</p>
        <p>3. Scaled dot-product attention and softmax: for each attention head $i$, the attention scores are computed by taking the dot product of query $Q_i$ and key $K_i$, scaling by the square root of the key dimension $d_k$, and then applying a softmax function to normalize the scores. Finally, the normalized scores are multiplied by the value matrix $V_i$ to obtain the attention output matrix $Z_i$:</p>
        <p>$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$</p>
        <p>4. Output: after computing the output $Z_i$ for each attention head, these are concatenated along their last dimension. Finally, a linear projection is applied using a learned weight matrix $W^O \in \mathbb{R}^{n_h d_v \times H}$ to obtain the final output of the Multi-Head Attention layer:</p>
        <p>$$Z = \mathrm{concat}(Z_1, Z_2, \ldots, Z_{n_h})\, W^O$$</p>
        <p>This sentence-level multi-head attention mechanism allows the model to capture different types of relationships between sentences by learning separate attention patterns for each head. Stacking multiple encoder layers, in turn, allows the model to learn increasingly abstract representations of the input sequence.</p>
        <p>Similar to the transformer encoders used in the BERT architecture, this model includes a dropout layer as a regularization technique to prevent overfitting. Dropout randomly shuts down some of the neurons in the network during training, sampling from a Bernoulli distribution with some probability $p$ (which is equal to 0.1 in the case of BERT), forcing the remaining neurons to learn more robust features that are not dependent on the presence of other units.</p>
        <p><sup>3</sup> We do not take into account the padding sentences here.</p>
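        <p>As an illustration of the scaled dot-product attention step of the sentence-level encoder, the following sketch computes a single attention head over the sentences of one toy document. It is written in plain Python for readability and is not the PyTorch implementation used in the experiments; the helper names, the 3-sentence document, and the identity projection matrices are hypothetical.</p>
        <preformat>
```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(doc, w_q, w_k, w_v):
    """One head: softmax(Q K^T / sqrt(d_k)) V over the sentences of a document."""
    q, k, v = matmul(doc, w_q), matmul(doc, w_k), matmul(doc, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Toy document: 3 sentence vectors with hidden size 2; identity projections.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention_head(doc, identity, identity, identity)
```
        </preformat>
        <p>Each row of <monospace>weights</monospace> is the attention distribution of one sentence over all sentences of the document; a full multi-head layer would run several such heads in parallel, concatenate their outputs, and apply the final linear projection.</p>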
        <p>3.2.5. Prediction Layer</p>
        <p>The prediction layer input is the batch of document representations $X' \in \mathbb{R}^{B \times S \times H}$ as output by the sentence-level encoder. This is fed to a linear layer with $C$ output units, $C$ being the number of labels (i.e. rhetorical roles), and then goes through a dropout layer for regularization purposes. The final output $Y \in \mathbb{R}^{B \times S \times C}$ is the rhetorical roles classification logits. If labels are provided (e.g. during training) this layer computes and returns the cross entropy loss between the logits and the labels, filtering out the inactive tokens (i.e. the padding ones).</p>
        <p>4.1. Models</p>
        <p>We used Italian-LEGAL-BERT [9] and LEGAL-BERT [12] (the baseline models) to provide a baseline respectively for ITA-RhetRoles and BUILD datasets. Specifically, each of them was chosen as the encoder of an AutoModelForSequenceClassification from the HuggingFace Transformers Python package [23]. We coupled each model with the corresponding AutoTokenizer, and we applied truncation and padding using $T = 64$ as the max sentence length. As described in section 3.2.1, we also padded the documents with null sentences up to the length of the longest document for each dataset (386 sentences for BUILD dataset, 284 sentences for ITA-RhetRoles dataset).</p>
        <p>After having set a baseline for both datasets, we used the very same BERT models as the token-level encoders of ToBERT, and used ToBERT itself as the encoder of an AutoModelForTokenClassification, keeping the same tokenizers and the same truncation max sequence length. As sentence-level encoder we used 2 stacked encoder layers from the PyTorch transformer model.</p>
        <p>4.2. Training and Hyperparameters Fine-Tuning</p>
        <p>We trained all our models using a PyTorch linear scheduler based on the AdamW optimizer, leveraging the Gradient Scaler from the CUDA Automatic Mixed Precision package. When training the baseline models we set the batch size to 128, while we used one-document batches to train ToBERT. In both cases, we accumulated gradients every 3 steps. We set the maximum number of epochs to 20, while also using early stopping with 2 patience steps. All other relevant hyperparameters were fine-tuned.</p>
        <p>We used the Optuna Python package for hyperparameters fine-tuning [24]. This is an automated and efficient optimization framework offering a versatile define-by-run API for the hyperparameters space.</p>
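        <p>The loss computed by the prediction layer, which filters out padding sentences, can be sketched as a masked cross-entropy. The sketch below is an illustration in plain Python and not the authors' code: the function name and the toy logits are hypothetical, and the use of -100 as the label of padding sentences is an assumption borrowed from the default ignore index of PyTorch's cross-entropy loss.</p>
        <preformat>
```python
import math


def masked_cross_entropy(logits, labels, ignore_label=-100):
    """Mean cross-entropy over real sentences; padding sentences are skipped."""
    losses = []
    for sentence_logits, label in zip(logits, labels):
        if label == ignore_label:
            continue  # null (padding) sentence: contributes nothing to the loss
        peak = max(sentence_logits)
        # Numerically stable log-sum-exp of the logits.
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in sentence_logits))
        losses.append(log_norm - sentence_logits[label])
    return sum(losses) / len(losses) if losses else 0.0


# One toy document: two real sentences and one padding sentence (made-up logits).
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
labels = [0, 1, -100]
loss = masked_cross_entropy(logits, labels)
```
        </preformat>
        <p>Because the padding sentence is skipped, its logits have no influence on the loss, whatever values they take.</p>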
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>Our experiments aimed to provide a baseline for both ITA-RhetRoles and BUILD datasets using legal-domain-specific BERT models and to improve them using LEGAL-ToBERT. When evaluating our models we considered the following metrics: accuracy, Matthews Correlation Coefficient (MCC), micro and macro precision, micro and macro recall, micro and macro F1.</p>
      <p>When training our baseline models we considered the following hyperparameters space:</p>
      <p>• Learning rate ∈ [5e-6, 5e-4];
• Weight decay ∈ [1e-3, 1e-1].</p>
      <p>To these hyperparameters, we added the following ones when training ToBERT:</p>
      <p>• Sentence-level positional embedding strategy (S_lv_pos_emb): either absolute or relative;
• Sentence-level encoder dropout (S_lv_enc_dropout) ∈ [0.1, 0.7];
• Sentence-level encoder feed-forward network size (S_lv_enc_FFN_size) ∈ {50, 51, ..., 1000}.</p>
      <p>We used the TPE (Tree-structured Parzen Estimator) algorithm proposed by Bergstra et al. [25] for hyperparameters optimization. This method has been shown to outperform many competing ones, including random search and grid search, in terms of efficiency and effectiveness. By fitting two separate Gaussian Mixture Models (GMMs) to the best and worst objective values, TPE estimates the density of the promising and unpromising regions separately, and guides the search accordingly. On each trial, TPE samples a new set of candidate hyperparameters by maximizing the ratio $l(x)/g(x)$, where $l(x)$ is the density estimate of "good" hyperparameters combinations and $g(x)$ is the density estimate of "bad" hyperparameters combinations. The candidate hyperparameters with the highest ratio are then evaluated using the objective function, and the process is repeated.</p>
      <p>For each dataset, we performed 32 search trials minimizing the validation loss, and picked the best model for final testing. Table 2 shows the best hyperparameters combination for LEGAL-BERT and LEGAL-ToBERT when trained on ITA-RhetRoles and BUILD datasets.</p>
      <p>It is interesting to notice that in both cases the relative embedding strategy was preferred to the absolute one when training LEGAL-ToBERT. This suggests that it is useful to include relative position information in the positional embeddings of the sentences, to leverage the correlation between this feature and their rhetorical role, due to the repetitive rhetorical structure of a legal document as a whole.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Model</th><th>Parameter</th><th>Value</th></tr>
          </thead>
          <tbody>
            <tr><td>ITA-RhetRoles</td><td>LEGAL-BERT</td><td>Learning rate</td><td>6.49e-05</td></tr>
            <tr><td>ITA-RhetRoles</td><td>LEGAL-BERT</td><td>Weight decay</td><td>5.35e-02</td></tr>
            <tr><td>ITA-RhetRoles</td><td>LEGAL-ToBERT</td><td>Learning rate</td><td>8.32e-05</td></tr>
            <tr><td>ITA-RhetRoles</td><td>LEGAL-ToBERT</td><td>Weight decay</td><td>6.93e-02</td></tr>
            <tr><td>ITA-RhetRoles</td><td>LEGAL-ToBERT</td><td>S_lv_pos_emb</td><td>relative</td></tr>
            <tr><td>ITA-RhetRoles</td><td>LEGAL-ToBERT</td><td>S_lv_enc_dropout</td><td>0.26</td></tr>
            <tr><td>ITA-RhetRoles</td><td>LEGAL-ToBERT</td><td>S_lv_enc_FFN_size</td><td>167</td></tr>
            <tr><td>BUILD</td><td>LEGAL-BERT</td><td>Learning rate</td><td>7.03e-05</td></tr>
            <tr><td>BUILD</td><td>LEGAL-BERT</td><td>Weight decay</td><td>9.16e-02</td></tr>
            <tr><td>BUILD</td><td>LEGAL-ToBERT</td><td>Learning rate</td><td>7.54e-05</td></tr>
            <tr><td>BUILD</td><td>LEGAL-ToBERT</td><td>Weight decay</td><td>8.36e-02</td></tr>
            <tr><td>BUILD</td><td>LEGAL-ToBERT</td><td>S_lv_pos_emb</td><td>relative</td></tr>
            <tr><td>BUILD</td><td>LEGAL-ToBERT</td><td>S_lv_enc_dropout</td><td>0.13</td></tr>
            <tr><td>BUILD</td><td>LEGAL-ToBERT</td><td>S_lv_enc_FFN_size</td><td>968</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We evaluated our approach for legal RRC on both the ITA-RhetRoles and BUILD datasets. Our analysis aims to compare the results of LEGAL-ToBERT with the baselines provided by vanilla stand-alone LEGAL-BERT models, both in overall terms and with respect to each considered rhetorical role.</p>
      <p>5.1. ITA-RhetRoles</p>
      <p>Table 3 lists the results of the best models selected by the hyperparameters fine-tuning process on the ITA-RhetRoles test dataset. LEGAL-ToBERT achieves an almost perfect score in each considered metric (all of them remain above 97%), significantly outperforming LEGAL-BERT. In particular, LEGAL-ToBERT achieves a macro F1 score of 0.98 and an MCC of 0.972, improving the baselines set by LEGAL-BERT by 12% and 21% respectively.</p>
      <table-wrap>
        <label>Table 3</label>
        <caption><p>Test results for ITA-RhetRoles dataset.</p></caption>
        <table>
          <thead>
            <tr><th>Metric</th><th>LEGAL-BERT</th><th>LEGAL-ToBERT</th></tr>
          </thead>
          <tbody>
            <tr><td>Accuracy</td><td>0.872</td><td>0.982</td></tr>
            <tr><td>MCC</td><td>0.806</td><td>0.972</td></tr>
            <tr><td>F1 (Macro)</td><td>0.878</td><td>0.980</td></tr>
            <tr><td>F1 (Micro)</td><td>0.872</td><td>0.982</td></tr>
            <tr><td>P (Macro)</td><td>0.871</td><td>0.979</td></tr>
            <tr><td>P (Micro)</td><td>0.872</td><td>0.982</td></tr>
            <tr><td>R (Macro)</td><td>0.889</td><td>0.980</td></tr>
            <tr><td>R (Micro)</td><td>0.872</td><td>0.982</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 5 lists the results of the best models selected by the hyperparameters fine-tuning process on the BUILD test dataset. LEGAL-ToBERT significantly outperforms LEGAL-BERT in each considered metric.</p>
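      <p>The relative improvements quoted for the MCC can be reproduced from the reported table values, and the MCC itself can be computed from a confusion matrix. The following is a minimal sketch: the function names are hypothetical, and the multiclass MCC uses the standard confusion-matrix generalization, which is assumed here rather than taken from the authors' evaluation code. The four MCC values are those reported for LEGAL-BERT and LEGAL-ToBERT on the two test sets.</p>
      <preformat>
```python
import math


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a percentage."""
    return 100.0 * (improved - baseline) / baseline


def multiclass_mcc(confusion):
    """Matthews Correlation Coefficient from a square confusion matrix
    (rows are true labels, columns are predicted labels)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]
    pred_counts = [sum(confusion[r][k] for r in range(n)) for k in range(n)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    return numerator / denominator if denominator else 0.0


ita_gain = relative_improvement(0.806, 0.972)    # MCC on ITA-RhetRoles
build_gain = relative_improvement(0.559, 0.727)  # MCC on BUILD
```
      </preformat>
      <p>Rounded to the nearest integer, the two gains match the 21% and 30% improvements stated in the text.</p>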
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>We also analyzed the performance of our method on each rhetorical role separately. Table 4 shows the precision, recall, and macro F1 score for each rhetorical role. In terms of macro F1 score, LEGAL-ToBERT achieves better performances for each rhetorical role, apart from introductory sentences, for which the performances of the two models are comparable. Specifically, the improvement in terms of macro F1 scores ranges from 3% (DC - decisional sentences) to 16% (SAJ - sentences summarizing the appealed judgment).</p>
      <p>Table 4. Rhetorical roles (RR): INT, CP, SAJ, LR, DC.</p>
      <p>LEGAL-ToBERT achieves a macro F1 score of 0.57 and an MCC of 0.73 on the BUILD test set, improving the baselines set by LEGAL-BERT by 22% and 30% respectively.</p>
      <table-wrap>
        <label>Table 5</label>
        <caption><p>Test results for BUILD dataset.</p></caption>
        <table>
          <thead>
            <tr><th>Metric</th><th>LEGAL-BERT</th><th>LEGAL-ToBERT</th></tr>
          </thead>
          <tbody>
            <tr><td>Accuracy</td><td>0.656</td><td>0.785</td></tr>
            <tr><td>MCC</td><td>0.559</td><td>0.727</td></tr>
            <tr><td>F1 (Macro)</td><td>0.472</td><td>0.574</td></tr>
            <tr><td>F1 (Micro)</td><td>0.656</td><td>0.785</td></tr>
            <tr><td>P (Macro)</td><td>0.532</td><td>0.623</td></tr>
            <tr><td>P (Micro)</td><td>0.656</td><td>0.785</td></tr>
            <tr><td>R (Macro)</td><td>0.457</td><td>0.564</td></tr>
            <tr><td>R (Micro)</td><td>0.656</td><td>0.785</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We also analyzed the performance of our method on each rhetorical role separately. Table 6 shows the precision, recall, and macro F1 score for each rhetorical role. In terms of macro F1 score, LEGAL-ToBERT outperforms LEGAL-BERT in almost every rhetorical role, apart from sentences asserting the petitioner arguments (ARGP), for which, surprisingly, LEGAL-BERT performs 12% better. The two models perform equally well on sentences about not relied precedents (PRENR), sentences presenting the issue of the debate (ISSUE) and statute sentences (STA). The improvement achieved by LEGAL-ToBERT for all other rhetorical roles ranges from 4% (RPC - ruling sentences by the present court) to 40% (ARGR - sentences asserting the respondent argument).</p>
      <p>5.3. Discussion</p>
      <p>Our experiments show that approaches to legal RRC based on LEGAL-ToBERT greatly improve the baselines set by vanilla stand-alone LEGAL-BERT models, in two different languages and legal contexts.</p>
      <p>The large improvement in performance is attributable to the capability of ToBERT models to deal effectively with long documents, by considering and leveraging the relationships between the different sentences of the same legal judgement. Other than this, the relative positional encoding strategy that we applied in the upper layer of our hierarchical transformer allows our approach to take into account the correlation between the rhetorical roles of the individual sentences and their relative positions in the document, which provides further hints for the correct classification of a sentence, leveraging the repetitive rhetorical structure of legal documents.</p>
      <p>LEGAL-ToBERT results are particularly surprising in the case of the ITA-RhetRoles dataset. This is most reasonably due to the higher amount of data it is composed of, which allows a complex model like ToBERT to reach and exploit its maximum potential, and to the ease of this task, given the repetitiveness of the structure of the documents used. The results achieved on the BUILD dataset are much worse in absolute terms, due to the greater difficulty of the task (many more labels, much less data), but the relative improvement introduced by ToBERT on the baseline is comparable, if not even better, with respect to that achieved on ITA-RhetRoles (MCC improves by 30% in the case of BUILD and by 21% in that of ITA-RhetRoles).</p>
      <p>Such promising results encourage the use of this model architecture to automate RRC in related applications, giving high hopes of achieving relevant outcomes in many different legal document analysis tasks.</p>
      <p>5.4. Limitations</p>
      <sec id="sec-4-3">
        <title>While ToBERT models have shown impressive performance on legal RRC benchmarks, we want to highlight some of their main limitations.</title>
        <p>ToBERT models are computationally expensive.
They rely on a huge number of parameters,
which makes training and fine-tuning much more
computationally expensive than other competitive approaches,
including CRFs and stand-alone BERT models. This can
seriously limit scalability and practicality of use in
certain applications. For instance, processing very long
documents (e.g., thousands of sentences) or documents with
very long sentences (e.g., many hundreds of tokens) could
become infeasible, in terms of both time and space
complexity, without very powerful computational resources.</p>
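        <p>As a rough illustration of the scaling concern above, the sketch below counts pairwise attention scores in a two-level model. It assumes standard quadratic self-attention at both levels: a sentence-level encoder applied to each sentence independently, and a document-level encoder over one embedding per sentence. It ignores the number of layers and heads, feed-forward cost, and padding. The function name attention_pairs and the document sizes are assumptions introduced here for illustration only.</p>
        <preformat>
```python
def attention_pairs(num_sentences: int, tokens_per_sentence: int) -> dict:
    """Count pairwise attention scores (per layer and head) in a two-level encoder.

    Assumes full quadratic self-attention at both levels; illustrative only.
    """
    # Sentence level: each sentence is encoded on its own, so it contributes
    # tokens_per_sentence ** 2 token-to-token scores.
    sentence_level = num_sentences * tokens_per_sentence ** 2
    # Document level: one embedding per sentence, attended over the whole document.
    document_level = num_sentences ** 2
    return {
        "sentence_level": sentence_level,
        "document_level": document_level,
        "total": sentence_level + document_level,
    }


if __name__ == "__main__":
    # Hypothetical document sizes, chosen only to show the growth rate.
    for n_sent in (100, 1000, 4000):
        print(n_sent, attention_pairs(n_sent, tokens_per_sentence=128))
```
        </preformat>
        <p>Under these assumptions, the sentence-level term grows linearly with the number of sentences and quadratically with sentence length, while the document-level term grows quadratically with the number of sentences.</p>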
        <p>ToBERT models require large amounts of
annotated data. When running experiments on very small
datasets (fewer than 100 documents), we did not find any
advantage in using ToBERT over vanilla BERT.</p>
        <p>These and other experimental results suggest that the
effectiveness of automated legal RRC using supervised
NLP models is highly affected by the size and complexity
of the dataset and the quality of the annotations. The
need of such approaches for large, high-quality
datasets is very restrictive, as the availability of such
datasets in the legal context is particularly limited for
privacy and confidentiality reasons.</p>
        <p>ToBERT models do not generalize well to documents
longer than those seen during training. For
architectural reasons, ToBERT models cannot effectively
handle documents longer than those seen during
training.</p>
        <p>ToBERT models may lack interpretability. A hierarchical
use of transformer-based models introduces
a further layer of complexity, which makes it even more
challenging to interpret model decisions, leading to
difficulties in identifying and diagnosing errors or biases in model
predictions.</p>
        <p>LEGAL-ToBERT models suffer from limited multilingual
support. LEGAL-ToBERT models rely on pre-trained
language-specific LEGAL-BERT models, which
makes it difficult to apply this approach to multilingual
or cross-language tasks. Deploying such models is not
easy, as it requires fine-tuning a BERT model on
a huge amount of legal documents in the considered
language. Still, our hope is that the availability of legal
domain-specific pre-trained models will quickly improve
with time, breaking new ground in many different
languages.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and Future Work</title>
      <sec id="sec-5-1">
        <p>In this work we introduced LEGAL-TransformerOverBERT
(LEGAL-ToBERT), a novel
approach to legal rhetorical roles classification that
leverages the power of Hierarchical Transformers and
legal-domain-specific BERT models. We also proposed a
novel embedding strategy for the top layer encoder of
LEGAL-ToBERT, based on the sinusoidal encoding of
the document sentences using their relative position in
the document instead of the absolute one. Our results
provide evidence that this approach yields a robust
and effective framework that can efficiently classify the
rhetorical roles of the sentences of long legal documents
by taking into account the relationships between them.</p>
        <p>We tested the effectiveness of LEGAL-ToBERT on two
different datasets. The first one is ITA-RhetRoles, a novel
yet confidential dataset, consisting of thousands of
documents from the Italian Civil Court Corpus; the second
one is the BUILD benchmark dataset, composed of a
couple of hundred documents from a varied set of Indian
courts. This allowed us to diversify our experiments
in terms of both language and topic. We showed that
LEGAL-ToBERT significantly outperforms vanilla
stand-alone LEGAL-BERT models on both the ITA-RhetRoles and
BUILD datasets, improving the macro F1 score by 12%
and 22% and the MCC by 21% and 30%, respectively.</p>
        <p>Future research should aim to extend and improve the
proposed approach to other domains and languages. It is
also important to address the problem of building robust
frameworks in the absence of large datasets, which is most
often the case when dealing with the legal domain. On
the other hand, we hope that the constant progress in
legal NLP will incentivize the collection and release
of increasingly large datasets. Finally, our models are
publicly available and ready to use, and we ourselves
plan to leverage them to enable and improve many
downstream applications such as summarization and argument
mining of legal documents.</p>
        <p>Acknowledgments</p>
        <p>This work is part of the Italian nationwide “Giustizia Agile”
(Agile Justice) project<sup>4</sup>, funded by the Italian Ministry of
Justice.</p>
        <p>References</p>
      </sec>
      <sec id="sec-5-2">
        <title>4 More information is available at https://www.unitus.it/it/unitus/mappatura-della-ricerca/articolo/giustizia-agile.</title>
        <p>ference on Semantic Computing (ICSC), IEEE, 2020, pp. 464–467.</p>
        <p>[8] S. Ghosh, A. Wyner, Identification of rhetorical roles of sentences in indian legal judgments, in: Legal Knowledge and Information Systems: JURIX 2019: The Thirty-second Annual Conference, volume 322, IOS Press, 2019, p. 3.</p>
        <p>[9] D. Licari, G. Comandè, ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law, in: D. Symeonidou, R. Yu, D. Ceolin, M. Poveda-Villalón, D. Audrito, L. D. Caro, F. Grasso, R. Nai, E. Sulis, F. J. Ekaputra, O. Kutz, N. Troquard (Eds.), Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, volume 3256 of CEUR Workshop Proceedings, CEUR, Bozen-Bolzano, Italy, 2022. URL: https://ceur-ws.org/Vol-3256/#km4law3, ISSN: 1613-0073.</p>
        <p>[10] P. Henderson, M. S. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, D. E. Ho, Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset, 2022. arXiv:2207.00220.</p>
        <p>[11] A. Chriqui, I. Yahav, I. Bar-Siman-Tov, Legal hebert: A bert-based nlp model for hebrew legal, judicial and legislative texts, SSRN preprint:4147127 (2022).</p>
        <p>[12] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.</p>
        <p>[13] B. Hachey, C. Grover, A rhetorical status classifier for legal text summarisation, in: Text Summarization Branches Out, 2004, pp. 35–42.</p>
        <p>[14] M.-F. Moens, E. Boiy, R. M. Palau, C. Reed, Automatic detection of arguments in legal texts, in: Proceedings of the 11th International Conference on Artificial Intelligence and Law, ICAIL ’07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 225–230. URL: https://doi.org/10.1145/1276318.1276362. doi:10.1145/1276318.1276362.</p>
        <p>[15] M. Saravanan, B. Ravindran, Identification of rhetorical roles for segmentation and summarization of a legal judgment, Artificial Intelligence and Law 18 (2010) 45–76.</p>
        <p>[16] V. Malik, R. Sanjay, S. K. Guha, A. Hazarika, S. Nigam, A. Bhattacharya, A. Modi, Semantic segmentation of legal documents via rhetorical roles, 2022. arXiv:2112.01836.</p>
        <p>[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).</p>
        <p>[18] P. Bhattacharya, S. Paul, K. Ghosh, S. Ghosh, A. Wyner, Deeprhole: deep learning for rhetorical role labeling of sentences in legal case documents, Artificial Intelligence and Law (2021) 1–38.</p>
        <p>[19] J. Lu, M. Henchion, I. Bacher, B. M. Namee, A sentence-level hierarchical bert model for document classification with limited labelled data, in: Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, October 11–13, 2021, Proceedings 24, Springer, 2021, pp. 231–241.</p>
        <p>[20] I. Chalkidis, X. Dai, M. Fergadiotis, P. Malakasiotis, D. Elliott, An exploration of hierarchical attention transformers for efficient long document classification, arXiv preprint arXiv:2210.05529 (2022).</p>
        <p>[21] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical transformers for long document classification, in: 2019 IEEE automatic speech recognition and understanding workshop (ASRU), IEEE, 2019, pp. 838–844.</p>
        <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
        <p>[23] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.</p>
        <p>[24] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.</p>
        <p>[25] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 24, Curran Associates, Inc., 2011. URL: https://proceedings.neurips.cc/paper_files/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Farzindar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lapalme</surname>
          </string-name>
          ,
          <article-title>Letsum, an automatic legal text summarizing system</article-title>
          ,
          <source>JURIX</source>
          (
          <year>2004</year>
          )
          <fpage>11</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Nejadgholi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bougueng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Witherspoon</surname>
          </string-name>
          ,
          <article-title>A semi-supervised training method for semantic search of legal facts in canadian immigration cases</article-title>
          , in:
          <source>JURIX</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Segmenting us court decisions into functional and issue specific parts</article-title>
          , in:
          <source>JURIX</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hachey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grover</surname>
          </string-name>
          , Extractive summarisation of legal texts,
          <source>Artificial Intelligence and Law</source>
          <volume>14</volume>
          (
          <year>2006</year>
          )
          <fpage>305</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kalamkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <article-title>Corpus for automatic structuring of legal documents</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>4420</fpage>
          -
          <lpage>4429</lpage>
          . URL: https://aclanthology.org/2022.lrec-1.470.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pillaipakkamnatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Linares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Pesce</surname>
          </string-name>
          ,
          <article-title>Automatic classification of rhetorical roles for sentences: Comparing rule-based scripts with machine learning</article-title>
          ,
          <source>ASAIL@ ICAIL</source>
          <volume>2385</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sahibzada</surname>
          </string-name>
          ,
          <article-title>Understanding legal documents: classification of rhetorical role of sentences using deep learning and natural language processing</article-title>
          , in: 2020 IEEE 14th International Con-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>