<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Effect of Hierarchical Domain-specific Language Models and Attention in the Classification of Decisions for Legal Cases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nishchal Prasad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohand Boughanem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taoufiq Dkaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut de Recherche en Informatique de Toulouse (IRIT)</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In order to automate a judicial process, a model that accurately predicts the most probable decision of a legal case from its facts is desired. We explore this task of decision prediction on large, unannotated, and unstructured legal documents for which only the final decisions are available. For this task, we explored many available deep learning architectures, including transformer-based language models (BERT, XLNet), a domain-specific language model (LEGAL-BERT), attention mechanisms, and sequence models (LSTM, GRU). With different combinations of these architectures and methods, we ran extensive experiments on an English legal dataset called ILDC and developed several hierarchical domain-specific language models, all of which improve performance by at least 2 metric points, with the best among them giving an improvement of approximately 3 metric points over the previous baseline models on this dataset. This shows that domain-specific models, when fine-tuned, adapt well to a domain of the same nature but with a different syntax, lexicon, and grammar, and improve performance significantly.</p>
      </abstract>
      <kwd-group>
<kwd>LEGAL-BERT</kwd>
        <kwd>Domain-specific Large Document Classification</kwd>
        <kwd>Legal Case Prediction</kwd>
        <kwd>Large Unstructured Documents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A mechanism to assist judges and courts in reaching a conclusion on the outcome of an ongoing
legal case has been sought after for many years [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One of the major milestones
in developing such a robust mechanism for practical legal assistance is the prediction of court
judgments in a real-life setting, i.e., predicting the most probable decision from only the previous
case arguments and case facts. This can help to speed up the slow judicial process that plagues
the judicial systems of many countries; one such example can be seen in the Indian judicial
system. A solution to this problem of legal case decision prediction can also help to cut the
cost of case proceedings for people unfamiliar with the intricacies of the judicial system and law articles, by
giving useful decision results and insights into their legal cases. This would give the courts the
required time and space to develop other branches of the judicial process and also de-congest
tribunal court cases.
      </p>
      <p>Since legal documents are mostly language-oriented, in the form of complex legal texts,
the task of decision prediction has been formulated as a text classification task. However, compared
to a general text classification problem, legal decision prediction is more complex
and sophisticated. This is due to many reasons, including the unstructured, unannotated, and
noisy textual representation of legal case proceedings, which makes automatically
extracting the arguments and facts from the case proceedings difficult. Legal text also
differs from standard text in terms of lexical understanding, having a very specific vocabulary
and complex document structure, which requires adapting pre-trained models (trained on
general text) to legal texts.</p>
      <p>
        In this paper, we confront the problem of decision prediction from legal texts by
developing deep learning methods. We aim to predict the final decision of a legal case from
its facts and arguments in unannotated and unstructured legal documents, which replicates
the real-life setting of legal case documents. We work only on the development of a robust
predictor, while work on the explanation of the predictions is underway at the time of
writing this paper. Although this is not a novel task in itself, it is our first step towards developing
an architectural model for legal understanding and decision prediction. We have explored the
effect of a domain-specific language model (LEGAL-BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) over the general ones and
provided experimental results for the same. While our work deals with legal
texts in the English language, its findings can be leveraged and adapted to legal
texts in any language, on the condition that there is a sufficiently large clean dataset in the
same language for the models and methods to be adapted (trained) on.
      </p>
      <p>The main contributions of this paper are summarized below:
• Legal judgment prediction model:</p>
      <p>
        We propose a baseline model for legal judgment prediction which hierarchically builds
upon a domain-specific BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], known as LEGAL-BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and a two-layered Bi-LSTM
with multi-head scaled dot-product attention [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which achieves significantly higher metric
scores than the previous baseline models. The model is based on the hypothesis that
a domain-specific pre-trained language model is transferable within the same domain. This
hypothesis is also supported by the experimental results in the following sections.
• Experimental approaches:
      </p>
      <p>
        We have explored the ILDC dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and experimented with state-of-the-art
architectures involving recurrent neural networks (GRU, LSTM, CNN), transformers (BERT,
XLNet), and attention mechanisms on a dataset of large unstructured and unannotated
legal documents.
• Evaluations:
      </p>
      <p>We performed extensive experiments on the ILDC dataset with different baseline models
and improved upon their architectures to develop a final proposed baseline architecture,
which achieves significantly higher metric scores on the task the previous
baseline architectures were trained on, showing that fine-tuning pre-trained
domain-specific language models helps them adapt to, and better understand, a similar
domain language with a different lexicon, grammar, and syntactic setting.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Several studies using machine learning and deep learning methods have been conducted in
the past on the problem of automatically predicting the outcome of a legal case, alongside providing
different approaches, methods, and corpora suited to individual prediction tasks. In 2018, Xiao
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] released the Chinese AI and Law challenge dataset (CAIL2018) for legal judgment
prediction, which contains rich annotations for the judgments of more than 2.6 million criminal
cases. This dataset consists of detailed annotations of the law articles related to cases, the prison
terms, and the charges. Chalkidis et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced a dataset built from the case proceedings
of the European Court of Human Rights, in English, where each case has a score stating its
importance. They described a Legal Judgment Prediction (LJP) task for this dataset, which
aims to predict the outcome of a legal case from the annotated case facts and law violations.
For this task, they proposed a hierarchical version of BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to tackle BERT’s limitation
of a fixed number of input tokens. Zhong et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed TOP-JUDGE, which formulates
the dependency among the subtasks of legal judgment prediction through Directed Acyclic
Graphs (DAG), attending to the relations between different subtasks of judgment prediction
through topological multi-task learning. Luo et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] defined a charge prediction task from
the case facts of a Chinese criminal case dataset and proposed an attention-based method to
predict the charges along with relevant law articles. Zhong et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed QAjudge, based
on reinforcement learning, to predict the outcome of a legal case from the facts while visualizing
the process, giving interpretable judgments. Chen et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a Deep Gating Network
(DGN) to predict the prison term for criminals based on the criminal charges and the case facts.
      </p>
      <p>
        While much of the research focuses on legal case prediction for a specific setting (such as
civil or criminal) with richly annotated cases providing good learning signals for the
decision classification, we focus more on predicting the outcome of general legal cases from
large unannotated and unstructured legal documents. Malik et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced a dataset named
the Indian Legal Document Corpus (ILDC) and experimented upon it to provide a baseline model
with their Case Judgment Prediction and Explanation (CJPE) task, which achieves a macro-F1
score of 77.79% and an accuracy of 78% in the judgment prediction task.
      </p>
      <p>
        CJPE is somewhat similar to our task, while we aim to leverage our task for French legal
documents in the future, with more focus on clustering the case documents by their specific types.
Because of the similarity of the ILDC dataset with the dataset requirements of our first task (of
predicting decisions from unstructured legal documents), we develop, experiment with, and evaluate
our classification models on the ILDC dataset contributed by Malik et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>We formulate this task of legal judgment prediction as a text classification problem, given
below:</p>
      <p>For an unstructured legal case document ‘C’, predict its decision ‘D’ among the two labels
‘accepted’(= 1) and ‘rejected’(= 0), given only the facts of the legal document.</p>
      <p>To move forward with the classification task, we experimented with several deep learning
architectures and methods, detailed hereafter.</p>
      <p>3.1. Sequence-to-sequence RNN encoders</p>
      <p>We experimented with recurrent neural networks (RNN) such as GRU [12] and
LSTM [13], made bidirectional [14] to process the sequence information in both the forward
and backward directions. Since the ILDC dataset consists of large documents of variable lengths,
each having several sentences (tens of thousands of tokens in total), it becomes computationally
complex and expensive to process and determine the embeddings of all individual words as a
sequence of sequences, i.e., words in sentences in a document, for all documents. Instead, we
resort to encoding the sentences as sequences in a document (i.e., a sequence of encoded/vectorized
sentences). To encode the sentences in the documents we separately used two
state-of-the-art pre-trained sentence encoders, namely the Universal Sentence Encoder [15] and S-BERT
[16], trained on general texts. We divided the documents into chunks (with the idea
that these chunks can be treated as a near estimate of the sentences in the documents) with
overlaps to account for the sentence breaks missed while dividing/chunking. These are passed
into the encoders to obtain the chunk embeddings. The chunk embeddings for a document
are concatenated together for further processing.</p>
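      <p>The chunking-with-overlap scheme described above can be sketched as follows. This is a minimal illustration: the chunk and overlap sizes and the stand-in encoder are hypothetical, not the paper's actual values; the encoder plays the role of the Universal Sentence Encoder or S-BERT.</p>
      <preformat>
```python
# Split a document into overlapping chunks and encode each chunk,
# approximating a sequence of sentence embeddings for the document.
# Chunk/overlap sizes here are illustrative, not the paper's values.

def chunk_with_overlap(tokens, chunk_size=90, overlap=10):
    """Yield token chunks of `chunk_size` sharing `overlap` tokens
    between consecutive chunks, so a sentence cut at a chunk boundary
    still appears whole in one of the two neighbouring chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

def encode_document(text, encoder):
    """Encode a document as the list of its chunk embeddings.
    `encoder` is any callable mapping a string to a fixed-size vector
    (e.g. a pre-trained sentence encoder)."""
    tokens = text.split()
    return [encoder(" ".join(chunk)) for chunk in chunk_with_overlap(tokens)]

doc = " ".join(["word%d" % i for i in range(200)])
fake_encoder = lambda s: [float(len(s))]          # stand-in encoder
embeddings = encode_document(doc, fake_encoder)   # one vector per chunk
```
      </preformat>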
      <p>
        We used a Bi-LSTM (or Bi-GRU) with two layers and dropouts in between, with further
feed-forward layers for classification.
      </p>
      <p>3.2. Transformer Encoders</p>
      <p>
        Pre-trained transformer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] encoders such as BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and XLNet [17] have shown significant
improvements in language modeling and understanding, and can be adapted to downstream
tasks by fine-tuning the internal weights or by pre-training altogether on a
domain-specific task, either from scratch or from a previous pre-trained checkpoint. In our work we
experimented with BERT-base and XLNet-base, trained on general text, by fine-tuning them on the
training set. We used max-pooling on the output of the final layer to get a document-level
representation as the input to a feed-forward network for classification. Since text in the
legal domain has a specific lexicon and vocabulary and differs in syntax from general text, the
sentence and document embeddings generated by models pre-trained on general text may
not properly adapt to the domain-specific context. Hence we also tested this argument
with a BERT model pre-trained on legal text, known as LEGAL-BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The same architecture
of max-pooling and a feed-forward network was used to compare with the results from the previous
BERT and XLNet models, which can be found in Table 3. A document is divided into
smaller chunks with overlap (as in Section 3.1), each having 512 tokens including the [CLS] and
[SEP] tokens [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These chunks are passed into the tokenizer (of the respective transformer
encoder model), and the tokenized chunks are concatenated together to form the
tokenized representation of the document, used as the input to the transformer encoder
model.
      </p>
      <p>Implementations: www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional,
www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM,
www.tensorflow.org/api_docs/python/tf/keras/layers/GRU,
https://huggingface.co/docs/transformers/model_doc/bert,
https://huggingface.co/docs/transformers/model_doc/xlnet.</p>
      <p>3.3. Hierarchical Transformers</p>
      <p>
        We used a hierarchical transformer method, taking the idea from [18]. The document is divided
into chunks (with or without overlaps) of a fixed length of 512 tokens, including the [CLS] and
[SEP] tokens. Each chunk is passed into the tokenizer to obtain the tokenized representation
used as the input to the respective transformer encoder model. The output of the last layer of
the transformer encoder model is max-pooled to get the [CLS] representation of the chunk.
These [CLS] representations are accumulated together to form a new sequence,
used as the embedding for further processing with sequence encoder layers (Bi-GRU, Bi-LSTM,
etc.) for classification. The details of the model architecture for the hierarchical transformer
can be seen in Table 1. LEGAL-BERT fine-tuned on ILDC<sub>multi</sub> is used to extract the [CLS]
representations, owing to its better performance compared to the other transformer architectures
(Table 3). Also, it can be argued that even though LEGAL-BERT is pre-trained on US/EU legal
texts and not on Indian legal texts (which differ in lexicon and syntax) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], its fine-tuned
model can be adapted to the respective setting in the same way as other pre-trained models
trained on general texts are used (with fine-tuning) for domain-specific downstream tasks (as
can be seen in the experimental results in Table 3). In general, we experimented with two
different types of setup in this architecture:
• Without attention: The accumulated [CLS] vectors are taken as embedding inputs to
the sequence models used in Section 3.1, which consist of the general setup of two layers
of either Bi-GRU, Bi-LSTM, or their combination. Dropouts were also introduced between
the bidirectional layers to increase randomization and prevent overfitting.
• With attention: Dot-product attention [19] was used with the Bi-LSTM (layer 2) output
as the query and key-value pair. Multi-head scaled dot-product attention [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] over the
outputs of the Bi-LSTMs was also used. We used different combinations of query and
key-value pairs for the multi-head scaled dot-product attention, which are:
– The accumulated [CLS] representations of a document as the query, and the output
of Bi-LSTM (layer 1) as the key-value pair.
– The output of Bi-LSTM (layer 1) as the query, and the Bi-LSTM (layer 2) output as
the key-value pair.
      </p>
      <p>– Bi-LSTM (layer 2) output as the query and key-value pair.</p>
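      <p>The hierarchical head described above can be sketched in Keras as follows. This is a minimal sketch under stated assumptions: all layer widths, the number of chunks, the number of attention heads, and the dropout rate are illustrative placeholders, not the hyper-parameters of Table 1, and the sequence of [CLS] vectors is assumed to have been pre-extracted by the fine-tuned encoder. It shows one of the query/key-value combinations listed above (layer-1 output as query, layer-2 output as key and value).</p>
      <preformat>
```python
import tensorflow as tf

# Hierarchical head over pre-extracted [CLS] chunk embeddings:
# two Bi-LSTM layers with dropout in between, multi-head scaled
# dot-product attention, then feed-forward layers with a sigmoid
# output for the binary accepted/rejected label.
# All sizes below are illustrative, not the paper's hyper-parameters.

NUM_CHUNKS, CLS_DIM = 32, 768   # assumed [CLS] sequence length and width

cls_seq = tf.keras.Input(shape=(NUM_CHUNKS, CLS_DIM))
h1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(cls_seq)
h1 = tf.keras.layers.Dropout(0.01)(h1)
h2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(h1)

# Bi-LSTM layer-1 output as query, layer-2 output as key and value.
att = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)(
    query=h1, value=h2, key=h2)

pooled = tf.keras.layers.GlobalMaxPooling1D()(att)
ff = tf.keras.layers.Dense(64, activation="relu")(pooled)
out = tf.keras.layers.Dense(1, activation="sigmoid")(ff)

model = tf.keras.Model(cls_seq, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```
      </preformat>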
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup and Hyperparameters</title>
      <p>For all the experiments and architecture development we used the TensorFlow framework
(www.tensorflow.org) and the pandas (https://pandas.pydata.org/) and NumPy (https://numpy.org/)
libraries. The pre-trained transformer models were taken from the
HuggingFace (https://huggingface.co/) library. The experiments were run on Colab
(https://colab.research.google.com/) with an Nvidia Tesla P100 (16GB)
GPU. For all the experiments, sigmoid activation was used for classification in the last layer,
the ReLU activation function was chosen for the hidden feed-forward layers, and the sequence
models use the tanh activation function. Adam [20] was used as the optimization algorithm for
training. As this is a binary classification problem, we use binary cross-entropy as the loss
function. To train the models we reduce the learning rate by a factor of 0.95 based on updates
of the monitored metric, with a patience of two epochs
(https://keras.io/api/callbacks/reduce_lr_on_plateau/). All the transformer models of Section 3.2
were fine-tuned for two epochs with a batch size of 10 documents. The hierarchical transformer
architectures of Section 3.3 were trained with a batch size of 32 documents. Architecture-specific
details and other hyper-parameters can be found in Table 1.</p>
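      <p>The training configuration described above can be sketched as follows; a minimal illustration, assuming the monitored quantity is the validation loss (the monitored metric is not named in the text) and that `model` is any Keras model for this task:</p>
      <preformat>
```python
import tensorflow as tf

# Training setup as described: Adam optimizer, binary cross-entropy
# loss, and a schedule that multiplies the learning rate by 0.95 when
# the monitored metric stops improving, with a patience of two epochs.
# Monitoring "val_loss" is an assumption.

def compile_for_training(model):
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.95, patience=2)

# Usage (hypothetical data):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=32, callbacks=[reduce_lr])
```
      </preformat>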
    </sec>
    <sec id="sec-5">
      <title>5. Dataset description</title>
      <p>
        We used the dataset introduced by Malik et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which contains case proceedings
from the Supreme Court of India. Whether the claim(s) filed by the appellant in a case
are ‘accepted’ or ‘rejected’ is decided by the court, and this decision is taken as the
label for the respective legal case document in the dataset. These labels are used to train the
models/architectures in the experiments. The dataset has two parts, ILDC<sub>single</sub> and ILDC<sub>multi</sub>.
ILDC<sub>single</sub> consists of those case proceedings for which there is a single decision for a petition, or
the same decision across all of multiple petitions, while the documents in ILDC<sub>multi</sub> are the more
common case of proceedings that involve multiple petitions with different decisions. The
labeling of the documents in ILDC<sub>multi</sub> is simplified (computing multiple
decisions for multiple petitions is computationally complex and expensive): the label is
set to the ‘accepted’ class if a single petition among the multiple appeals is ‘accepted’, and otherwise
it is set to the ‘rejected’ class. The dataset statistics are given in Table 2. We experiment with
the same subsets of the dataset for training, validation, and testing as provided by the authors,
to maintain consistency in the experimental results and to compare on the same test cases across
all the experiments for decision classification.
      </p>
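      <p>The multi-petition labeling rule stated above can be illustrated as follows (the per-petition decision lists are hypothetical examples):</p>
      <preformat>
```python
# Labeling rule for multi-petition documents as described: a document
# is labeled 'accepted' (1) if at least one of its petitions is
# accepted, otherwise 'rejected' (0).

def document_label(petition_decisions):
    """petition_decisions: list of 0/1 outcomes, one per petition."""
    return 1 if any(d == 1 for d in petition_decisions) else 0

label_a = document_label([0, 1, 0])  # one accepted petition -> 1
label_b = document_label([0, 0, 0])  # all rejected -> 0
```
      </preformat>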
    </sec>
    <sec id="sec-6">
      <title>6. Results and discussion</title>
      <p>
        To measure model performance we used the macro-precision, macro-recall, and macro-F1
scores as our performance metrics, in order for the results to be comparable with the previous
models on the same dataset. In Table 3 we omit the results of the pre-trained transformer
models trained on ILDC<sub>single</sub>, since we only use the pre-trained transformer models fine-tuned
on ILDC<sub>multi</sub> for further development of the hierarchical models. Also, since the number of
training instances is much smaller in ILDC<sub>single</sub>, the fine-tuned transformer models yield less
understanding compared to ILDC<sub>multi</sub>. (The dataset can be requested from its original authors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; we do not have the rights to circulate it.)
      </p>
      <p>(Table 3 rows: sequence-to-sequence RNN encoders, trained on each ILDC subset — Universal Sentence Encoder + BiLSTM, Universal Sentence Encoder + BiGRU + Dropout(0.01), S-BERT embeddings + BiLSTM, S-BERT embeddings + BiLSTM + Dropout(0.01); pre-trained transformer encoders — BERT, XLNet, LEGAL-BERT; hierarchical transformers — Bi-GRU, Bi-LSTM + Bi-GRU, Bi-LSTM, Bi-LSTM + Dropout(0.01).)</p>
      <p>
        As can be seen in Table 3, the sequence models
with the pre-trained encoders (Universal Sentence Encoder and S-BERT) perform poorly
on all the performance metrics. This can be accounted for by the fact that these
encoders are not fine-tuned during the model training process, and their embeddings are
more aligned to general texts than to the domain-specific legal texts. Even so,
the embeddings from the Universal Sentence Encoder give slightly better performance than the
S-BERT embeddings on both ILDC<sub>single</sub> and ILDC<sub>multi</sub>, without any architectural modifications
(i.e., dropouts) to the baseline RNN layer. The pre-trained transformer models trained on general
English texts improved the metric scores, with BERT achieving an F1 score of 0.6322 and XLNet
achieving an F1 score of 0.7103, while the domain-specific LEGAL-BERT model (pre-trained on
legal texts) gives the best results, an increase of ≈ 4% over the XLNet model. These improvements
in the metrics led us to choose LEGAL-BERT as the base layer for our hierarchical
transformer models. Bi-GRU over LEGAL-BERT was taken as the baseline model, which shows
a significant performance improvement over the previous models experimented on this dataset
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], as can be seen in Table 3. With Bi-LSTM there is a slight improvement in the metric
scores on both ILDC<sub>single</sub> and ILDC<sub>multi</sub>. Adding dropouts over the Bi-LSTM layers results
in a decrease in performance on ILDC<sub>single</sub> for the same number of epochs (= 6) but an
improvement on ILDC<sub>multi</sub>. Since ILDC<sub>single</sub> is a small set compared to ILDC<sub>multi</sub>, adding
dropouts slows down the model’s ability to converge to the optimum boundary. Hence we
trained this model for two more epochs, improving the performance to a 0.8084
F1 score on ILDC<sub>single</sub> (Table 3). There was a marginal decrease in the metrics when using the
dot-product attention, while using the multi-head attention (with the query and key-value
combinations shown in Table 3) resulted in slight performance improvements, to 0.8070 and
0.8125 F1 scores on the test sets of the two ILDC subsets, for the hierarchical
transformer model.
      </p>
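      <p>The macro-averaged metrics used above can be computed as follows; a self-contained sketch (in practice a library routine such as scikit-learn's precision_recall_fscore_support with macro averaging gives the same quantities):</p>
      <preformat>
```python
# Macro-precision, macro-recall, and macro-F1 for binary labels:
# per-class precision/recall/F1 averaged with equal class weight,
# as used to compare against the previous ILDC baselines.

def macro_scores(y_true, y_pred, classes=(0, 1)):
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

p, r, f = macro_scores([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```
      </preformat>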
      <p>This shows that the dot-product attention and multi-head scaled dot-product attention
mechanisms used here do not improve the performance significantly. This can be attributed to
the fact that the [CLS] embeddings used for the sequence models in the hierarchical transformer
already contain the representations learnt by the internal multi-head attention of the
transformer architecture. Whether other attention mechanisms improve the performance
of the hierarchical transformers is yet to be explored.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>
        In this paper, we have explored the problem of decision classification for large unstructured and
unannotated legal documents. We formulate this problem as decision prediction for
legal case documents in a real-life scenario. To experiment with our models we used the ILDC
dataset. We explored various state-of-the-art pre-trained language models (BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], XLNet
[17], LEGAL-BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), attention mechanisms, and sequence models (LSTM, GRU) for the decision
prediction task on the ILDC dataset. Based upon their performance we developed several
baseline hierarchical domain-specific transformer models which significantly improve upon the
performance metrics of the previous models trained on the ILDC dataset. Our experiments
show that LEGAL-BERT (a pre-trained domain-specific language model trained on the
legal texts of European Union and United States court proceedings, each having their own specific
legal terms, syntax, and grammar), when fine-tuned on the legal case texts of the Supreme
Court of India, adapts well to the grammar, lexicon, and syntax of the Indian legal system. This
finding shows that domain-specific pre-trained language models can be adapted well to the
same domain with a different language setting (syntax, grammar, lexicon). We aim to leverage
this work for the prediction and classification of French legal cases in the future.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the LawBot project, granted by the ANR, the French Agence Nationale
de la Recherche.</p>
      <p>[11] … deep gating network, CoRR abs/1908.11521 (2019). URL: http://arxiv.org/abs/1908.11521. arXiv:1908.11521.
[12] K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, CoRR abs/1409.1259 (2014). URL: http://arxiv.org/abs/1409.1259. arXiv:1409.1259.
[13] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735–1780. URL: https://doi.org/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
[14] M. Schuster, K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673–2681. doi:10.1109/78.650093.
[15] D. Cer, Y. Yang, S. yi Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, R. Kurzweil, Universal sentence encoder, 2018. arXiv:1803.11175.
[16] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.
[17] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, CoRR abs/1906.08237 (2019). URL: http://arxiv.org/abs/1906.08237. arXiv:1906.08237.
[18] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical transformers for long document classification, CoRR abs/1910.10781 (2019). URL: http://arxiv.org/abs/1910.10781. arXiv:1910.10781.
[19] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1409.0473.
[20] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. A. Segal, Predicting supreme court cases probabilistically: The search and seizure cases, 1962–1981, American Political Science Review 78 (1984) 891–900. doi:10.2307/1955796.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, CoRR abs/2010.02559 (2020). URL: https://arxiv.org/abs/2010.02559. arXiv:2010.02559.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , CoRR abs/1706.03762 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1706.03762. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanjay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <article-title>ILDC for CJPE: Indian legal documents corpus for court judgment prediction and explanation</article-title>
          , CoRR abs/2105.13562 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2105.13562. arXiv:2105.13562.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>CAIL2018: A large-scale legal dataset for judgment prediction</article-title>
          , CoRR abs/1807.02478 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1807.02478. arXiv:1807.02478.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <article-title>Neural legal judgment prediction in English</article-title>
          , CoRR abs/1906.02059 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1906.02059. arXiv:1906.02059.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Legal judgment prediction via topological learning</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>3540</fpage>
          -
          <lpage>3549</lpage>
          . URL: https://aclanthology.org/D18-1390. doi:10.18653/v1/D18-1390.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Learning to predict charges for criminal cases with legal basis</article-title>
          , CoRR abs/1707.09168 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1707.09168. arXiv:1707.09168.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Iteratively questioning and answering for interpretable legal judgment prediction</article-title>
          ,
          <source>in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020</source>
          , New York, NY, USA, February 7-12, 2020, AAAI Press,
          <year>2020</year>
          , pp.
          <fpage>1250</fpage>
          -
          <lpage>1257</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/5479.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Charge-based prison term prediction with deep gating network</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>