Effect of Hierarchical Domain-specific Language Models and Attention in the Classification of Decisions for Legal Cases

Nishchal Prasad*, Mohand Boughanem and Taoufiq Dkaki
Institut de Recherche en Informatique de Toulouse (IRIT), Toulouse, France

Abstract
In order to automate a judicial process, a model that accurately predicts the most probable decision of a legal case from its facts is desired. We explore this task of decision prediction on unannotated and unstructured large legal documents for which only the final decisions are available. For this task, we explored several deep learning architectures, including transformer-based language models (BERT, XLNet), a domain-specific language model (LEGAL-BERT), attention mechanisms, and sequence models (LSTM, GRU). With different combinations of these architectures and methods, we ran extensive experiments on an English legal dataset called ILDC and developed several hierarchical domain-specific language models, all of which improve the performance by at least 2 metric points, with the best among them giving an improvement of approximately 3 metric points over the previous baseline models on this dataset. This shows that domain-specific models, when fine-tuned, adapt well to a domain of the same nature but with a different syntax, lexicon, and grammar, and improve the performance significantly.

Keywords: LEGAL-BERT, Domain-specific Large Document Classification, Legal Case Prediction, Large Unstructured Documents

CIRCLE (Joint Conference of the Information Retrieval Communities in Europe), July 04–07, 2022, Samatan, Gers, France
* Corresponding author. Email: Nishchal.Prasad@irit.fr (N. Prasad); Mohand.Boughanem@irit.fr (M. Boughanem); Taoufiq.Dkaki@irit.fr (T. Dkaki). Homepage: https://www.irit.fr/~Mohand.Boughanem/ (M. Boughanem)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

A mechanism to assist judges and courts in reaching a conclusion on the outcome of an ongoing legal case has been sought after for many years [1]. One of the major milestones in developing such a robust mechanism for practical legal assistance is the prediction of court judgments in a real-life setting, i.e. predicting the most probable decision from only the previous case arguments and case facts. This can help to speed up the slow judicial process that plagues the judicial systems of many countries; one such example can be seen in the Indian judicial system (www.tribuneindia.com/news/archive/comment/backlog-of-cases-crippling-judiciary-776503). A solution to this problem of legal case decision prediction can also help to cut the cost of case proceedings for people unfamiliar with the intricacies of the judicial system and law articles, by giving useful decision results and insights into their legal cases. This would give the courts the required time and space to develop other branches of the judicial process and also de-congest tribunal court cases. Since legal documents are mostly language-oriented, in the form of complex legal texts, the task of decision prediction has been formulated as a text classification task. However, compared to a general text classification problem, legal decision prediction is more complex and sophisticated.
This is due to many factors, including the unstructured, unannotated, and noisy textual representation of legal case proceedings, which makes automatically extracting the arguments and facts from the case proceedings difficult. Legal text also differs from standard text in terms of lexical understanding, with a very specific vocabulary and a complex document structure, which requires adapting pre-trained models (trained on general text) to legal texts.

In this paper, we confront the problem of decision prediction from legal texts by developing deep learning methods. We aim to predict the final decision of a legal case from its facts and arguments in unannotated and unstructured legal documents, which replicates the real-life setting of legal case documents. We work only on the development of a robust predictor, while work on the explanation of the predictions is underway at the time of writing this paper. Although this is not a novel task in itself, it is our first step towards developing an architectural model for legal understanding and decision prediction. We have explored the effect of a domain-specific language model (LEGAL-BERT [2]) over general ones and provide the corresponding experimental results. While our work deals with legal texts in the English language, its findings can be leveraged and adapted to legal texts in any language, on the condition that there is a sufficiently large clean dataset in that language for the models and methods to be adapted (trained) on.

The main contributions of this paper are summarized below:
• Legal judgment prediction model: We propose a baseline model for legal judgment prediction which hierarchically builds upon a domain-specific BERT [3], known as LEGAL-BERT [2], and a two-layered Bi-LSTM with multi-head scaled dot-product attention [4], and which achieves significantly higher metric scores than the previous baseline models. The model is based on the hypothesis that a domain-specific pre-trained language model is transferable within the same domain. This hypothesis is also supported by the experimental results in the following sections.
• Experimental approaches: We have explored the ILDC dataset [5] and experimented with state-of-the-art architectures involving recurrent neural networks (GRU, LSTM, CNN), transformers (BERT, XLNet), and attention mechanisms on a dataset of large unstructured and unannotated legal documents.
• Evaluations: We performed extensive experiments on the ILDC dataset with different baseline models and improved upon their architectures to develop a final proposed baseline architecture, which achieves a significantly higher metric score on the task for which the previous baseline architectures were trained, showing that fine-tuning pre-trained domain-specific language models helps them adapt to, and better understand, a similar domain language with a different lexicon, grammar, and syntactic setting.

2. Related Work

Several studies using machine learning and deep learning methods have been conducted in the past on the problem of automatically predicting the outcome of a legal case, providing different approaches, methods, and corpora suited to individual prediction tasks. In 2018, Xiao et al. [6] released the Chinese AI and Law challenge dataset (CAIL2018) for legal judgment prediction, which contains rich annotations for the judgments of more than 2.6 million criminal cases.
This dataset contains detailed annotations of the law articles related to each case, the prison terms, and the charges. Chalkidis et al. [7] introduced a dataset built from the case proceedings of the European Court of Human Rights, in English, where each case has a score indicating its importance. They described a Legal Judgment Prediction (LJP) task for their dataset, which aims to predict the outcome of a legal case from the annotated case facts and law violations. For this task, they proposed a hierarchical version of BERT [3] to tackle BERT's limitation on the number of input tokens. Zhong et al. [8] proposed TOPJUDGE, which formulates the dependencies among the subtasks of legal judgment prediction through Directed Acyclic Graphs (DAGs), attending to the relations between the different subtasks through topological multi-task learning. Luo et al. [9] defined a charge prediction task from the case facts of a Chinese criminal case dataset and proposed an attention-based method to predict the charges along with the relevant law articles. Zhong et al. [10] proposed QAjudge, based on reinforcement learning, to predict the outcome of a legal case from the facts while visualizing the process, giving interpretable judgments. Chen et al. [11] proposed a Deep Gating Network (DGN) to predict the prison term for criminals based on the criminal charges and the case facts.

While much of this research focuses on legal case prediction in a specific setting (such as civil or criminal cases) with richly annotated cases providing good learning signals for decision classification, we focus on predicting the outcome of general legal cases from large unannotated and unstructured legal documents. Malik et al. [5] introduced a dataset named the Indian Legal Document Corpus (ILDC) and experimented on it to provide a baseline model for their Case Judgment Prediction and Explanation (CJPE) task, which achieves a macro-F1 score of 77.79% and an accuracy of 78% on the judgment prediction task. CJPE is similar to our task, while we aim to extend our work to French legal documents in the future, with more focus on clustering the case documents into their specific types. Because of the similarity of the ILDC dataset to the dataset requirements of our first task (predicting decisions from unstructured legal documents), we develop, experiment with, and evaluate our classification models on the ILDC dataset contributed by Malik et al. [5].

3. Methods

We formulate this task of legal judgment prediction as a text classification problem, stated as follows: for an unstructured legal case document C, predict its decision D between the two labels 'accepted' (= 1) and 'rejected' (= 0), given only the facts of the legal document. To move forward with the classification task we experimented with several deep learning architectures and methods, detailed hereafter.

3.1. Sequence-to-sequence RNN encoders:

We experimented with recurrent neural networks (RNNs) such as GRU [12] and LSTM [13], used bidirectionally [14] to process the sequence information in both the forward and backward directions. Since the ILDC dataset consists of large documents of variable length, each containing many sentences (tens of thousands of tokens in total), it is computationally complex and expensive to process and determine the embeddings of all individual words as a sequence of sequences (i.e. words in sentences in a document) for all documents. Instead, we encode the sentences of a document as sequences (i.e. a sequence of encoded/vectorized sentences). To encode the sentences in the documents we separately used two state-of-the-art pre-trained sentence encoders, namely the Universal Sentence Encoder [15] and S-BERT [16], trained on general texts. We divided the documents into chunks (with the idea that these chunks can be treated as a near estimate of the sentences in the documents) with overlaps to account for sentence breaks missed while chunking. These chunks are passed into the encoders to obtain chunk embeddings, and the chunk embeddings of a document are concatenated together for further processing. We used a Bi-LSTM (or Bi-GRU) with two layers and dropouts in between, followed by feed-forward layers for classification (Keras layers: www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional, www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM, www.tensorflow.org/api_docs/python/tf/keras/layers/GRU).
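As an illustration, a minimal sketch of this chunk-and-encode pipeline is given below (TensorFlow/Keras, assuming the Universal Sentence Encoder module available on TensorFlow Hub; the chunk length, overlap, and layer sizes are illustrative assumptions and not necessarily the exact values of our experiments, and padding of variable-length chunk sequences for batching is omitted):

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Assumption: the publicly available Universal Sentence Encoder module on TF Hub.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def chunk_document(text, chunk_len=100, overlap=20):
    """Split a document into overlapping word chunks (a rough stand-in for sentences)."""
    words = text.split()
    step = chunk_len - overlap
    return [" ".join(words[i:i + chunk_len]) for i in range(0, max(len(words) - overlap, 1), step)]

def encode_document(text):
    """Return a (num_chunks, embedding_dim) matrix of chunk embeddings."""
    chunks = chunk_document(text)
    return use(chunks).numpy()

# Example (hypothetical input text): vectors = encode_document("The appellant filed a petition ...")

embedding_dim = 512  # output dimension of this particular USE module

# Two-layer Bi-LSTM over the chunk-embedding sequence, followed by feed-forward layers.
inputs = tf.keras.Input(shape=(None, embedding_dim))
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True, activation="tanh"))(inputs)
x = tf.keras.layers.Dropout(0.01)(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, activation="tanh"))(x)
x = tf.keras.layers.Dense(30, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # accepted (1) vs rejected (0)
model = tf.keras.Model(inputs, outputs)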
3.2. Transformer Encoders:

Pre-trained transformer [4] encoders such as BERT [3] and XLNet [17] have shown significant improvements in language modeling and understanding, and can be adapted to downstream tasks by fine-tuning their internal weights or by pre-training them on a domain-specific corpus, either from scratch or from a previous pre-trained checkpoint. In our work we experimented with BERT-base (https://huggingface.co/docs/transformers/model_doc/bert) and XLNet-base (https://huggingface.co/docs/transformers/model_doc/xlnet), trained on general text, by fine-tuning them on the training set. We used max-pooling on the output of the final layer to obtain a document-level representation, used as input to a feed-forward network for classification. Since text in the legal domain has a specific lexicon and vocabulary and differs in syntax from general text, the sentence and document embeddings generated by models pre-trained on general text may not adapt properly to the domain-specific context. Hence we also tested this argument with a BERT model pre-trained on legal text, known as LEGAL-BERT [2]. The same architecture of max-pooling and a feed-forward network was used to compare against the results of BERT and XLNet, which can be found in Table 3.

A document is divided into smaller chunks with overlap (as in Section 3.1), each having 512 tokens including the [CLS] and [SEP] tokens [3]. These chunks are then passed into the tokenizer of the respective transformer encoder model, and from its output embeddings the [CLS] tokens are extracted and taken as the vectorized representations of the chunks. These are concatenated together to form the tokenized representation of the document, which is used as the input to the transformer encoder model.

3.3. Hierarchical Transformers (Transformer Encoder + RNN):

We used a hierarchical transformer method, taking the idea from [18]. The document is divided into chunks (with or without overlaps) of a fixed length of 512 tokens, including the [CLS] and [SEP] tokens. Each chunk is passed into the tokenizer to obtain the tokenized representation used as input to the respective transformer encoder model. The output of the last layer of the transformer encoder model is max-pooled to obtain the [CLS] representation of the chunk.
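A minimal sketch of this chunk-level representation extraction is given below (HuggingFace Transformers with TensorFlow; nlpaueb/legal-bert-base-uncased is assumed to be the publicly released LEGAL-BERT base checkpoint, the overlap value is illustrative, and here we simply take the [CLS] position of the last hidden layer as the chunk vector rather than reproducing the exact pooling of our experiments):

import numpy as np
from transformers import AutoTokenizer, TFAutoModel

# Assumption: the public LEGAL-BERT base checkpoint; in our experiments the encoder is
# first fine-tuned on ILDC_multi before extracting chunk representations.
MODEL_NAME = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = TFAutoModel.from_pretrained(MODEL_NAME)  # add from_pt=True if only PyTorch weights are available

def chunk_token_ids(text, chunk_len=512, overlap=100):
    """Split a document into overlapping 512-token chunks, each wrapped as [CLS] ... [SEP]."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    body = chunk_len - 2                      # leave room for [CLS] and [SEP]
    step = body - overlap
    chunks = []
    for i in range(0, max(len(ids) - overlap, 1), step):
        piece = ids[i:i + body]
        chunks.append([tokenizer.cls_token_id] + piece + [tokenizer.sep_token_id])
    return chunks

def chunk_cls_vectors(text):
    """Return one [CLS] vector (768-d for a BERT-base encoder) per chunk of the document."""
    vectors = []
    for ids in chunk_token_ids(text):
        out = encoder(np.array([ids]))                           # batch of one chunk
        vectors.append(out.last_hidden_state[0, 0, :].numpy())   # [CLS] position of the last layer
    return np.stack(vectors)                                     # shape: (num_chunks, 768)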
Each of these [CLS] representations is accumulated to form a new sequence, used as the embedding for further processing with sequence encoder layers (Bi-GRU, Bi-LSTM, etc.) for classification. The details of the model architectures for the hierarchical transformers can be seen in Table 1. LEGAL-BERT fine-tuned on ILDC_multi is used to extract the [CLS] representations, owing to its better performance compared to the other transformer architectures (Table 3). It can also be argued that even though LEGAL-BERT is pre-trained on US/EU legal texts and not on Indian legal texts (which differ in lexicon and syntax) [5], its fine-tuned model can be adapted to the respective setting in the same way as other pre-trained models trained on general texts are used (with fine-tuning) for domain-specific downstream tasks (as can be seen in the experimental results in Table 3). In general, we experimented with two types of setup in this architecture (an illustrative sketch of the attention-based setup is given after this list):
• Without attention: The accumulated [CLS] vectors are taken as embedding inputs to the sequence models used in Section 3.1, which consist of the general setup of two layers of either Bi-GRU, Bi-LSTM, or their combination. Dropouts were also introduced between the bidirectional layers to increase randomization and prevent overfitting.
• With attention: Dot-product attention [19] was used with the Bi-LSTM (layer 2) output as the query and key-value pair. Multi-head scaled dot-product attention [4] over the Bi-LSTM outputs was also used, with the following combinations of query and key-value pairs:
  – The accumulated [CLS] representations of a document as the query, and the output of Bi-LSTM (layer 1) as the key-value pair.
  – The output of Bi-LSTM (layer 1) as the query, and the Bi-LSTM (layer 2) output as the key-value pair.
  – The Bi-LSTM (layer 2) output as both the query and the key-value pair.
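A minimal sketch of this hierarchical classification head in TensorFlow/Keras follows. It assumes the per-chunk [CLS] vectors have already been extracted (as above) and roughly corresponds to the multi-head-attention variant whose query, key, and value are all the layer-2 Bi-LSTM output; layer sizes follow Table 1, key_dim is an illustrative assumption, and masking/padding of variable-length chunk sequences is omitted:

import tensorflow as tf

embedding_dim = 768  # dimension of the [CLS] vectors from the chunk encoder

# Input: a (variable-length) sequence of chunk [CLS] vectors for one document.
cls_seq = tf.keras.Input(shape=(None, embedding_dim))

# Two stacked Bi-LSTM layers with dropout in between (cf. Table 1: n_L = 2, 100 units, tanh).
l1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True, activation="tanh"))(cls_seq)
l1 = tf.keras.layers.Dropout(0.01)(l1)
l2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True, activation="tanh"))(l1)

# Multi-head scaled dot-product attention with Q = K = V = layer-2 output (h = 16 heads).
mha = tf.keras.layers.MultiHeadAttention(num_heads=16, key_dim=64)(query=l2, value=l2, key=l2)

# Max-pool the attention output and the Bi-LSTM output over the chunk axis, then concatenate.
pooled = tf.keras.layers.Concatenate()([
    tf.keras.layers.GlobalMaxPooling1D()(mha),
    tf.keras.layers.GlobalMaxPooling1D()(l2),
])

# Feed-forward classifier (30 ReLU units, sigmoid output for accepted/rejected).
hidden = tf.keras.layers.Dense(30, activation="relu")(pooled)
decision = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

hierarchical_model = tf.keras.Model(cls_seq, decision)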
4. Experimental Setup and Hyperparameters

For all the experiments and architecture development we used the TensorFlow framework (www.tensorflow.org) together with the pandas (https://pandas.pydata.org/) and NumPy (https://numpy.org/) libraries. The pre-trained transformer models were taken from the HuggingFace library (https://huggingface.co/). The experiments were run on Colab (https://colab.research.google.com/) with an Nvidia (https://www.nvidia.com/) Tesla P100 (16 GB) GPU.

For all the experiments, sigmoid activation was used for classification in the last layer. The ReLU activation function was chosen for the hidden feed-forward layers, while the sequence models use the tanh activation function. Adam [20] was used as the optimization algorithm for training. As this is a binary classification problem, we use binary cross-entropy as the loss function. To train the models we reduce the learning rate by a factor of 0.95 based on the updates of the monitored metric, with a patience of two epochs (https://keras.io/api/callbacks/reduce_lr_on_plateau/). All the transformer models of Section 3.2 were fine-tuned for two epochs with a batch size of 10 documents. The hierarchical transformer architectures of Section 3.3 were trained with a batch size of 32 documents. Architecture-specific details and other hyper-parameters can be found in Table 1.

Table 1: Architectural and hyper-parametric details of the models.
Notation: e = number of epochs, h = number of attention heads, eD = embedding dimension, Q/K/V = query/key/value, nL = number of RNN layers, L^i_o = output of the i-th RNN layer, L_a = activation of the RNN layers, L^i_u = units in the i-th RNN layer, nF = number of feed-forward layers, F^i_d = dimension of the i-th feed-forward layer, a^i_f = activation function of the i-th feed-forward layer, concat = concatenate, drop(p) = Dropout with rate p.

Sequence-to-sequence RNN encoders (train set = ILDC_single, ILDC_multi); common settings: nL = 2, L_a = tanh, L^1_u = L^2_u = 100, nF = 2, F^1_d = 30, F^2_d = 1, a^1_f = ReLU, a^2_f = sigmoid (for classification):
Universal Sentence Encoder + BiLSTM: eD = 768, e = 3
Universal Sentence Encoder + BiGRU + Dropout(0.01): eD = 768, e = 6
S-BERT embeddings + BiLSTM: eD = 384, e = 3
S-BERT embeddings + BiLSTM + Dropout(0.01): eD = 384, e = 6

Pre-trained transformer encoders (train set = ILDC_multi); common settings: nF = 1, a^1_f = sigmoid (for classification):
BERT + max-pooled BERT output + feed-forward: e = 2
XLNet + max-pooled XLNet output + feed-forward: e = 2
LEGAL-BERT + max-pooled LEGAL-BERT output + feed-forward: e = 2

Hierarchical transformers over fine-tuned LEGAL-BERT (train set = ILDC_single, ILDC_multi); common settings: nL = 2, L_a = tanh, L^1_u = L^2_u = 100, nF = 2, F^1_d = 30, F^2_d = 1, a^1_f = ReLU, a^2_f = sigmoid (for classification):
Bi-GRU: e = 3
Bi-GRU: e = 10
Bi-LSTM + Bi-GRU: e = 6
Bi-LSTM: e = 6
Bi-LSTM + Dropout(0.01): e = 6; e = 8 for ILDC_single
Bi-LSTM + Dropout + Dot-product attention: Q, K, V = L^2_o; e = 6
Bi-LSTM + Dropout + Multi-head attention α (MHA): Q, K, V = L^2_o; h = 16; e = 6; concat(max-pool(MHA output), L^2_o) to the feed-forward network
Bi-LSTM + Dropout + Multi-head attention β: Q = L^2_o; K, V = drop(0.01)(L^2_o); h = 16; e = 6; concat(max-pool(MHA output), L^2_o) to the feed-forward network
Bi-LSTM + Dropout + Multi-head attention γ: Q = [CLS] representations; K, V = L^2_o; h = 16; e = 6; concat(max-pool(MHA output), drop(0.2)(L^2_o)) to the feed-forward network
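As an illustration, the training configuration described in this section can be sketched in Keras as follows (a minimal runnable sketch: the stand-in model, the monitored quantity val_loss, and the dummy arrays are assumptions, not the exact experimental setup):

import numpy as np
import tensorflow as tf

# A stand-in for one of the architectures above (e.g. the hierarchical model of Section 3.3);
# a fixed number of 768-d chunk vectors per document is assumed here for simplicity.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 768)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy",       # binary classification: accepted vs rejected
              metrics=["accuracy"])

# Reduce the learning rate by a factor of 0.95 on a plateau of the monitored metric,
# with a patience of two epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.95, patience=2)

# Dummy data only to make the sketch runnable; the real inputs are the documents' chunk vectors and labels.
x_train, y_train = np.random.rand(64, 50, 768), np.random.randint(0, 2, 64)
x_val, y_val = np.random.rand(16, 50, 768), np.random.randint(0, 2, 16)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=32,                        # 32 documents for the hierarchical models (10 in Section 3.2)
          epochs=6,                             # per-model epoch counts are given in Table 1
          callbacks=[reduce_lr])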
5. Dataset description

We used the dataset introduced by Malik et al. [5], which contains case proceedings from the Supreme Court of India (the dataset can be requested from its original authors [5]; we do not have the rights to circulate it). Whether a claim filed by the appellant in the Supreme Court of India is 'accepted' or 'rejected' is decided by the court, and this decision is taken as the label of the respective legal case document in the dataset. These labels are used to train the models/architectures in the experiments. The dataset has two parts, ILDC_single and ILDC_multi. ILDC_single consists of those case proceedings for which there is a single decision for a petition, or the same decision across all the multiple petitions. The documents in ILDC_multi cover the more common case of proceedings that involve multiple petitions with different decisions. The labeling of the documents in ILDC_multi is taken as provided (computing multiple decisions for multiple petitions being computationally complex and expensive): the label is set to the 'accepted' class if a single petition among the multiple appeals is 'accepted', and to the 'rejected' class otherwise. The dataset statistics are given in Table 2. We use the same training, validation, and test subsets as provided by the authors, to maintain consistency in the experimental results and to compare on the same test cases across all the experiments for decision classification.

Table 2: ILDC statistics describing the dataset split and imbalance (accepted : rejected cases; label 0 = rejected, 1 = accepted).
ILDC_single (7593 cases): Train 1935 : 3147; Validation 497 : 497; Test 762 : 755
ILDC_multi (34816 cases): Train 13385 : 18920; Validation 497 : 497; Test 762 : 755

6. Results and discussion

To measure model performance we used the macro-precision, macro-recall, and macro-F1 scores as our performance metrics, so that the results are comparable with the previous models on the same dataset. In Table 3 we omit the results of the pre-trained transformer models trained on ILDC_single, since we only use the pre-trained transformer models fine-tuned on ILDC_multi for the further development of the hierarchical models. Also, since the number of training instances is much smaller in ILDC_single, the fine-tuned transformer models learn less from it than from ILDC_multi.

Table 3: Experimental results of the legal case text classification task with different models (metric values ×100 give percentages). Each row reports Accuracy / Macro-F1 / Macro-Recall / Macro-Precision.

Sequence-to-sequence RNN encoders (train set = ILDC_single):
Universal Sentence Encoder + BiLSTM: 0.5679 / 0.5731 / 0.5686 / 0.5778
Universal Sentence Encoder + BiGRU + Dropout(0.01): 0.5712 / 0.5748 / 0.5718 / 0.5779
S-BERT embeddings + BiLSTM: 0.5528 / 0.5603 / 0.5536 / 0.5672
S-BERT embeddings + BiLSTM + Dropout(0.01): 0.5528 / 0.5604 / 0.5537 / 0.5673

Sequence-to-sequence RNN encoders (train set = ILDC_multi):
Universal Sentence Encoder + BiLSTM: 0.5574 / 0.5858 / 0.5779 / 0.5939
Universal Sentence Encoder + BiGRU + Dropout(0.01): 0.5547 / 0.5606 / 0.5591 / 0.5622
S-BERT embeddings + BiLSTM: 0.56 / 0.5567 / 0.5558 / 0.5578
S-BERT embeddings + BiLSTM + Dropout(0.01): 0.59 / 0.5893 / 0.5869 / 0.5918

Pre-trained transformer encoders (train set = ILDC_multi):
BERT: 0.6052 / 0.6322 / 0.6055 / 0.6613
XLNet: 0.7051 / 0.7103 / 0.7009 / 0.7201
LEGAL-BERT: 0.7383 / 0.7382 / 0.7384 / 0.7390

Hierarchical transformers over fine-tuned LEGAL-BERT (train set = ILDC_single):
Bi-GRU: 0.7961 / 0.8033 / 0.7966 / 0.8101
Bi-GRU: 0.8001 / 0.8043 / 0.8004 / 0.8082
Bi-LSTM + Bi-GRU: 0.7744 / 0.8029 / 0.8016 / 0.8041
Bi-LSTM: 0.80 / 0.8060 / 0.8018 / 0.8103
Bi-LSTM + Dropout(0.01), e = 6: 0.79 / 0.7964 / 0.7881 / 0.8051
Bi-LSTM + Dropout(0.01), e = 8: 0.81 / 0.8084 / 0.8063 / 0.8106
Bi-LSTM + Dropout + Dot-product attention: 0.79 / 0.7970 / 0.7893 / 0.8048
Bi-LSTM + Dropout + Multi-head attention α: 0.80 / 0.8076 / 0.8031 / 0.8123
Bi-LSTM + Dropout + Multi-head attention β: 0.81 / 0.8125 / 0.8090 / 0.8160
Bi-LSTM + Dropout + Multi-head attention γ: 0.80 / 0.8069 / 0.8043 / 0.8095

Hierarchical transformers over fine-tuned LEGAL-BERT (train set = ILDC_multi):
Bi-GRU: 0.7915 / 0.7916 / 0.7916 / 0.7916
Bi-GRU: 0.7935 / 0.7943 / 0.7934 / 0.7953
Bi-LSTM + Bi-GRU: 0.8080 / 0.8015 / 0.7932 / 0.7981
Bi-LSTM: 0.80 / 0.8010 / 0.7999 / 0.8021
Bi-LSTM + Dropout(0.01): 0.80 / 0.8035 / 0.8019 / 0.8052
Bi-LSTM + Dropout + Dot-product attention: 0.80 / 0.8007 / 0.7986 / 0.7997
Bi-LSTM + Dropout + Multi-head attention α: 0.80 / 0.8002 / 0.7993 / 0.7998
Bi-LSTM + Dropout + Multi-head attention β: 0.81 / 0.8070 / 0.8066 / 0.8073
Bi-LSTM + Dropout + Multi-head attention γ: 0.80 / 0.7984 / 0.7967 / 0.7975
As can be seen in Table 3, the sequence models with the pre-trained encoders (Universal Sentence Encoder and S-BERT) perform poorly on all the performance metrics. This can be explained by the fact that these encoders are not fine-tuned during the model training process, and that their embeddings are aligned to general texts rather than to the domain-specific legal texts. Even so, the embeddings from the Universal Sentence Encoder give slightly better performance than the S-BERT embeddings on both ILDC_single and ILDC_multi, without any architectural modifications (i.e. dropouts) to the baseline RNN layers.

The pre-trained transformer models trained on general English texts improve the metric scores, with BERT achieving an F1 score of 0.6322 and XLNet an F1 score of 0.7103, while the domain-specific LEGAL-BERT model (pre-trained on legal texts) gives the best results, an increase of ≈ 4% over the XLNet model. These improvements led us to choose LEGAL-BERT as the base layer of our hierarchical transformer models.

Bi-GRU over LEGAL-BERT was taken as the baseline model, and it shows a significant performance improvement over the previous models experimented with on this dataset [5], as can be seen in Table 3. With Bi-LSTM there is a slight improvement in the metric scores on both ILDC_single and ILDC_multi. Adding dropouts over the Bi-LSTM layers results in a decrease in performance on ILDC_single for the same number of epochs (= 6) but an improvement on ILDC_multi. Since ILDC_single is a small set compared to ILDC_multi, adding dropouts slows down the model's convergence to the optimal decision boundary; hence we trained this model for two more epochs, which improved the F1 score to 0.8084 on ILDC_single (Table 3).

There was a marginal decrease in the metrics when using dot-product attention, while using multi-head attention (with the query and key-value combinations shown in Table 1) resulted in slight performance improvements for the hierarchical transformer model, to F1 scores of 0.8070 and 0.8125 on the test sets of ILDC_multi and ILDC_single respectively. This shows that the dot-product attention and multi-head scaled dot-product attention mechanisms used here do not improve the performance significantly. This can be attributed to the fact that the [CLS] embeddings used for the sequence models in the hierarchical transformer already contain the representations learnt by the internal multi-head attention of the transformer architecture. Whether other, novel attention mechanisms improve the performance of the hierarchical transformers is yet to be explored.
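For reference, the macro-averaged metrics reported in Table 3 can be computed as in the following minimal sketch (assuming scikit-learn; the label and prediction arrays are hypothetical):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical gold labels and model predictions (0 = rejected, 1 = accepted).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print("Accuracy:       ", accuracy_score(y_true, y_pred))
print("Macro-Precision:", precision_score(y_true, y_pred, average="macro"))
print("Macro-Recall:   ", recall_score(y_true, y_pred, average="macro"))
print("Macro-F1:       ", f1_score(y_true, y_pred, average="macro"))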
7. Conclusion

In this paper, we have explored the problem of decision classification for large unstructured and unannotated legal documents, formulated as the prediction of the decision of a legal case document in a real-life scenario. To experiment with our models we used the ILDC dataset. We explored various state-of-the-art pre-trained language models (BERT [3], XLNet [17], LEGAL-BERT [2]), attention mechanisms, and sequence models (LSTM, GRU) for the decision prediction task on this dataset. Based on their performance, we developed several baseline hierarchical domain-specific transformer models which significantly improve on the performance metrics of the previous models trained on the ILDC dataset. Our experiments show that LEGAL-BERT (a pre-trained domain-specific language model trained on legal texts of European Union and United States court proceedings, each with their own specific legal terms, syntax, and grammar), when fine-tuned on the legal case texts of the Supreme Court of India, adapts well to the grammar, lexicon, and syntax of the Indian legal system. This finding shows that domain-specific pre-trained language models can adapt well to the same domain in a different language setting (syntax, grammar, lexicon). We aim to leverage this work for the prediction and classification of French legal cases in the future.

Acknowledgments

This work was supported by the LawBot project, granted by the ANR, the French Agence Nationale de la Recherche.

References

[1] J. A. Segal, Predicting Supreme Court cases probabilistically: The search and seizure cases, 1962–1981, American Political Science Review 78 (1984) 891–900. doi:10.2307/1955796.
[2] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, CoRR abs/2010.02559 (2020). URL: https://arxiv.org/abs/2010.02559.
[3] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). URL: http://arxiv.org/abs/1706.03762.
[5] V. Malik, R. Sanjay, S. K. Nigam, K. Ghosh, S. K. Guha, A. Bhattacharya, A. Modi, ILDC for CJPE: Indian Legal Documents Corpus for court judgment prediction and explanation, CoRR abs/2105.13562 (2021). URL: https://arxiv.org/abs/2105.13562.
[6] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, J. Xu, CAIL2018: A large-scale legal dataset for judgment prediction, CoRR abs/1807.02478 (2018). URL: http://arxiv.org/abs/1807.02478.
[7] I. Chalkidis, I. Androutsopoulos, N. Aletras, Neural legal judgment prediction in English, CoRR abs/1906.02059 (2019). URL: http://arxiv.org/abs/1906.02059.
[8] H. Zhong, Z. Guo, C. Tu, C. Xiao, Z. Liu, M. Sun, Legal judgment prediction via topological learning, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3540–3549. URL: https://aclanthology.org/D18-1390. doi:10.18653/v1/D18-1390.
[9] B. Luo, Y. Feng, J. Xu, X. Zhang, D. Zhao, Learning to predict charges for criminal cases with legal basis, CoRR abs/1707.09168 (2017). URL: http://arxiv.org/abs/1707.09168.
[10] H. Zhong, Y. Wang, C. Tu, T. Zhang, Z. Liu, M. Sun, Iteratively questioning and answering for interpretable legal judgment prediction, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, AAAI Press, 2020, pp. 1250–1257. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5479.
[11] H. Chen, D. Cai, W. Dai, Z. Dai, Y. Ding, Charge-based prison term prediction with deep gating network, CoRR abs/1908.11521 (2019). URL: http://arxiv.org/abs/1908.11521.
[12] K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, CoRR abs/1409.1259 (2014). URL: http://arxiv.org/abs/1409.1259.
[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.
[14] M. Schuster, K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673–2681. doi:10.1109/78.650093.
[15] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, R. Kurzweil, Universal sentence encoder, 2018. arXiv:1803.11175.
[16] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084.
[17] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, CoRR abs/1906.08237 (2019). URL: http://arxiv.org/abs/1906.08237.
[18] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical transformers for long document classification, CoRR abs/1910.10781 (2019). URL: http://arxiv.org/abs/1910.10781.
[19] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1409.0473.
[20] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.