Automatic Detection of Rhetorical Role Labels using ERNIE 2.0 and RoBERTa

Guneet Singh Kohli1, PrabSimran Kaur1 and Jatin Bedi1
1 Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, Punjab, India - 147001.
gkohli_be18@thapar.edu (G. S. Kohli); pkaur_be18@thapar.edu (P. Kaur); jatin.bedi@thapar.edu (J. Bedi)

Forum for Information Retrieval Evaluation, December 13-17, 2021

Abstract

Automatic detection of the rhetorical roles of sentences in a legal case judgment can help in numerous tasks such as summarizing legal decisions and legal search, which makes this problem an active field of research. Legal case documents, however, are usually not well-structured, which makes the task challenging. In this paper, we propose a multi-class text classification approach to rhetorical role labeling of legal judgments for Task 2 of the track 'Artificial Intelligence for Legal Assistance' presented by the Forum for Information Retrieval Evaluation in 2021. Our methodology is as follows: (i) we use ERNIE 2.0 token embeddings, which better capture the lexical, syntactic, and semantic aspects of the information in the training data, and (ii) we apply a single attention mechanism to capture long-range relations. The overall F1 score, precision, and recall are 0.505, 0.465, and 0.591 respectively, ranking third among all submitted teams. We make our code publicly available on GitHub (https://github.com/guneetsk99/AILA2021).

Keywords
Text classification, Rhetorical role labeling, ERNIE 2.0, Transformer, AILA

1. Introduction

In the legal context, "rhetorical labeling" of a sentence refers to understanding the semantic function associated with it. Legal case documents have a standard structure: facts of the case, ruling by the lower court, issues being discussed, arguments of the parties, the final judgment of the present court, and so on. Distinguishing these rhetorical roles of sentences in a legal case document can improve the document's readability and support various downstream tasks such as semantic similarity, text summarization, and case law analysis. However, lack of structure, high specificity, technical vernacular, and multiple themes make it difficult even for human experts to identify the rhetorical roles; thus, rhetorical labeling is an extremely challenging NLP task. Prior attempts at this task relied on hand-crafted features, such as the sequential order of labels or linguistic cues that indicate rhetorical roles. Bhattacharya et al. proposed deep neural network models (a hierarchical BiLSTM and a hierarchical BiLSTM-CRF) that outperform the hand-crafted feature approaches by automatically learning features from pre-trained legal embeddings. Task 2 of the 'Artificial Intelligence for Legal Assistance' (AILA) track at FIRE 2021 [1] focuses on classifying these roles. In this work, we attempt to assign every sentence in a document one of seven labels: Facts, Ruling by Lower Court, Argument, Statute, Precedent, Ratio of the decision, and Ruling by Present Court.
Our team made the following contributions to this problem as part of the shared task effort:

• We use ERNIE 2.0 token embeddings, which better capture the lexical, syntactic, and semantic aspects of the information in the training data.
• We perform single attention learning to capture long-range relations.

2. Related Work

There have been various attempts at the automatic identification of rhetorical roles. Initial work focused on understanding the rhetorical roles in case documents in order to summarize them. Later, the focus shifted to applying techniques on handcrafted features for segmenting a document into functional and issue-specific parts. For instance, [2] used Conditional Random Fields (CRFs) to classify documents into seven rhetorical roles to produce effective summaries. [3] investigated segmenting U.S. court decisions into functional and issue-specific parts using CRFs with handcrafted features. [4] proposed a skip-gram model for identifying factual and non-factual sentences using a classifier from the fastText library. In another line of work, [5] compared rule-based scripts with machine learning approaches. Unlike these works, which relied on handcrafted features to identify rhetorical roles in the legal domain, [6] used a deep learning approach that requires no handcrafted features: a hierarchical BiLSTM model that performs much better at this task. [6] also produced a fully annotated dataset of 53,210 documents collected from Westlaw India (http://www.westlawindia.com). Later, more deep learning approaches were applied to this data. For instance, [7] passed RoBERTa embeddings through a neural network model for classification, and [8] used a BERT model for the same purpose.

3. Dataset

The dataset provided by AILA 2021 contained 60 legal case documents. The training set included 50 annotated legal text documents comprising 9,380 sentences, and the test set included 10 annotated legal text documents comprising 1,905 sentences, as shown in Table 1. Every sentence in the documents was assigned one of the 7 rhetorical roles, explained below.

Table 1: Proportion of each label

Label category             Train count   Train %   Test count   Test %
Ratio of the decision      3624          38.64%    587          30.81%
Facts                      2219          23.66%    403          21.15%
Precedent                  1468          15.65%    319          16.75%
Argument                   845           9.00%     256          13.44%
Statute                    646           6.89%     167          8.77%
Ruling by Lower Court      316           3.37%     94           4.93%
Ruling by Present Court    262           2.79%     79           4.15%
Total                      9380          100%      1905         100%

1. Facts (abbreviated as FAC): These sentences contain information on what led to the filing of the case and how it evolved in the legal system (e.g., a First Information Report at a police station, filing an appeal).
2. Ruling by Lower Court (abbreviated as RLC): Since the cases in the dataset are from the Supreme Court, they include preliminary rulings by the lower courts (Tribunal, High Court, etc.). These sentences correspond to the verdicts given by the lower courts.
3. Argument (abbreviated as ARG): These sentences contain the court's discussion of the arguments presented by the opposing parties.
4. Statute (abbreviated as STA): Established law cited from various sources.
5. Precedent (abbreviated as PRE): Relevant precedents cited; these are similar to the statute citations.
6. Ratio of the decision (abbreviated as Ratio): These sentences denote the rationale/reasoning given by the Supreme Court for applying a legal principle to the legal issue (final judgment).
7. Ruling by Present Court (abbreviated as RPC): These sentences denote the final decision given by the Supreme Court for that case document.

4. Methodology

Legal case documents are usually lengthy and unstructured, full of legal jargon, and often lack headings, which makes them difficult to read. It becomes tedious for a reader to locate components such as the facts that led to the filing of the case, the arguments presented by the contending parties, the statutes cited, and other similar categories related to the legal proceedings. The task of semantic/thematic segmentation, also known as rhetorical role labelling of sentences, is therefore important. It not only enhances the readability of a document but also supports several downstream tasks such as summarization, case law analysis, and semantic search, thus increasing the usefulness of the data and opening up various possibilities.

Our methodology focuses on understanding the semantic relations between sentence tokens that point to the legal concepts underlying each label. The overall task was treated as a sentence classification problem with seven labels to predict. We employed dedicated pre-processing techniques for the efficient handling of sentences with fewer than ten tokens, integrating existing knowledge derived from the data with observations made during exploratory data analysis (EDA) of the text. This pre-processing, combined with the removal of stop words and the application of inflectional stemming, improved the accuracy of the information retrieval system. The processed text was then passed through a complete deep learning transformer pipeline to derive accurate contextual representations of the text and map them to the corresponding labels in the data.

4.1. Data Preparation

Properly cleaned data is essential for correct text analysis, so unwanted noise must be removed before feeding the text into the model. We therefore performed simple preprocessing: tokenization, stopword removal, and lemmatization. The sentences were first split into smaller pieces, or "tokens". The data was then cleaned by removing common words (stopwords) such as "we" and "are", which do not help text classification. Finally, the words were lemmatized to obtain their lemma, or base form. Additionally, a surprising observation was made regarding the labels of sentences with fewer than ten words: such sentences carried the same label as the preceding sentence. Exploiting this pattern made prediction easier for short sentences, making our approach robust to sentences that lack rich semantic content.
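As an illustration, the following is a minimal sketch of the pre-processing described above, assuming the NLTK library; the function names (clean_sentence, propagate_short_labels) are ours, and the exact stopword list and lemmatizer used in our runs may differ.

```python
# Minimal sketch of the Section 4.1 pre-processing, assuming NLTK.
# Function names are ours; the exact tooling used in the runs may differ.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_sentence(sentence: str) -> str:
    """Tokenize, drop stopwords, and lemmatize one sentence."""
    tokens = word_tokenize(sentence.lower())
    kept = [LEMMATIZER.lemmatize(t) for t in tokens
            if t.isalpha() and t not in STOPWORDS]
    return " ".join(kept)

def propagate_short_labels(sentences, labels, min_len=10):
    """Sentences shorter than `min_len` tokens inherit the label of the
    preceding sentence, per the observation in Section 4.1."""
    out = list(labels)
    for i, sent in enumerate(sentences):
        if i > 0 and len(sent.split()) < min_len:
            out[i] = out[i - 1]
    return out
```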
4.2. Modelling

RoBERTa: RoBERTa [9] retrains BERT with an improved methodology, much more data, larger batch sizes, and longer training times. RoBERTa modifies BERT's training strategy by removing the NSP objective, and it uses byte-pair encoding (BPE) as the tokenization algorithm instead of BERT's WordPiece tokenization.

BERT: BERT [10] is a bidirectional language model that learns contextual relations between words using the transformer architecture. We use an official release of the pre-trained models; the specific hyperparameters are given in Section 5. The input to BERT is either a single text (a sentence or document) or a text pair. The first token of each sequence is the special classification token [CLS], followed by the WordPiece tokens of the first text A, then a separator token [SEP], and (optionally) the WordPiece tokens of a second text B. In addition to token embeddings, BERT uses positional embeddings to represent the position of tokens in the sequence. For pre-training, BERT applies the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives. In MLM, BERT randomly masks 15% of the input tokens and trains the model to predict them.

ERNIE 2.0 Transformer Encoder: The model [11] uses a multi-layer Transformer [12] as its basic encoder, like other pre-training models such as GPT [13] and BERT [10]. The transformer captures the contextual information of each token in the sequence via self-attention and generates a sequence of contextual embeddings. Given a sequence, the special classification embedding [CLS] is added at the first position, and the [SEP] symbol is added as a separator between segments for multi-segment input tasks.

Task Embedding: The model feeds in a task embedding to represent the characteristics of different tasks. Tasks are represented by ids ranging from 0 to N, and each task id is assigned one unique task embedding. The corresponding token, segment, position, and task embeddings are taken as the input of the model. Any task id can be used to initialize the model in the fine-tuning process.

In the present work, we implemented the ERNIE 2.0 model to carry out the task of sentence label classification; ERNIE 2.0 has been shown to outperform BERT and the more recent XLNet on 16 NLP tasks in Chinese and English. The base model contains 12 layers, 12 self-attention heads, and a hidden size of 768, while the large model contains 24 layers, 16 self-attention heads, and a hidden size of 1024; these model settings are the same as those of BERT. The transformer employed for prediction used ERNIE 2.0 pre-trained token embeddings, whose stronger contextual dependencies helped mitigate deviation between the text and its output label. A single attention mechanism was applied to the token embeddings to better capture the hidden relationships that help determine the output labels in a more optimized and accurate way. The same pipeline was also tested with RoBERTa (base), LawBERT, and BERT-base-uncased; however, the ERNIE 2.0 architecture, together with the embeddings it generates, performed best in our case. Per-label results for the two submitted runs are shown in Tables 2 and 3.

Table 2: Submission RUN1 (ERNIE 2.0)

Label                      Precision   Recall   F-score
Argument                   0.644       0.744    0.691
Facts                      0.622       0.565    0.592
Precedent                  0.281       0.582    0.379
Ratio_of_the_decision      0.708       0.545    0.616
Ruling_by_Lower_Court      0.024       0.067    0.036
Ruling_by_Present_Court    0.435       0.769    0.556
Statute                    0.542       0.867    0.667
Overall                    0.465       0.591    0.505

Table 3: Submission RUN2 (RoBERTa)

Label                      Precision   Recall   F-score
Argument                   0.558       0.744    0.637
Facts                      0.616       0.565    0.590
Precedent                  0.279       0.612    0.383
Ratio_of_the_decision      0.720       0.517    0.602
Ruling_by_Lower_Court      0.043       0.133    0.065
Ruling_by_Present_Court    0.413       0.731    0.528
Statute                    0.522       0.800    0.632
Overall                    0.450       0.586    0.491
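As an illustration, the sketch below shows how such a sequence classification pipeline can be assembled with the Hugging Face transformers library. The checkpoint name nghuyong/ernie-2.0-base-en is a community English release of ERNIE 2.0 and an assumption on our part (the paper does not name a specific checkpoint); the training loop is reduced to a single toy step.

```python
# Sketch of an ERNIE 2.0 fine-tuning pipeline for rhetorical role labeling,
# assuming Hugging Face `transformers` and the community checkpoint
# "nghuyong/ernie-2.0-base-en" (the paper does not name its checkpoint).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["FAC", "RLC", "ARG", "STA", "PRE", "Ratio", "RPC"]

tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-2.0-base-en")
model = AutoModelForSequenceClassification.from_pretrained(
    "nghuyong/ernie-2.0-base-en", num_labels=len(LABELS))

def encode(sentences):
    # Token length capped at 250, the best value found in Table 4.
    return tokenizer(sentences, truncation=True, max_length=250,
                     padding="max_length", return_tensors="pt")

# One training step on a toy batch; a real run uses a DataLoader,
# 15 epochs, and a GPU, as reported in Section 5.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = encode(["The appellant filed a first information report."])
labels = torch.tensor([LABELS.index("FAC")])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

The same script works for the RoBERTa run (Run 2) by swapping the checkpoint name for a RoBERTa base model.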
5. Experimentation and Results

The data was tested using the proposed methodology; the two submitted runs differ only in the base model used, which is ERNIE 2.0 for Run 1 and RoBERTa for Run 2. After a thorough analysis of the data processed by our proposed scheme, the best run proved to be the ERNIE 2.0-based sequence classification pipeline built upon the ERNIE 2.0 pre-trained token embeddings, which are known to capture contextual understanding of the English language better than existing models such as BERT, RoBERTa, and XLNet.

Figure 1: Visualization of comparison results: per-label precision, recall, and F-score for (a) ERNIE 2.0 (Run 1) and (b) RoBERTa (Run 2).

Table 4: Hyperparameter tuning of token length

Token length   Run 1 F1 (validation)   Run 2 F1 (validation)
200            0.684                   0.677
250            0.716                   0.704
300            0.653                   0.671

Table 5: Validation of proposed methodology

Pre-processing applied   Run 1 F1 (validation)   Run 2 F1 (validation)
Yes                      0.512                   0.501
No                       0.455                   0.478

In reference to Tables 2 and 3, the F1 scores for Argument (ARG), Facts (FAC), Ratio of the decision (Ratio), and Ruling by Present Court (RPC) all exceed 0.5, which indicates the ability of our methodology to capture the underlying meanings of these labels. On closely observing the results of the submitted runs (shown in Figure 1), we conclude that ERNIE 2.0 establishes itself as the better analyzer of legal sentences. RoBERTa gives a better score on Precedent and Ruling by Lower Court; on the other labels, however, Run 1 is the better approach. Looking at each label separately, the improvement from RoBERTa (Run 2) to ERNIE 2.0 (Run 1) can be attributed to ERNIE's architecture, whose 12 self-attention heads proved more robust to the categories of data the model encountered. Run 1 achieves an overall F1 score of 0.505, a direct improvement of about 3% over Run 2 (0.491).

In reference to Table 4, the token length was set at 250 for Run 1 and 300 for Run 2; the final outcome showed ERNIE with a token length of 250 to be better. Table 5 further validates the proposed pre-processing, which improved the validation F1 score by roughly 0.05. Both models were trained for 15 epochs on a Tesla P100-PCIE-16GB GPU. The final result for Run 1 was an overall precision of 0.465 compared to Run 2's 0.450, and Run 1's recall was 0.591 compared to Run 2's 0.586.
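The "Overall" rows in Tables 2 and 3 are the unweighted (macro) averages of the per-label scores (e.g., the per-label precisions of Table 2 average to 0.465), so the evaluation can be reproduced along the following lines. This sketch assumes scikit-learn and uses toy labels for illustration.

```python
# Sketch of the per-label and overall (macro-averaged) evaluation behind
# Tables 2 and 3, assuming scikit-learn. Labels here are toy examples.
from sklearn.metrics import classification_report, precision_recall_fscore_support

y_true = ["FAC", "ARG", "Ratio", "FAC", "STA"]   # gold rhetorical roles
y_pred = ["FAC", "ARG", "FAC", "FAC", "STA"]     # model predictions

# Per-label precision, recall, and F-score, as in the table rows.
print(classification_report(y_true, y_pred, zero_division=0))

# The "Overall" row: unweighted mean over the seven labels.
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Overall  P={p:.3f}  R={r:.3f}  F1={f:.3f}")
```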
6. Conclusion

From the overall experiments carried out on the legal corpus, we conclude that ERNIE 2.0 emerges as the better analyzer of legal text, with the strongest ability to capture its underlying meaning. The seven labels in the corpus overlap contextually, which makes it difficult for many models to achieve high performance; however, better deep learning approaches using the advanced embeddings of ERNIE 2.0 make this problem easier to solve. For future work, ensembling RoBERTa, ERNIE 2.0, and LawBERT could give promising results, along with further exploration of pre-processing based on corresponding token lengths, as attempted in our proposed methodology.

References

[1] V. Parikh, V. Mathur, P. Mehta, N. Mittal, P. Majumder, LawSum: A weakly supervised approach for Indian legal document summarization, arXiv preprint arXiv:2110.01188v3 (2021).
[2] M. Saravanan, B. Ravindran, S. Raman, Automatic identification of rhetorical roles using conditional random fields for legal document summarization, in: Proceedings of the Third International Joint Conference on Natural Language Processing: Volume I, 2008.
[3] J. Savelka, K. D. Ashley, Segmenting U.S. court decisions into functional and issue specific parts, in: JURIX, 2018, pp. 111–120.
[4] I. Nejadgholi, R. Bougueng, S. Witherspoon, A semi-supervised training method for semantic search of legal facts in Canadian immigration cases, in: JURIX, 2017, pp. 125–134.
[5] V. R. Walker, K. Pillaipakkamnatt, A. M. Davidson, M. Linares, D. J. Pesce, Automatic classification of rhetorical roles for sentences: Comparing rule-based scripts with machine learning, in: ASAIL@ICAIL, 2019.
[6] S. Ghosh, A. Wyner, Identification of rhetorical roles of sentences in Indian legal judgments, in: Legal Knowledge and Information Systems: JURIX 2019: The Thirty-second Annual Conference, volume 322, IOS Press, 2019, p. 3.
[7] S. B. Majumder, D. Das, Rhetorical role labelling for legal judgements using RoBERTa, in: FIRE (Working Notes), 2020, pp. 22–25.
[8] Y. Xu, T. Li, Z. Han, The language model for legal retrieval and BERT-based model for rhetorical role labeling for legal judgments, in: FIRE (Working Notes), 2020, pp. 71–75.
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[11] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, H. Wang, ERNIE 2.0: A continual pre-training framework for language understanding, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 8968–8975.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.