Fine-grained Intent Classification in the Legal Domain

Ankan Mullick*1, Abhilash Nandy*1, Manav Nitin Kapadnis*2, Sohan Patnaik3 and R Raghav4
1 Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur
2 Department of Electrical Engineering, Indian Institute of Technology Kharagpur
3 Department of Mechanical Engineering, Indian Institute of Technology Kharagpur
4 Department of Industrial and Systems Engineering, Indian Institute of Technology Kharagpur
* Equal Contribution

SDU@AAAI-22: Workshop on Scientific Document Understanding at the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
A law practitioner has to go through many long legal case proceedings. To understand the motivation behind the actions of the different parties or individuals in a legal case, it is essential that the parts of the document expressing an intent relevant to the case are clearly identified. In this paper, we introduce a dataset of 93 legal documents, belonging to the case categories of Murder, Land Dispute, Robbery, or Corruption, in which phrases expressing an intent matching the category of the document are annotated. We also annotate a fine-grained intent for each such phrase to enable a deeper understanding of the case. Finally, we analyze the performance of several transformer-based models in automating the extraction of intent phrases (at both a coarse and a fine-grained level) and in classifying a document into one of the four possible categories, and observe that our dataset is challenging, especially for fine-grained intent classification.

Keywords: Legal, Fine-grained, Intent Classification.

1. Introduction

Documents which record legal case proceedings are often perused by many law practitioners. Any Court Judgement can contain as many as 4500 words (for example, Indian Supreme Court Judgements). Knowing the intent present in the text beforehand helps a person understand the case better; intent here refers to the intention latent in a piece of text. For example, in the sentence 'Mr. XYZ robbed a bank yesterday', the phrase 'robbed a bank' depicts the intent of Robbery.

There can be different levels of intent. For example, stating that a legal case deals with murder is a document-level intent: it conveys generalized information about the document. Sentence-level and phrase-level intents give much more information about the document. Various summarization techniques exist to help a reader digest such documents efficiently. However, an analysis of intents conditioned on the legal case, alongside summarization, would significantly improve the reader's understanding of the content of the document.

We curate a dataset of 93 legal documents spread across four intents: Murder, Robbery, Land Dispute and Corruption. We manually annotate phrases that bring out the intent of the document. Additionally, we painstakingly assign a fine-grained intent (referred to as 'sub-intent' interchangeably from here on) to each phrase. These intent phrases are thus annotated both in a coarse manner (4 categories) and in a fine-grained manner (with several sub-intents in each intent category). For example, under the intent of Robbery, 'Mr. ABC saw Mr. XYZ picking the lock of the neighbour's house' is an example of a witness sub-intent, while 'Gold and silver ornaments missing' indicates the stolen items.

Another contribution is an analysis of different off-the-shelf models on intent-based tasks. We finally present a proof-of-concept showing that coarse-grained document intent and document classification, as well as fine-grained annotation of phrases in legal documents, can be automated with reasonable accuracy.

2. Dataset Description

5000 legal documents are scraped from CommonLII (http://www.commonlii.org/resources/221.html) using the 'selenium' Python package. 93 documents belonging to the categories of Corruption, Murder, Land Dispute, and Robbery are randomly sampled from this larger set.
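As a rough illustration of this scraping step, the snippet below is a minimal sketch using selenium. It assumes a locally available Chrome driver and that the CommonLII index page links directly to case pages; the link filter and output file names are placeholders rather than the exact pipeline used for the paper.

```python
# Minimal scraping sketch. Assumptions: a Chrome driver is available, the
# index page links directly to case documents, and the ".html" filter plus
# the output file names are placeholders, not the exact pipeline used.
from selenium import webdriver
from selenium.webdriver.common.by import By

INDEX_URL = "http://www.commonlii.org/resources/221.html"

driver = webdriver.Chrome()
driver.get(INDEX_URL)

# Collect candidate links to case documents from the index page.
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
case_links = [url for url in links if url and url.endswith(".html")]

# Visit each case page and dump its visible text to a local file.
for i, url in enumerate(case_links):
    driver.get(url)
    body_text = driver.find_element(By.TAG_NAME, "body").text
    with open(f"case_{i}.txt", "w", encoding="utf-8") as f:
        f.write(body_text)

driver.quit()
```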
Intent phrases are annotated for each document in the following manner:

1. Initial filtering: 2 annotators identify the sentences that convey an intent matching the category of the document at hand.

2. Intent phrase annotation: 2 other annotators then extract a span from each such sentence, so as to exclude any details that do not contribute to the intent (such as the name of a person or the date of an incident) and include only the words expressing the corresponding intent. The resulting spans are the intent phrases. Inter-annotator agreement (Cohen's κ) is 0.79.

3. Sub-intent annotation: 1 annotator who is familiar with legal terminology goes through the intent phrases of several documents from all 4 intent categories in order to come up with a set of sub-intents for each intent category that covers almost all aspects of that category. Once the sets of sub-intents are fixed, 4 annotators are shown some samples of how to annotate the sub-intent for a given phrase. The intent phrases are then divided amongst these annotators, and the sub-intent of each intent phrase is annotated.

Table 1
Statistics for each category in the dataset. The numbers (other than the average sentiment score) are rounded to the nearest integer.

| Category | No. of documents | Avg. no. of words/doc | Avg. no. of sentences/doc | Avg. length of intent phrase | Avg. sentiment score of intent phrases |
|---|---|---|---|---|---|
| Corruption | 17 | 4466 | 174 | 17 | 0.008 |
| Land Dispute | 25 | 4681 | 186 | 19 | 0.02 |
| Murder | 30 | 2876 | 135 | 17 | -0.012 |
| Robbery | 21 | 2756 | 118 | 9 | -0.002 |

Table 1 shows the statistics of our dataset: the number of documents, the average length of documents and intent phrases, and the average sentiment score for each of the 4 intent categories. The documents on Corruption and Land Dispute are roughly longer than those on Murder and Robbery. Table 1 also shows the average sentiment score across the annotated intent phrases (calculated using the sentifish Python package, https://pypi.org/project/sentifish/) for each of the four categories. The sentiment scores of the categories follow the order Land Dispute > Corruption > Robbery > Murder, which matches common intuition.

Fig. 1 shows the 200 most frequent words (excluding stopwords) occurring in the intent phrases for each of the four categories, with the font size of a word proportional to its frequency. In each wordcloud, we can observe that each category has words matching the corresponding intent (e.g. 'bribe' in Corruption, 'property' in Land Dispute).

[Figure 1: Wordclouds for each intent category, showing the 200 most frequently occurring words in the intent phrases for the corresponding category. Panels: (a) Corruption, (b) Land Dispute, (c) Murder, (d) Robbery.]
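Wordclouds such as those in Fig. 1 can be produced along the following lines. This is a minimal sketch using the wordcloud package; it assumes the annotated intent phrases of a category are already collected into a list, and the example phrases and output file name are placeholders.

```python
# Minimal wordcloud sketch for one category. Assumes the annotated intent
# phrases of that category are already loaded into `phrases`; the two example
# phrases and the output file name are placeholders.
from wordcloud import WordCloud, STOPWORDS

phrases = [
    "gold and silver ornaments missing",
    "picking the lock of the neighbour's house",
]

wordcloud = WordCloud(
    max_words=200,             # keep the 200 most frequent words, as in Fig. 1
    stopwords=STOPWORDS,       # exclude common English stopwords
    background_color="white",
    width=800,
    height=400,
).generate(" ".join(phrases))

wordcloud.to_file("robbery_wordcloud.png")
```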
3. Experiment and Results

This section first describes the use of transformers [1] for document classification, followed by the use of JointBERT [2] for intent as well as slot classification. We use two Tesla P100 GPUs with 16 GB RAM for all experiments.

3.1. Document Classification

Recent advances show that Transformer-based [1] pre-trained language models such as BERT [3], RoBERTa [4], ALBERT [5], and DeBERTa [6] are very successful at learning robust context-based representations and at achieving state-of-the-art performance on a variety of downstream tasks, including document classification as in our case.

Table 2
Results of transformer models on document classification.

| Model Name | Accuracy | Macro F1-score |
|---|---|---|
| BERT | 0.63 | 0.53 |
| RoBERTa | 0.74 | 0.64 |
| ALBERT | 0.53 | 0.61 |
| DeBERTa | 0.74 | 0.71 |
| LEGAL-BERT | 0.74 | 0.68 |
| LEGAL-RoBERTa | 0.68 | 0.69 |

We implemented the models listed in Table 2 to learn contextual representations of the documents, whose outputs were then fed to a softmax layer to obtain the final predicted class of the document. Along with these, we also implemented LEGAL-BERT [7] and LEGAL-RoBERTa (https://huggingface.co/saibo/legal-roberta-base), variants pre-trained on large-scale legal domain-specific corpora, which led to better scores than their counterparts pre-trained on general corpora.

Recent improvements to the state of the art in contextual language models, such as DeBERTa, perform significantly better than BERT. This is also observed in Table 2: the Accuracy and Macro F1-score of DeBERTa are the highest among the models, while LEGAL-BERT is at par with DeBERTa in terms of Accuracy. DeBERTa is pre-trained with a disentangled attention mechanism and an enhanced mask decoder, while its training method is otherwise the same as that of BERT; owing to this novel attention mechanism, it outperforms the other models in both Accuracy and Macro F1-score. LEGAL-BERT, on the other hand, is pre-trained and further fine-tuned on legal domain-specific corpora, which leads to its strong performance on various legal domain-specific tasks; in our case it outperforms most of the other models since its contextual representations are more attuned to legal matters.

All of the transformer models were implemented using sliding window attention [8], since the length of every document exceeds the maximum input size of the transformers. They were trained with a sliding window ratio of 20% over three epochs, with the learning rate and batch size set to 2e-5 and 32 respectively; a simplified sketch of such a windowing scheme is shown below.
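The snippet below is a simplified sketch of such a windowing scheme, not the exact implementation of [8] used in the paper. It assumes 512-token windows with 20% overlap between neighbouring windows and averages the per-window logits; the model name and the aggregation strategy are placeholders.

```python
# Simplified sliding-window classification sketch. Assumptions: 512-token
# windows whose neighbours overlap by 20%, and window logits averaged before
# the argmax; treat the window arithmetic and aggregation as illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder; RoBERTa, DeBERTa, LEGAL-BERT, ... can be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
model.eval()

def classify_long_document(text, window=512, overlap_ratio=0.2):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    body = window - 2                              # leave room for [CLS] and [SEP]
    stride = int(body * (1 - overlap_ratio))       # consecutive windows share 20% of tokens
    logits = []
    for start in range(0, max(len(ids), 1), stride):
        chunk = [tokenizer.cls_token_id] + ids[start:start + body] + [tokenizer.sep_token_id]
        input_ids = torch.tensor([chunk])
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=torch.ones_like(input_ids))
        logits.append(out.logits)
    # Average the per-window logits and return the predicted class index.
    return torch.stack(logits).mean(dim=0).argmax(dim=-1).item()
```

Averaging logits over windows is only one simple aggregation choice; max-pooling or majority voting over windows would fit the same scheme.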
The documents in the dataset are randomly split into train, validation and test sets in the ratio 6:2:2. Note that, when classifying fine-grained intents, we only consider those sub-intents that have at least 50 corresponding phrases. We report the Accuracy and Macro-averaged F1-score for each model so as to get an idea of how state-of-the-art transformer-based architectures perform on document classification in the legal domain.

3.2. JointBERT

We implemented BERT for joint intent classification and slot filling [2] on our dataset. We also replaced the BERT backbone with other transformer-based models, namely DistilBERT and ALBERT. Slot filling is a sequence labelling task in which BIO tags are assigned for the classes 'Corruption', 'Land Dispute', 'Robbery' and 'Murder'; intent classification is then performed over the same classes. The dataset is prepared in the following manner: since the 'O' tag dominates the slot filling task, only the sentences containing an intent phrase, together with the sentence before and the sentence after, are used for training, in order to mitigate class imbalance. Each token has a BIO slot tag, and each sentence with an intent phrase has a target intent (a small tagging sketch is given at the end of this subsection). We randomly selected 20% of the samples for testing and 20% for validation; the remaining 60% were used for training.

The models were trained for 10 epochs with a batch size of 16 and a learning rate of 2e-5. A checkpoint was saved at each epoch, and the checkpoint with the highest validation accuracy was used for evaluation on the test set. As can be seen from Table 3, BERT proved to be the best model, with an Intent Accuracy as well as an Intent Macro F1-score of 0.90.

Table 3
Results on intent classification.

| Model Name | Intent Accuracy | Intent Macro F1-score |
|---|---|---|
| BERT | 0.90 | 0.90 |
| DistilBERT | 0.90 | 0.89 |
| ALBERT | 0.88 | 0.87 |

Table 4 gives the evaluation metric scores for each intent separately. The analysis shows that the transformer-based models perform poorly on the Corruption intent, for which the number of documents is the lowest, whereas they perform significantly better on the other intents.

Table 4
Results of JointBERT on intent classification.

| Intent | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Corruption | 0.75 | 0.89 | 0.81 | 27 |
| Land Dispute | 0.95 | 0.88 | 0.91 | 42 |
| Murder | 0.94 | 0.94 | 0.94 | 50 |
| Robbery | 0.96 | 0.89 | 0.92 | 27 |
| Macro Average | 0.90 | 0.90 | 0.90 | 146 |

Table 5 enumerates the results of JointBERT on the slot classification task. The model performs best on the Murder intent, which again is due to the Murder category having the largest number of samples.

Table 5
Results of JointBERT on slot classification.

| Intent | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Corruption | 0.74 | 0.38 | 0.51 | 326 |
| Land Dispute | 0.71 | 0.55 | 0.62 | 317 |
| Murder | 0.80 | 0.63 | 0.70 | 361 |
| Robbery | 0.66 | 0.53 | 0.59 | 137 |
| Macro Average | 0.73 | 0.52 | 0.60 | 1041 |

Table 6 provides the classification accuracy and Intent Macro F1-score on the fine-grained intent classification task. As the intents become more specific, the scores drop significantly, showing that the models are unable to capture the in-depth context of the intent phrases. However, the model with the BERT backbone still performs the best. This can be attributed to the fact that BERT has the highest number of parameters (~110 million) compared to ALBERT (~31 million) and DistilBERT (~50 million).

Table 6
Results on fine-grained intent classification.

| Model Name | Intent Accuracy | Intent Macro F1-score |
|---|---|---|
| BERT | 0.53 | 0.50 |
| DistilBERT | 0.46 | 0.40 |
| ALBERT | 0.48 | 0.47 |

Table 7 provides the precision, recall and macro F1-score for fine-grained intent classification for the best performing of the three models, i.e., JointBERT with a BERT backbone. The labels are of the form X_Y, where X is an intent (e.g. Robbery) and Y is a fine-grained intent/sub-intent (e.g. action). We observe that, even though the number of training samples per fine-grained class is quite low, performance on the test set is quite good: the F1-score for all classes is above 0.4, and except for two classes it is above the halfway mark of 0.5.

Table 7
Results of JointBERT on fine-grained intent classification.

| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Corruption_action | 0.46 | 0.60 | 0.52 | 10 |
| Land_Dispute_action | 0.54 | 0.70 | 0.61 | 20 |
| Land_Dispute_description | 0.60 | 0.35 | 0.44 | 17 |
| Murder_action | 0.57 | 0.48 | 0.52 | 25 |
| Murder_description | 0.44 | 0.71 | 0.54 | 24 |
| Murder_evidence | 0.38 | 0.23 | 0.29 | 13 |
| Robbery_action | 0.71 | 0.63 | 0.67 | 19 |
| Robbery_description | 0.67 | 0.33 | 0.44 | 12 |
| Macro Average | 0.54 | 0.50 | 0.50 | 140 |

Note that we have not reported slot classification results for the fine-grained intents. This is because the number of labels nearly doubles in this case compared to intent classification (since there is a B and an I tag for each fine-grained intent, plus an additional O class, as we use BIO tags for annotation). Hence, the number of samples per class is insufficient to learn a good slot classifier.
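As a rough illustration of the data preparation described at the beginning of this subsection, the sketch below converts a sentence with an annotated intent phrase into word-level BIO slot tags plus a sentence-level intent label. The whitespace tokenisation and the example sentence are illustrative assumptions, not the exact preprocessing used here.

```python
# Minimal BIO-tagging sketch for the joint intent/slot data preparation.
# Whitespace tokenisation and the example sentence are illustrative only; the
# actual dataset uses the annotated spans described in Section 2.
def bio_tag(sentence, intent_phrase, intent):
    """Return (tokens, slot_tags, intent) for one annotated sentence."""
    tokens = sentence.split()
    phrase = intent_phrase.split()
    tags = ["O"] * len(tokens)
    # Find the annotated phrase and mark it with B-/I- tags of its intent.
    for i in range(len(tokens) - len(phrase) + 1):
        if tokens[i:i + len(phrase)] == phrase:
            tags[i] = f"B-{intent}"
            for j in range(i + 1, i + len(phrase)):
                tags[j] = f"I-{intent}"
            break
    return tokens, tags, intent

tokens, tags, label = bio_tag("Mr. XYZ robbed a bank yesterday", "robbed a bank", "Robbery")
# tags -> ['O', 'O', 'B-Robbery', 'I-Robbery', 'I-Robbery', 'O'], label -> 'Robbery'
```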
4. Discussion

We observe that, although transformer-based models perform well on document classification and coarse-grained intent classification, there is a clear need for better performance on fine-grained intent classification. Hence, we argue that our dataset could be a crucial starting point for research on fine-grained intent classification in the legal domain.

5. Conclusion

This paper presents a new dataset with coarse- and fine-grained intent annotations, and shows a proof-of-concept of how document classification as well as intent classification can be automated with reasonably good results. We use different transformer-based models for document classification and observe that DeBERTa performs the best. We use transformer-based models such as BERT, ALBERT and DistilBERT as backbones of a joint intent and slot classification network, and observe that BERT performs the best among the three, in both coarse- and fine-grained intent classification. However, our dataset remains challenging, as there is considerable scope for improvement in the results, especially in fine-grained intent classification. Hence, our dataset could serve as a crucial benchmark for fine-grained intent classification in the legal domain.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[2] Q. Chen, Z. Zhuo, W. Wang, BERT for joint intent classification and slot filling, 2019. arXiv:1902.10909.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[4] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[5] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, 2020. arXiv:1909.11942.
[6] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021. arXiv:2006.03654.
[7] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, 2020. arXiv:2010.02559.
[8] M. A. Masood, R. A. Abbasi, N. Wee Keong, Context-aware sliding window for sentiment classification, IEEE Access 8 (2020) 4870–4884. doi:10.1109/ACCESS.2019.2963586.