Fine-grained Intent Classification in the Legal Domain

Ankan Mullick*1, Abhilash Nandy*1, Manav Nitin Kapadnis*2, Sohan Patnaik3 and R Raghav4
1 Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur
2 Department of Electrical Engineering, Indian Institute of Technology Kharagpur
3 Department of Mechanical Engineering, Indian Institute of Technology Kharagpur
4 Department of Industrial and Systems Engineering, Indian Institute of Technology Kharagpur
* Equal Contribution

SDU@AAAI-22: Workshop on Scientific Document Understanding at the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
A law practitioner has to go through many long legal case proceedings. To understand the motivation behind the actions of the different parties or individuals in a legal case, it is essential that the parts of the document expressing an intent relevant to the case are clearly identified. In this paper, we introduce a dataset of 93 legal documents, belonging to the case categories of Murder, Land Dispute, Robbery, or Corruption, in which phrases expressing an intent matching the category of the document are annotated. We also annotate a fine-grained intent for each such phrase to enable a deeper understanding of the case. Finally, we analyze the performance of several transformer-based models in automating the extraction of intent phrases (at both a coarse and a fine-grained level) and in classifying a document into one of the four possible categories, and observe that our dataset is challenging, especially for fine-grained intent classification.

Keywords: Legal, Fine-grained, Intent Classification.

1. Introduction

Documents which record legal case proceedings are often perused by many law practitioners. Any Court Judgement can contain as many as 4500 words (for example, Indian Supreme Court Judgements). Knowing the intent present in the text beforehand helps a person understand the case better; intent here refers to the intention latent in a piece of text. For example, in the sentence 'Mr. XYZ robbed a bank yesterday', the phrase 'robbed a bank' depicts the intent of Robbery.

There can be different levels of intent. For example, stating that a legal case deals with murder is a document-level intent: it conveys generalized information about the document. Sentence-level and phrase-level intents give much more information about the document. Various summarization techniques exist to help a reader digest such documents efficiently. However, an analysis of intents conditioned on the legal case, alongside summarization, would significantly improve the reader's understanding of the content of the document.

We curate a dataset of 93 legal documents spread across four intents: Murder, Robbery, Land Dispute and Corruption. We manually annotate phrases that bring out the intent of the document. Additionally, we painstakingly assign a fine-grained intent (referred to as 'sub-intent' interchangeably from here on) to each phrase. These intent phrases are thus annotated both in a coarse manner (4 categories) and in a fine-grained manner (with several sub-intents in each intent category). For example, under the intent of Robbery, 'Mr. ABC saw Mr. XYZ picking the lock of the neighbour's house' is an example of a witness sub-intent, while 'Gold and silver ornaments missing' indicates the stolen items.

Another contribution is an analysis of different off-the-shelf models on intent-based tasks. We finally present a proof-of-concept showing that coarse-grained document intent and document classification, as well as fine-grained annotation of phrases in legal documents, can be automated with reasonable accuracy.

2. Dataset Description

5000 legal documents are scraped from CommonLII (http://www.commonlii.org/resources/221.html) using the 'selenium' Python package. 93 documents belonging to the categories of Corruption, Murder, Land Dispute, and Robbery are randomly sampled from this larger set.
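As a rough illustration of this scraping step, the snippet below is a minimal sketch using selenium. It assumes a locally available Chrome driver and that the CommonLII index page links directly to case pages; the link filter and output file names are placeholders rather than the exact pipeline used for the paper.

```python
# Minimal scraping sketch. Assumptions: a Chrome driver is available, the
# index page links directly to case documents, and the ".html" filter plus
# the output file names are placeholders, not the exact pipeline used.
from selenium import webdriver
from selenium.webdriver.common.by import By

INDEX_URL = "http://www.commonlii.org/resources/221.html"

driver = webdriver.Chrome()
driver.get(INDEX_URL)

# Collect candidate links to case documents from the index page.
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
case_links = [url for url in links if url and url.endswith(".html")]

# Visit each case page and dump its visible text to a local file.
for i, url in enumerate(case_links):
    driver.get(url)
    body_text = driver.find_element(By.TAG_NAME, "body").text
    with open(f"case_{i}.txt", "w", encoding="utf-8") as f:
        f.write(body_text)

driver.quit()
```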
Intent phrases are annotated for each document in the following manner:

1. Initial filtering: 2 annotators identify the sentences that convey an intent matching the category of the document at hand.

2. Intent phrase annotation: 2 other annotators then extract a span from each such sentence, so as to exclude any details that do not contribute to the intent (such as the name of a person or the date of an incident) and include only the words expressing the corresponding intent. The resulting spans are the intent phrases. Inter-annotator agreement (Cohen's κ) is 0.79.

3. Sub-intent annotation: 1 annotator who is familiar with legal terminology goes through the intent phrases of several documents from all 4 intent categories in order to come up with a set of sub-intents for each intent category that covers almost all aspects of that category. Once the sets of sub-intents are fixed, 4 annotators are shown some samples of how to annotate the sub-intent for a given phrase. The intent phrases are then divided amongst these annotators, and the sub-intent of each intent phrase is annotated.

Table 1
Statistics for each category in the dataset. The numbers (other than the average sentiment score) are rounded to the nearest integer.

| Category | No. of documents | Avg. no. of words/doc | Avg. no. of sentences/doc | Avg. length of intent phrase | Avg. sentiment score of intent phrases |
|---|---|---|---|---|---|
| Corruption | 17 | 4466 | 174 | 17 | 0.008 |
| Land Dispute | 25 | 4681 | 186 | 19 | 0.02 |
| Murder | 30 | 2876 | 135 | 17 | -0.012 |
| Robbery | 21 | 2756 | 118 | 9 | -0.002 |

Table 1 shows the statistics of our dataset: the number of documents, the average length of documents and intent phrases, and the average sentiment score for each of the 4 intent categories. The documents on Corruption and Land Dispute are roughly longer than those on Murder and Robbery. Table 1 also shows the average sentiment score across the annotated intent phrases (calculated using the sentifish Python package, https://pypi.org/project/sentifish/) for each of the four categories. The sentiment scores of the categories follow the order Land Dispute > Corruption > Robbery > Murder, which matches common intuition.

Fig. 1 shows the 200 most frequent words (excluding stopwords) occurring in the intent phrases for each of the four categories, with the font size of a word proportional to its frequency. In each wordcloud, we can observe that each category has words matching the corresponding intent (e.g. 'bribe' in Corruption, 'property' in Land Dispute).

[Figure 1: Wordclouds for each intent category, showing the 200 most frequently occurring words in the intent phrases for the corresponding category. Panels: (a) Corruption, (b) Land Dispute, (c) Murder, (d) Robbery.]
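Wordclouds such as those in Fig. 1 can be produced along the following lines. This is a minimal sketch using the wordcloud package; it assumes the annotated intent phrases of a category are already collected into a list, and the example phrases and output file name are placeholders.

```python
# Minimal wordcloud sketch for one category. Assumes the annotated intent
# phrases of that category are already loaded into `phrases`; the two example
# phrases and the output file name are placeholders.
from wordcloud import WordCloud, STOPWORDS

phrases = [
    "gold and silver ornaments missing",
    "picking the lock of the neighbour's house",
]

wordcloud = WordCloud(
    max_words=200,             # keep the 200 most frequent words, as in Fig. 1
    stopwords=STOPWORDS,       # exclude common English stopwords
    background_color="white",
    width=800,
    height=400,
).generate(" ".join(phrases))

wordcloud.to_file("robbery_wordcloud.png")
```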
3. Experiment and Results

This section first describes the use of transformers [1] for document classification, followed by the use of JointBERT [2] for intent as well as slot classification. We use two Tesla P100 GPUs with 16 GB RAM for all experiments.

3.1. Document Classification

Recent advances show that Transformer-based [1] pre-trained language models such as BERT [3], RoBERTa [4], ALBERT [5], and DeBERTa [6] are very successful at learning robust context-based representations and at achieving state-of-the-art performance on a variety of downstream tasks, including document classification as in our case.

Table 2
Results of transformer models on document classification.

| Model Name | Accuracy | Macro F1-score |
|---|---|---|
| BERT | 0.63 | 0.53 |
| RoBERTa | 0.74 | 0.64 |
| ALBERT | 0.53 | 0.61 |
| DeBERTa | 0.74 | 0.71 |
| LEGAL-BERT | 0.74 | 0.68 |
| LEGAL-RoBERTa | 0.68 | 0.69 |

We implemented the models listed in Table 2 to learn contextual representations of the documents, whose outputs were then fed to a softmax layer to obtain the final predicted class of the document. Along with these, we also implemented LEGAL-BERT [7] and LEGAL-RoBERTa (https://huggingface.co/saibo/legal-roberta-base), variants pre-trained on large-scale legal domain-specific corpora, which led to better scores than their counterparts pre-trained on general corpora.

Recent improvements to the state of the art in contextual language models, such as DeBERTa, perform significantly better than BERT. This is also observed in Table 2: the Accuracy and Macro F1-score of DeBERTa are the highest among the models, while LEGAL-BERT is at par with DeBERTa in terms of Accuracy. DeBERTa is pre-trained with a disentangled attention mechanism and an enhanced mask decoder, while its training method is otherwise the same as that of BERT; owing to this novel attention mechanism, it outperforms the other models in both Accuracy and Macro F1-score. LEGAL-BERT, on the other hand, is pre-trained and further fine-tuned on legal domain-specific corpora, which leads to its strong performance on various legal domain-specific tasks; in our case it outperforms most of the other models since its contextual representations are more attuned to legal matters.

All of the transformer models were implemented using sliding window attention [8], since the length of every document exceeds the maximum input size of the transformers. They were trained with a sliding window ratio of 20% over three epochs, with the learning rate and batch size set to 2e-5 and 32 respectively; a simplified sketch of such a windowing scheme is shown below.
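The snippet below is a simplified sketch of such a windowing scheme, not the exact implementation of [8] used in the paper. It assumes 512-token windows with 20% overlap between neighbouring windows and averages the per-window logits; the model name and the aggregation strategy are placeholders.

```python
# Simplified sliding-window classification sketch. Assumptions: 512-token
# windows whose neighbours overlap by 20%, and window logits averaged before
# the argmax; treat the window arithmetic and aggregation as illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder; RoBERTa, DeBERTa, LEGAL-BERT, ... can be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
model.eval()

def classify_long_document(text, window=512, overlap_ratio=0.2):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    body = window - 2                              # leave room for [CLS] and [SEP]
    stride = int(body * (1 - overlap_ratio))       # consecutive windows share 20% of tokens
    logits = []
    for start in range(0, max(len(ids), 1), stride):
        chunk = [tokenizer.cls_token_id] + ids[start:start + body] + [tokenizer.sep_token_id]
        input_ids = torch.tensor([chunk])
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=torch.ones_like(input_ids))
        logits.append(out.logits)
    # Average the per-window logits and return the predicted class index.
    return torch.stack(logits).mean(dim=0).argmax(dim=-1).item()
```

Averaging logits over windows is only one simple aggregation choice; max-pooling or majority voting over windows would fit the same scheme.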
The documents in the dataset are randomly split into train, validation and test sets in the ratio 6:2:2. Note that, when classifying fine-grained intents, we only consider those sub-intents that have at least 50 corresponding phrases. We report the Accuracy and Macro-averaged F1-score for each model so as to get an idea of how state-of-the-art transformer-based architectures perform on document classification in the legal domain.

3.2. JointBERT

We implemented BERT for joint intent classification and slot filling [2] on our dataset. We also replaced the BERT backbone with other transformer-based models, namely DistilBERT and ALBERT. Slot filling is a sequence labelling task in which BIO tags are assigned for the classes 'Corruption', 'Land Dispute', 'Robbery' and 'Murder'; intent classification is then performed over the same classes. The dataset is prepared in the following manner: since the 'O' tag dominates the slot filling task, only the sentences containing an intent phrase, together with the sentence before and the sentence after, are used for training, in order to mitigate class imbalance. Each token has a BIO slot tag, and each sentence with an intent phrase has a target intent (a small tagging sketch is given at the end of this subsection). We randomly selected 20% of the samples for testing and 20% for validation; the remaining 60% were used for training.

The models were trained for 10 epochs with a batch size of 16 and a learning rate of 2e-5. A checkpoint was saved at each epoch, and the checkpoint with the highest validation accuracy was used for evaluation on the test set. As can be seen from Table 3, BERT proved to be the best model, with an Intent Accuracy as well as an Intent Macro F1-score of 0.90.

Table 3
Results on intent classification.

| Model Name | Intent Accuracy | Intent Macro F1-score |
|---|---|---|
| BERT | 0.90 | 0.90 |
| DistilBERT | 0.90 | 0.89 |
| ALBERT | 0.88 | 0.87 |

Table 4 gives the evaluation metric scores for each intent separately. The analysis shows that the transformer-based models perform poorly on the Corruption intent, for which the number of documents is the lowest, whereas they perform significantly better on the other intents.

Table 4
Results of JointBERT on intent classification.

| Intent | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Corruption | 0.75 | 0.89 | 0.81 | 27 |
| Land Dispute | 0.95 | 0.88 | 0.91 | 42 |
| Murder | 0.94 | 0.94 | 0.94 | 50 |
| Robbery | 0.96 | 0.89 | 0.92 | 27 |
| Macro Average | 0.90 | 0.90 | 0.90 | 146 |

Table 5 enumerates the results of JointBERT on the slot classification task. The model performs best on the Murder intent, which again is due to the Murder category having the largest number of samples.

Table 5
Results of JointBERT on slot classification.

| Intent | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Corruption | 0.74 | 0.38 | 0.51 | 326 |
| Land Dispute | 0.71 | 0.55 | 0.62 | 317 |
| Murder | 0.80 | 0.63 | 0.70 | 361 |
| Robbery | 0.66 | 0.53 | 0.59 | 137 |
| Macro Average | 0.73 | 0.52 | 0.60 | 1041 |

Table 6 provides the classification accuracy and Intent Macro F1-score on the fine-grained intent classification task. As the intents become more specific, the scores drop significantly, showing that the models are unable to capture the in-depth context of the intent phrases. However, the model with the BERT backbone still performs the best. This can be attributed to the fact that BERT has the highest number of parameters (~110 million) compared to ALBERT (~31 million) and DistilBERT (~50 million).

Table 6
Results on fine-grained intent classification.

| Model Name | Intent Accuracy | Intent Macro F1-score |
|---|---|---|
| BERT | 0.53 | 0.50 |
| DistilBERT | 0.46 | 0.40 |
| ALBERT | 0.48 | 0.47 |

Table 7 provides the precision, recall and macro F1-score for fine-grained intent classification for the best performing of the three models, i.e., JointBERT with a BERT backbone. The labels are of the form X_Y, where X is an intent (e.g. Robbery) and Y is a fine-grained intent/sub-intent (e.g. action). We observe that, even though the number of training samples per fine-grained class is quite low, performance on the test set is quite good: the F1-score for all classes is above 0.4, and except for two classes it is above the halfway mark of 0.5.

Table 7
Results of JointBERT on fine-grained intent classification.

| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Corruption_action | 0.46 | 0.60 | 0.52 | 10 |
| Land_Dispute_action | 0.54 | 0.70 | 0.61 | 20 |
| Land_Dispute_description | 0.60 | 0.35 | 0.44 | 17 |
| Murder_action | 0.57 | 0.48 | 0.52 | 25 |
| Murder_description | 0.44 | 0.71 | 0.54 | 24 |
| Murder_evidence | 0.38 | 0.23 | 0.29 | 13 |
| Robbery_action | 0.71 | 0.63 | 0.67 | 19 |
| Robbery_description | 0.67 | 0.33 | 0.44 | 12 |
| Macro Average | 0.54 | 0.50 | 0.50 | 140 |

Note that we have not reported slot classification results for the fine-grained intents. This is because the number of labels nearly doubles in this case compared to intent classification (since there is a B and an I tag for each fine-grained intent, plus an additional O class, as we use BIO tags for annotation). Hence, the number of samples per class is insufficient to learn a good slot classifier.
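As a rough illustration of the data preparation described at the beginning of this subsection, the sketch below converts a sentence with an annotated intent phrase into word-level BIO slot tags plus a sentence-level intent label. The whitespace tokenisation and the example sentence are illustrative assumptions, not the exact preprocessing used here.

```python
# Minimal BIO-tagging sketch for the joint intent/slot data preparation.
# Whitespace tokenisation and the example sentence are illustrative only; the
# actual dataset uses the annotated spans described in Section 2.
def bio_tag(sentence, intent_phrase, intent):
    """Return (tokens, slot_tags, intent) for one annotated sentence."""
    tokens = sentence.split()
    phrase = intent_phrase.split()
    tags = ["O"] * len(tokens)
    # Find the annotated phrase and mark it with B-/I- tags of its intent.
    for i in range(len(tokens) - len(phrase) + 1):
        if tokens[i:i + len(phrase)] == phrase:
            tags[i] = f"B-{intent}"
            for j in range(i + 1, i + len(phrase)):
                tags[j] = f"I-{intent}"
            break
    return tokens, tags, intent

tokens, tags, label = bio_tag("Mr. XYZ robbed a bank yesterday", "robbed a bank", "Robbery")
# tags -> ['O', 'O', 'B-Robbery', 'I-Robbery', 'I-Robbery', 'O'], label -> 'Robbery'
```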
4. Discussion

We observe that, although transformer-based models perform well on document classification and coarse-grained intent classification, there is a clear need for better performance on fine-grained intent classification. Hence, we argue that our dataset could be a crucial starting point for research on fine-grained intent classification in the legal domain.

5. Conclusion

This paper presents a new dataset with coarse- and fine-grained intent annotations, and shows a proof-of-concept of how document classification as well as intent classification can be automated with reasonably good results. We use different transformer-based models for document classification and observe that DeBERTa performs the best. We use transformer-based models such as BERT, ALBERT and DistilBERT as backbones of a joint intent and slot classification network, and observe that BERT performs the best among the three, in both coarse- and fine-grained intent classification. However, our dataset remains challenging, as there is considerable scope for improvement in the results, especially in fine-grained intent classification. Hence, our dataset could serve as a crucial benchmark for fine-grained intent classification in the legal domain.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[2] Q. Chen, Z. Zhuo, W. Wang, BERT for joint intent classification and slot filling, 2019. arXiv:1902.10909.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[4] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[5] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, 2020. arXiv:1909.11942.
[6] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021. arXiv:2006.03654.
[7] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, 2020. arXiv:2010.02559.
[8] M. A. Masood, R. A. Abbasi, N. Wee Keong, Context-aware sliding window for sentiment classification, IEEE Access 8 (2020) 4870–4884. doi:10.1109/ACCESS.2019.2963586.