Legal Text Classification and Summarization using Transformers and Joint Text Features

Shaz Furniturewala, Racchit Jain, Vijay Kumari and Yashvardhan Sharma
Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani, Pilani Campus

Abstract
This paper presents the approaches undertaken for relevance classification of legal documents and for producing extractive summaries from them, for task 2 of the track 'Artificial Intelligence for Legal Assistance' [1] proposed by the Forum for Information Retrieval Evaluation in 2021 [2]. The approaches for relevance classification include fine-tuning BERT for the downstream task of relevance classification and using joint text features to classify relevance.

Keywords
relevance classification, joint text features, BERT, extractive summarization, AILA

Forum for Information Retrieval Evaluation 2021, December 13–17, 2021, India
f20200025@pilani.bits-pilani.ac.in (S. Furniturewala); f20190145@pilani.bits-pilani.ac.in (R. Jain); p20190065@pilani.bits-pilani.ac.in (V. Kumari); yash@pilani.bits-pilani.ac.in (Y. Sharma)
https://www.bits-pilani.ac.in/pilani/yash/profile (Y. Sharma)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Legal case documents have an extremely domain-specific language structure, so any kind of operation or analysis on these documents normally requires human legal experts who can perform such tasks with accuracy and speed. One of these tasks is the summarization of legal documents. The legal domain often requires summaries that provide compressed but accurate information about the judgements and decisions of a particular case; due to the specificity of the domain, this is usually done by a legal expert. The time and skill required to produce these summaries manually make them very expensive. There is therefore a need for an automated method of summarizing legal documents. This paper discusses approaches for extractive summarization of such documents.

The 'Artificial Intelligence for Legal Assistance' (AILA) track of FIRE 2021 comprised two tasks. This paper discusses task 2 of the track, 'Summarization of Legal Judgements'. Each team was provided with an annotated dataset of 500 Supreme Court legal documents along with a headnote summary for each of them. Every sentence in a document was given one of seven labels (Facts, Ruling by Lower Court, Argument, Statute, Precedent, Ratio of the decision, Ruling by Present Court) as well as a binary label indicating whether the sentence is relevant. The presented approach achieved 1st rank in task 2a of the track with a precision of 0.64, a recall of 0.58 and an F1 score of 0.59.

2. Related Work

In a comparative study of legal text summarization algorithms, Paheli Bhattacharya et al. [3] found that the state of the art in legal text summarization is achieved by extractive summarization algorithms that are specific to the legal domain. Another approach, by Atefeh Farzindar and Guy Lapalme [4], deconstructs the thematic structure of the legal text and identifies its themes to improve summarization. An innovative technique for legal text classification was proposed by Jiaming Gao et al. [5].
They created a joint feature vector of the legal text by concatenating a statistical feature vector (obtained using TF-IDF) and a semantic feature vector (obtained from BERT). This joint vector was then classified using different classifiers.

3. Dataset

The training dataset provided by AILA 2021 [6] contained 500 document-summary pairs. Each document was annotated by a legal expert, with every sentence marked with one of seven rhetorical labels as well as its relevance to the summary. The role labels are as follows:

1. Facts (FAC): sentences that describe the events that led to the filing of the case
2. Ruling by Lower Court (RLC): Indian Supreme Court cases are given a preliminary ruling by one of the lower courts, such as a Tribunal or a High Court; this role denotes sentences that state a ruling/decision by these lower courts
3. Argument (ARG): sentences that correspond to the arguments made by each of the opposing parties
4. Statute (STA): relevant statutes cited
5. Precedent (PRE): relevant precedents cited
6. Ratio of the decision (Ratio): sentences that state the rationale/reasoning given by the Supreme Court for the final judgement
7. Ruling by Present Court (RPC): sentences that state the final decision given by the Supreme Court for that case document

The training data contained 72,192 sentences as training samples. The test data consisted of 50 documents annotated with the 7 rhetorical roles, containing a total of 5,066 samples. Task 2a required us to label relevant sentences and task 2b required us to create summaries.

4. Proposed Technique

4.1. Task 2a

For this task we propose two techniques. The first is fine-tuning Legal-BERT [7], a language model pretrained on legal data, for the downstream task of relevance classification. The second is a joint text feature approach in which we concatenate statistical features, i.e. the TF-IDF vectors of the judgements, with deep semantic features generated by the Legal-BERT model.

4.1.1. Legal-BERT

This method utilizes Transformer-based models for the task of relevance classification and is similar to the method discussed in [8]. The proposed model uses a pretrained BERT [9] encoder called LEGAL-BERT-BASE, part of a family of BERT models designed to support natural language processing tasks in the legal domain.

4.1.2. Pretraining of Legal-BERT

The model was pretrained on 12 GB of legal data of various formats from various public sources. The pretraining corpora are listed in Table 1.

Table 1
Pretraining corpora

Corpus           No. of Documents   Total Size in GB   Repository
EU legislation   61,826             1.9 (16.5%)        EURLEX (eur-lex.europa.eu)
UK legislation   19,867             1.4 (12.2%)        LEGISLATION.GOV.UK
ECJ cases        19,867             0.6 (5.2%)         EURLEX
ECHR cases       12,554             0.5 (4.3%)         HUDOC
US court cases   164,141            3.2 (27.8%)        CASE LAW ACCESS PROJECT
US contracts     76,366             3.9 (34.0%)        SEC-EDGAR

The model has the same architecture as BERT-BASE: 12 layers, 768 hidden units and 12 attention heads, for a total of 110M parameters. LEGAL-BERT was trained for 1M steps, approximately 40 epochs over all of the corpora, with batches of 256 samples and sequences of up to 512 tokens.

4.1.3. Fine-Tuning of Legal-BERT

The BERT AutoTokenizer was used to tokenize the inputs. The entire pretrained LEGAL-BERT model was fine-tuned for the downstream task of classifying each sentence into one of two categories (relevant or irrelevant). Training data was fed to the model in batches of 32, the Adam optimizer was used with a learning rate of 1e-4, the seed value was set to 42, and the model was fine-tuned for 2 epochs.
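To make this setup concrete, the following is a minimal sketch of such a fine-tuning loop using the Hugging Face transformers library. The checkpoint name nlpaueb/legal-bert-base-uncased, the use of AdamW in place of plain Adam, and the placeholder sentences and labels are assumptions made for illustration rather than the authors' exact code.

```python
# Minimal fine-tuning sketch for binary relevance classification.
# Assumptions (not from the paper): the Hugging Face checkpoint
# "nlpaueb/legal-bert-base-uncased" and the placeholder sentences/labels.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

torch.manual_seed(42)  # seed value used in the paper

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "nlpaueb/legal-bert-base-uncased", num_labels=2)  # relevant / irrelevant

sentences = ["The appellant filed a writ petition before the High Court."]  # placeholder
labels = torch.tensor([1])  # 1 = relevant, 0 = irrelevant (placeholder)

enc = tokenizer(sentences, padding=True, truncation=True, max_length=512,
                return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=32, shuffle=True)  # batch size of 32, as above

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for epoch in range(2):  # fine-tuned for 2 epochs
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
```

In practice, the 72,192 labelled training sentences would replace the placeholders, and the fine-tuned model would then predict a relevance label for every test sentence.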
4.1.4. Joint Text Features

Following the conclusions of Jiaming Gao et al. [5] on legal text classification, we converted the legal text into statistical features and semantic features and combined them for the classification task.

The TF-IDF vector of the text was used as the statistical feature. The vector for the training data was obtained with the TfidfVectorizer tool of scikit-learn, and its dimensionality was then reduced from 30,000 to 5,000 using Latent Semantic Analysis [10] (the TruncatedSVD tool of scikit-learn).

The semantic feature of the text was obtained from LEGAL-BERT: the feature vector of the CLS token was extracted from the output of the last hidden layer, giving a 768-dimensional deep semantic feature of the legal text. The CLS (classification) token is a fixed token placed at the beginning of every sentence, and its output representation is influenced by all the words in the sentence. The CLS vector therefore summarises BERT's understanding of the sentence, which makes it well suited to a sentence classification task.

The final joint text feature was created by concatenating these two feature vectors into a 5,768-dimensional vector, which was then classified with a Support Vector Machine and with Logistic Regression.

Figure 1: Joint Text Feature model as described in Jiaming Gao et al. [5]
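The pipeline above can be sketched as follows with scikit-learn and transformers. This is an illustrative outline under assumptions: the checkpoint name nlpaueb/legal-bert-base-uncased, the LinearSVC and LogisticRegression classifier choices, and the helper functions build_joint_features and train_classifiers are not taken from the paper.

```python
# Sketch of the joint text feature pipeline: TF-IDF + LSA statistical features
# concatenated with the 768-dimensional LEGAL-BERT [CLS] vector.
import numpy as np
import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "nlpaueb/legal-bert-base-uncased"  # assumed checkpoint


def build_joint_features(sentences):
    """Return a (len(sentences), 5768) joint feature matrix.

    Assumes enough sentences/vocabulary for 30,000 TF-IDF terms and
    5,000 SVD components, as described in the paper.
    """
    # Statistical features: 30,000-term TF-IDF reduced to 5,000 dims with LSA.
    tfidf = TfidfVectorizer(max_features=30000)
    svd = TruncatedSVD(n_components=5000)
    stat = svd.fit_transform(tfidf.fit_transform(sentences))

    # Semantic features: [CLS] vector from the last hidden layer of LEGAL-BERT.
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    bert = AutoModel.from_pretrained(MODEL_NAME)
    enc = tok(list(sentences), padding=True, truncation=True,
              max_length=512, return_tensors="pt")
    with torch.no_grad():
        cls = bert(**enc).last_hidden_state[:, 0, :].numpy()

    # Joint feature: concatenation gives 5,000 + 768 = 5,768 dimensions.
    return np.hstack([stat, cls])


def train_classifiers(sentences, labels):
    """Fit the SVM and Logistic Regression classifiers on the joint features."""
    joint = build_joint_features(sentences)
    svm_clf = LinearSVC().fit(joint, labels)
    lr_clf = LogisticRegression(max_iter=1000).fit(joint, labels)
    return svm_clf, lr_clf
```

In a full implementation the TF-IDF vectorizer and the SVD projection would be fitted once on the training sentences and then reused unchanged to featurize the test sentences.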
4.2. Task 2b

For the purposes of extractive summarization, we utilized the results of the classification model. Sentences that were labeled relevant by the model were concatenated into one summary. This ensured that the semantic features learnt by the classification model were also used to write the extractive summary, leading to greater efficiency because a second network did not need to be trained. This approach was chosen based on the results obtained by Paheli Bhattacharya et al. [3], who found that extractive summarization methods specific to legal documents produced better results than state-of-the-art classical extractive summarization techniques and neural-network-based abstractive summarization techniques.
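As a small illustration of this step, the snippet below joins the sentences predicted as relevant into a single summary; the predict_relevance argument is a hypothetical stand-in for either classifier from Section 4.1.

```python
def extractive_summary(sentences, predict_relevance):
    """Concatenate the sentences predicted as relevant (label 1) into a summary.

    `predict_relevance` is a placeholder for either trained classifier from
    Section 4.1: it maps a list of sentences to a list of 0/1 labels.
    """
    labels = predict_relevance(sentences)
    return " ".join(s for s, y in zip(sentences, labels) if y == 1)
```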
5. Results and Evaluation

The submitted model achieved 1st rank based on precision, recall and F-scores for the task of relevance classification. The Legal-BERT approach achieved a precision of 0.64, a recall of 0.58 and an F1-score of 0.59. The joint text feature approach, combining deep semantic features from Legal-BERT with statistical features from TF-IDF vectors, gave an accuracy of 0.75 with the SVM classifier and 0.747 with the Logistic Regression classifier. The joint text feature model was trained on 20,000 sentences and tested on 5,233 sentences.

The summaries were evaluated on the basis of their ROUGE scores. The concatenation of relevant sentences classified by the Legal-BERT model gives the ROUGE scores reported in Table 2.

Table 2
ROUGE scores on the summary formed from relevant sentences given by the fine-tuned Legal-BERT model

ROUGE Test   Average recall   Average precision   Average F-score
ROUGE-1      0.49168          0.68037             0.53006
ROUGE-2      0.28433          0.39362             0.30703
ROUGE-3      0.19134          0.26491             0.20731
ROUGE-4      0.14920          0.20849             0.16243

6. Conclusion and Future Work

In this paper, we utilized two methods to solve a sentence classification problem. The first method used LEGAL-BERT, a BERT model pretrained entirely on legal domain data, to classify sentences. This gave us an accuracy of 0.78 (our own evaluation) when trained on 53,000 sentences and tested on 18,000 sentences. The second method created a joint feature vector of the legal text by combining statistical features, acquired using TF-IDF, with semantic text features extracted from LEGAL-BERT; this joint feature vector was then classified using SVM and Logistic Regression, giving an accuracy of 0.75 (our own evaluation) when trained on 20,000 sentences and tested on 5,000 sentences. That this accuracy is comparable to LEGAL-BERT's despite being obtained from a much smaller training set suggests that the joint feature method has the potential to provide even better results. In addition to classification, we concatenated the sentences labeled relevant by both models into extractive summaries, which also achieved strong ROUGE scores.

For future improvements, we could increase the training data and connect the BERT CLS vector to a classification network to allow better learning of semantic features. We could also take role-based filtering [11][12] into account and incorporate it into BERT. For text summarization, the convolutional model implemented by Misha Denil et al. [13] could be applied after concatenating relevant sentences to improve ROUGE scores.

References

[1] V. Parikh, U. Bhattacharya, P. Mehta, B. Ayan, P. Bhattacharya, K. Ghosh, S. Ghosh, A. Pal, A. Bhattacharya, P. Majumder, Overview of the third shared task on artificial intelligence for legal assistance at FIRE 2021, in: FIRE (Working Notes), 2021.
[2] V. Parikh, U. Bhattacharya, P. Mehta, B. Ayan, P. Bhattacharya, K. Ghosh, S. Ghosh, A. Pal, A. Bhattacharya, P. Majumder, FIRE 2021 AILA track: Artificial intelligence for legal assistance, in: Proceedings of the 13th Forum for Information Retrieval Evaluation, 2021.
[3] P. Bhattacharya, K. Hiware, S. Rajgaria, N. Pochhi, K. Ghosh, S. Ghosh, A comparative study of summarization algorithms applied to legal case judgments, 2019, pp. 413–428. doi:10.1007/978-3-030-15712-8_27.
[4] A. Farzindar, G. Lapalme, LetSum, an automatic legal text summarizing system, Jurix (2004) 11–18.
[5] J. Gao, H. Ning, Z. Han, L. Kong, H. Qi, Legal text classification model based on text statistical features and deep semantic features, 2020.
[6] V. Parikh, V. Mathur, P. Mehta, N. Mittal, P. Majumder, LawSum: A weakly supervised approach for Indian legal document summarization, arXiv preprint arXiv:2110.01188v3 (2021).
[7] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, arXiv preprint arXiv:2010.02559 (2020).
[8] R. Jain, A. Agarwal, Y. Sharma, SPECTRE@AILA-FIRE2020: Supervised rhetorical role labeling for legal judgments using transformers, 2020.
[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
[10] S. T. Dumais, Latent semantic analysis, Annual Review of Information Science and Technology 38 (2004) 188–230.
[11] P. Bhattacharya, S. Paul, K. Ghosh, S. Ghosh, A. Wyner, Identification of rhetorical roles of sentences in Indian legal judgments, in: Proc. International Conference on Legal Knowledge and Information Systems (JURIX), 2019.
[12] P. Bhattacharya, P. Mehta, K. Ghosh, S. Ghosh, A. Pal, A. Bhattacharya, P. Majumder, Overview of the FIRE 2020 AILA track: Artificial Intelligence for Legal Assistance, in: Proceedings of FIRE 2020 - Forum for Information Retrieval Evaluation, 2020.
[13] M. Denil, A. Demiraj, N. de Freitas, Extraction of salient sentences from labelled documents, 2014.