Rhetorical Labeling for Legal Judgements using fastText

Tebo Leburu-Dingalo, Edwin Thuma, Gontlafetse Mosweunyane and Nkwebi Peace Motlogelwa
Department of Computer Science, University of Botswana

Abstract
This paper describes our participating systems in the FIRE AILA 2021 shared task on predicting rhetorical roles for sentences in a legal judgement document. In particular, we propose three multi-class classifiers to predict for each sentence one of the following rhetorical roles: facts, arguments, ratio of the decision, precedent, statutes, ruling by lower court and ruling by present court. Each of the classifiers uses a supervised fastText model. As input tokens, the first classifier uses unigrams, the second bigrams and the third trigrams. Our trigram system attains an F-Score of 0.340, followed closely by the bigram system at 0.338, while the baseline scores 0.317.

Keywords
Rhetorical Role, Facts, Arguments, fastText

1. Introduction
Lawyers or law practitioners often have to consult relevant precedent cases and statutes while preparing legal reasoning for a court case. Since court documents are large in number, it would be beneficial to have an automated tool that assists lawyers in retrieving relevant previous cases and statutes [1]. In addition, court documents are generally very long and unstructured, often with no section or paragraph headings. This negatively impacts the readability of the documents, as identifying the most important segments such as facts, arguments and precedents tends to be difficult for the user. Hence there is a need to automatically identify and segment the documents into these meaningful parts to ease readability and allow lawyers timely access to the most crucial information when required.

The FIRE AILA 2021 Task 1, Rhetorical Role Labelling for Legal Judgements, was suggested as a way to mitigate the difficulty of searching in long and unstructured Indian court documents when the user is looking for specific sections in the documents [2, 3]. To accomplish this, the task suggests that the semantic function a sentence in the document is associated with has to be understood. This is termed rhetorical role labelling [4]. The task considers seven rhetorical labels inherent to legal documents, which are facts of the case, ruling by the lower court, argument, statute, precedent, ratio of the decision and ruling by the present court. The task was started as part of the FIRE 2020 AILA Track [4]. For that task, a training dataset of 50 documents containing 9,308 sentences in total, with rhetorical labels assigned by law experts, was used, while the test dataset consisted of an additional set of 10 case documents. The dataset was provided by [4]. The 21 runs submitted by the 9 teams employed different methods for rhetorical role labelling.
The best performing system in terms of F-Score and Recall was by team ju_nlp [5], who experimented with the transformer architecture RoBERTa (a state-of-the-art deep learning model) and a BiLSTM, with different numbers of training epochs for the different runs. The scores attained for F-Score and Recall were 0.468 and 0.501 respectively. Team heu_gjm [6] deployed TF-IDF features and deep semantic features using BERT, with different classifiers, namely Logistic Regression, Linear Kernel SVM and AdaBoost. The BERT model with Logistic Regression gave the best precision for the task at 0.541. Team double_liu [7] used bag-of-words based features with SVM and AdaBoost as classifiers. The team also used the BERT model, which outperformed all submitted systems in terms of accuracy at 0.619. Results from the task show that even with the use of complex deep learning methods, rhetorical labelling remains a difficult problem to solve, as none of the proposed methods achieves optimal performance.

In this work we attempt to address the rhetorical role labelling problem through the use of a fastText classifier. FastText is a linear classifier which has been shown to perform on par with deep learning algorithms in text classification while training at faster speeds and utilizing less processing power [8, 9]. In addition, our choice of the fastText model is motivated by its capability to handle out-of-vocabulary words, which can be useful when working with domain-specific corpora. Furthermore, the model allows the use of phrases as input tokens to preserve word order, a practice that has proven effective for classification problems [10, 11, 12]. Thus, alongside exploring the effectiveness of the fastText classifier in the detection of rhetorical roles, we further investigate the effectiveness of using bigrams and trigrams in improving classification accuracy.

2. Methodology
Rhetorical Labelling (RL) entails segmenting a document into several coherent sentences and assigning rhetorical roles to these sentences. A rhetorical role describes a semantic function that a sentence plays in a document. The task calls for the labelling of sentences into seven roles as follows: Facts, referring to the chronology of events that led to the filing of the case; Ruling by Lower Court; Arguments of the contending parties; relevant cited Statute; relevant cited Precedent; Ratio of the Decision, referring to the rationale/reasoning given for the final judgement; and Ruling by Present Court, referring to the final decision given by the court. For our study, the task is approached as a classification problem where each role is considered a class, and each instance of a sentence in a document is classified into only one of the classes.

In our experiments we deploy a supervised fastText text classifier trained on the provided Task 1 dataset. fastText (https://fasttext.cc/) is an open source toolkit developed for effective learning of text representations and text classification [8]. FastText incorporates the context of words in its embeddings, as surrounding words are taken into account when learning a word representation. Furthermore, fastText represents each word as a bag of character n-grams in addition to the word itself, which is useful for corpora with rare non-dictionary words. Text representations are obtained by averaging word representations. The representations are then fed into a linear classifier, and classes are determined by a loss function that computes a probability distribution over the predefined class labels.
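This is the formulation of Joulin et al. [8]: for a sentence represented by its bag of word and n-gram features $x_1, \dots, x_N$, the probability of label $k$ is computed as

$$p(y = k \mid x_1, \dots, x_N) = \operatorname{softmax}_k\!\left( B \, A \, \frac{1}{N} \sum_{i=1}^{N} x_i \right),$$

where $A$ is the embedding look-up matrix whose averaged rows form the text representation and $B$ is the weight matrix of the linear classifier; training minimises the negative log-likelihood of the annotated labels over the training sentences.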
By default, fastText accepts unigrams as input tokens; however, this can be varied to, for instance, bigrams or trigrams. The loss function generally used is the softmax, which can be changed to the hierarchical softmax to speed up training when the number of classes is large.

3. Experimental Setup

3.1. Dataset
The training dataset consists of 70 documents with a variable number of sentences of different lengths. The test data consists of 10 documents, also with a varying number of sentences. Each of the sentences in the training dataset is annotated with one of the seven classes. It was noted that the data is unbalanced, with an unequal distribution among the classes. The measures used for evaluation are Precision, Recall and F1 Score.

3.2. Platform
The Python programming language and its libraries are used for all experiments. The fastText open source library is used for classification.

3.3. Pre-Processing
Training data sentences are converted to lower case, contractions are expanded and punctuation is removed. The NLTK library is used to remove stop words, and the Porter Stemmer is used to stem the words. To conform to fastText input file requirements, each sentence is rearranged so that its class label, prefixed with "__label__", is placed at the start of the line. The final format for each sentence is shown in the example below:

__label__Facts none of her children survived her

The training data is then split into input text files for training and validation using a ratio of 70/30.

4. Runs Description
In our approach we consider the influence of word order in improving performance. We therefore train classifiers with identical parameters while varying the length of the word tokens. The submitted runs correspond to the classifier models obtained for the different input tokens. Each test sentence was pre-processed in the same way as the training data: converted to lower case, contractions expanded, stop words removed and words stemmed with the Porter Stemmer.

Table 1
UB_BW Results by Role

Run          Measure     Argument   Facts    Precedent   Ratio of the   Ruling by     Ruling by       Statute
                                                         Decision       Lower Court   Present Court
UB_BW RUN 1  Precision   0.3878     0.5236   0.191       0.6331         0.0           0.3721          0.0
             Recall      0.4872     0.4644   0.5075      0.4897         0.0           0.6154          0.0
             F-Score     0.4318     0.4922   0.2776      0.5523         0.0           0.4638          0.0
UB_BW RUN 2  Precision   0.4571     0.597    0.1823      0.6212         0.0           0.4857          0.0
             Recall      0.4103     0.5021   0.5224      0.5103         0.0           0.6538          0.0
             F-Score     0.4324     0.5455   0.2703      0.5603         0.0           0.5574          0.0
UB_BW RUN 3  Precision   0.4545     0.585    0.1943      0.6198         0.0           0.500           0.0
             Recall      0.3846     0.4895   0.5075      0.5446         0.0           0.6538          0.0
             F-Score     0.4167     0.533    0.281       0.5798         0.0           0.5667          0.0

4.1. UB_BW RUN 1
For our baseline, the fastText classifier model is trained for 25 epochs at a learning rate of 0.5 with wordNgrams set to unigrams.

4.2. UB_BW RUN 2
In an effort to improve performance, in our second run we use bigrams as input tokens, while the model's epochs and learning rate remain at 25 and 0.5 respectively. A slight improvement over the baseline is noticed in terms of both training accuracy and precision.

4.3. UB_BW RUN 3
In our third run the model's parameters are retained as per the two previous runs; however, the input tokens are set to trigrams. A negligible improvement over the second run is noted in terms of training accuracy and precision. Per-role results based on Precision, Recall and F-Score are shown in Table 1, while overall performance on the task is shown in Table 2.

5. Results and Analysis
The performance of our runs relative to the other teams' systems on the test data is shown in Table 2.
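Before turning to the comparative results, we note that the three runs differ only in the wordNgrams parameter passed to fastText. The following is a minimal sketch of the pre-processing and training pipeline described in Sections 3.3 and 4, assuming the fasttext and nltk Python packages; the contraction map and the file names train.txt and valid.txt are illustrative placeholders rather than our exact setup.

import string

import fasttext  # pip install fasttext
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

# Illustrative contraction map; the expansion resource is not specified in the paper.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def preprocess(sentence):
    """Lower-case, expand contractions, strip punctuation, remove stop words, stem."""
    text = sentence.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [STEMMER.stem(t) for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def to_fasttext_line(label, sentence):
    """Rearrange a (label, sentence) pair into fastText's __label__ input format."""
    return "__label__" + label + " " + preprocess(sentence)

# Train the three runs: identical epochs and learning rate, varying token length.
for run, ngrams in [(1, 1), (2, 2), (3, 3)]:
    model = fasttext.train_supervised(
        input="train.txt",   # 70% split, one "__label__<role> <text>" per line
        epoch=25,
        lr=0.5,
        wordNgrams=ngrams,   # 1 = unigrams, 2 = bigrams, 3 = trigrams
    )
    # test() returns (sample count, precision@1, recall@1) on the 30% validation split.
    print("UB_BW RUN", run, model.test("valid.txt"))

Here model.test gives a quick per-run check of validation precision and recall at one predicted label per sentence.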
For our baseline system we used unigrams as input tokens, while for the second and third systems bigrams and trigrams were used respectively. It can be observed from the results that the trigram-based system, UB_BW RUN 3, performed much better than the unigram baseline, UB_BW RUN 1, across all measures. However, a negligible difference is noticed between the trigram system and the bigram system, UB_BW RUN 2. A category-wise analysis of the results, as shown in Table 1, indicates that all systems performed poorly in predicting labels for the classes Ruling by Lower Court and Statute. It can also be noted that for the classes Argument and Facts the bigram system outperformed the trigram system in terms of F-Score, while the trigram system attained the best F-Score for the remaining classes, namely Precedent, Ratio of the Decision and Ruling by Present Court.

Table 2
Results of Task 1: Rhetorical Role Labelling for Legal Judgements

RUN ID                        PRECISION   RECALL   F-SCORE
RUSTIC RUN 1                  0.548       0.616    0.557
RUSTIC RUN 2                  0.528       0.619    0.551
RUSTIC RUN 3                  0.511       0.627    0.549
MINITRUE RUN 1                0.485       0.572    0.517
ARGUABLY RUN 1                0.465       0.591    0.505
MINITRUE RUN 3                0.461       0.57     0.503
MINITRUE RUN 2                0.46        0.565    0.501
SSN_NLP RUN 2                 0.451       0.571    0.491
ARGUABLY RUN 2                0.45        0.586    0.491
SSN_NLP RUN 3                 0.438       0.571    0.475
NITS LEGAL RUN 2              0.453       0.464    0.451
NITS LEGAL RUN 1              0.441       0.434    0.428
SSN_NLP RUN 1                 0.411       0.539    0.409
LEGAL AI 2021 RUN 1           0.394       0.361    0.364
UB_BW RUN 3                   0.336       0.369    0.340
UB_BW RUN 2                   0.335       0.371    0.338
CHANDIGARH CONCORDIA RUN 3    0.317       0.488    0.329
CHANDIGARH CONCORDIA RUN 2    0.317       0.485    0.327
UB_BW RUN 1                   0.301       0.366    0.317
CHANDIGARH CONCORDIA RUN 1    0.29        0.476    0.298
LEGAL NLP RUN 3               0.225       0.227    0.22
CEN NLP RUN 2                 —           0.309    0.199
LEGAL NLP RUN 1               0.197       0.217    0.196
LEGAL NLP RUN 2               0.198       0.215    0.192
CEN NLP RUN 1                 0.179       0.194    0.179
NIT AGARTALA RUN 1            0.192       0.22     0.179

6. Discussion and Conclusion
In this paper we explored the effectiveness of using phrases with the fastText classifier to assign rhetorical labels to sentences in a court case document. While our systems did not give good overall performance, we believe that with enhancements and more training data the fastText classifier has the potential to benefit the rhetorical labelling task. We also observe improved performance with the introduction of bigrams and trigrams in the model, which indicates that phrases can have a positive influence in a classification task. Going forward, we aim to further investigate the influence of phrases in improving text classification by performing empirical evaluations with various models.

References
[1] P. Bhattacharya, P. Mehta, K. Ghosh, S. Ghosh, A. Pal, A. Bhattacharya, P. Majumder, Overview of the FIRE 2020 AILA track: Artificial intelligence for legal assistance, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 1–11.
[2] V. Parikh, U. Bhattacharya, P. Mehta, A. Bandyopadhyay, P. Bhattacharya, K. Ghosh, S. Ghosh, A. Pal, A. Bhattacharya, P. Majumder, Overview of the third shared task on artificial intelligence for legal assistance at FIRE 2021, in: FIRE (Working Notes), 2021.
[3] V. Parikh, U. Bhattacharya, P. Mehta, A. Bandyopadhyay, P. Bhattacharya, K. Ghosh, S. Ghosh, A. Pal, A. Bhattacharya, P. Majumder, FIRE 2021 AILA track: Artificial intelligence for legal assistance, in: Proceedings of the 13th Forum for Information Retrieval Evaluation, 2021.
[4] P. Bhattacharya, S. Paul, K. Ghosh, S. Ghosh, A.
Wyner, Identification of rhetorical roles of sentences in Indian legal judgments, in: M. Araszkiewicz, V. Rodríguez-Doncel (Eds.), Legal Knowledge and Information Systems - JURIX 2019: The Thirty-second Annual Conference, Madrid, Spain, December 11-13, 2019, volume 322 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2019, pp. 3–12.
[5] S. B. Majumder, D. Das, Rhetorical role labelling for legal judgements using RoBERTa, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 22–25.
[6] J. Gao, H. Ning, Z. Han, L. Kong, H. Qi, Legal text classification model based on text statistical features and deep semantic features, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 35–41.
[7] L. Liu, L. Liu, Z. Han, Query revaluation method for legal information retrieval, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 18–21.
[8] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, CoRR abs/1607.01759 (2016). arXiv:1607.01759.
[9] V. Zolotov, D. Kung, Analysis and optimization of fastText linear text classifier, CoRR abs/1702.05531 (2017). arXiv:1702.05531.
[10] R. Johnson, T. Zhang, Effective use of word order for text categorization with convolutional neural networks, CoRR abs/1412.1058 (2014). arXiv:1412.1058.
[11] C. Chang, M. Masterson, Using word order in political text classification with long short-term memory models, Political Analysis 28 (2020) 395–411.
[12] S. Jameel, W. Lam, L. Bing, Supervised topic models with word order structure for document classification and retrieval learning, Inf. Retr. J. 18 (2015) 283–330.