<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DistilRoBERTa Based Sentence Embedding for Rhetorical Role Labelling of Legal Case Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Deepthi Sudharsan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asmitha U</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Premjith B</string-name>
          <email>b_premjith@cb.amrita.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman K.P</string-name>
          <email>kp_soman@amrita.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham</institution>
          ,
          <country>India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>In a country like India, with a very dense and growing population, the number of legal judgements filed keeps increasing every year. With the increasing number of legal case documents, a systematic and structured organization of the files is essential for the smooth running of the legal system. As a part of AILA 2021, assigning rhetorical roles to the sentences of legal documents was given as a shared task to automate the process. Deep learning and machine learning models help achieve this task with ease and minimal error. For efficient information retrieval and classification, preprocessing and embedding techniques such as sentence transformers are discussed in the paper. Artificial Neural Networks performed the best and were consequently used to further evaluate and improve the prediction of the rhetorical roles. In comparison to the other machine learning and deep learning models trained for the task, a basic Artificial Neural Network with one hidden layer and 1024 × 2 neurons gave the maximum validation accuracy of 85.18% and a testing precision of 30.9%.</p>
      </abstract>
      <kwd-group>
        <kwd>Documents</kwd>
        <kwd>Rhetorical Role labelling</kwd>
        <kwd>distilroberta-base</kwd>
        <kwd>Artificial Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        For the efficient working and smooth administration of the court of law, an organized and
efficient structure for storing legal case documents is obligatory. Manual examination of legal
judgments provided by higher courts or legal officials for the acquisition of cardinal information
can be a cumbersome and error-prone process. As a result, automatic information retrieval from
legal court case transcripts and the use of deep learning techniques to classify those judgments
would provide several advantages to individuals working in the legal services industry [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Deep learning networks prove to be efficient at
ensuring the easy readability of legal judgments and at classifying the documents based on
their common thematic rhetorical roles such as “Facts of the Case”, “Issues being discussed” and
“Arguments given by the parties” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
Artificial Intelligence for Legal Assistance (AILA 2021) track [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] came up with a task that aims to
classify the rhetorical role of each sentence in a legal case document into one of seven predefined
roles.
      </p>
      <p>To come up with an efficient and less error-prone solution for the task, various machine
learning and deep learning models were trained with and without hyperparameter tuning using
GridSearchCV, although the tuning later proved to be ineffective for this task. Among the machine learning
models that were trained, namely K-Nearest Neighbor (KNN), Decision Tree, Random Forest,
Naive Bayes, Multi Layer Perceptron (MLP) and Support Vector Machine (SVM), SVM proved to
be the most accurate in predicting the roles, with an accuracy of 53%. All the deep learning models
that were trained, namely Long Short-Term Memory (LSTM) networks, Artificial Neural Network
(ANN) and Convolutional Neural Network (CNN), performed significantly better than the
machine learning models, with the ANN performing better than all the other models trained
for the task, at a validation accuracy of about 85.18% on the training dataset. The single
layer ANN model was further evaluated in two different runs on the testing dataset, and the
performance was analyzed.</p>
      <p>The paper is broadly divided into the following sections: Section 2 introduces related research
in the field of legal document retrieval; Section 3 provides the dataset information; Section 4
explains the methodology proposed for the task; Section 5 discusses the evaluation outcomes.
Finally, Section 6 concludes the work with some suggestions for further improvements for better
outcomes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        In paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], GloVe, Doc2Vec and Term Frequency-Inverse Document Frequency (TFIDF) based
methods were used to perform the labelling of Rhetorical Roles for Legal Judgements given
in AILA 2020 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Manual annotation is heavily used in the automatic labelling of the
rhetorical roles of sentences. Other works deal with the process of annotation, such as producing a
set of rules for annotation and inter-annotator research, whereas papers that aim
to automate the task of semantic labelling also perform an annotation analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Classifier
models such as fastText have also been proposed as an approach for searching through legal
facts from case documents as discussed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the BM25 ranking algorithm was used to
identify relevant prior cases for a given situation based on best matches. Similar work
was done in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as a part of AILA 2019 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] while additionally using cosine similarity
and Jaccard similarity. With the rise of deep learning applications for legal
information retrieval, a high demand for neural network based classification is reflected in
many works [12].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Description</title>
      <p>For the given task, the provided training data set consists of over 60 case documents and the
rhetorical roles for the sentences in each document. The predefined rhetorical roles that were
to be predicted are mentioned in Table 1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>A model that can predict the rhetorical roles of sentences in legal documents with minimal errors needed to be
designed. After the sentences were retrieved from all the documents and
preprocessed, they were embedded using a pre-trained model available in the Hugging Face Transformers library<sup>1</sup>.
A variety of machine learning and deep learning models were trained
and tested to find the optimal model for the task. The proposed
methodology is shown in Figure 1.</p>
      <p>Table 1: Rhetorical role classes (columns: Class, Occurrence).</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>In the data cleaning process, numerical characters, punctuation and extra white spaces were
removed using a regular expression package. The sentences were converted to lowercase
to maintain uniformity, and stop words were then removed using the NLTK library<sup>2</sup>. During
exploratory analysis, it was found that 69 sentences had not been assigned a role, and these were dropped.
The non-numerical labels were encoded as numerical attributes using a label encoder.</p>
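        <p>The cleaning and label-encoding steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the stop-word set is a small hypothetical stand-in for the NLTK English list, the regular expression keeps only ASCII letters and whitespace, and the helper names (clean_sentence, encode_labels, preprocess) are assumptions introduced here.</p>
        <preformat><![CDATA[
```python
import re

# Assumption: a tiny illustrative stop-word set standing in for NLTK's English list.
STOP_WORDS = {"the", "of", "a", "an", "is", "was", "in", "to", "and", "by", "that"}


def clean_sentence(text):
    """Lowercase, strip digits and punctuation, collapse spaces, drop stop words."""
    text = text.lower()
    # Keep only ASCII letters and whitespace; digits and punctuation become spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # split() also collapses runs of whitespace
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def encode_labels(labels):
    """Map each distinct label to an integer, using sorted label order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def preprocess(rows):
    """rows: list of (sentence, label or None). Rows without a label are dropped."""
    kept = [(clean_sentence(s), label) for s, label in rows if label]
    sentences = [s for s, _ in kept]
    encoded, classes = encode_labels([label for _, label in kept])
    return sentences, encoded, classes
```
]]></preformat>
        <p>The list of classes returned by encode_labels plays the role of the label encoder's inverse transform: indexing it with a predicted integer gives back the rhetorical role name.</p>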
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Embedding</title>
        <p>Sentence embedding was accomplished using the pretrained sentence transformer distilroberta-base
[13] [14]. distilroberta-base is a pretrained model available through the Hugging Face library that uses contextual
relations between the words to yield contextualized vector embeddings.</p>
        <p><sup>1</sup> https://huggingface.co/transformers/</p>
        <p><sup>2</sup> https://www.nltk.org/</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model Training</title>
        <p>Initially, machine learning models such as KNN, Decision Tree, Random Forest, Naive Bayes, MLP and SVM
were trained, and the SVM classifier was found to have the maximum accuracy of 53%. To improve
the classification accuracy, deep learning models such as LSTM (Long Short Term Memory)
[15], ANN (Artificial Neural Networks) and CNN (Convolutional Neural Networks) [16] were
trained. Out of the three deep learning models, the ANN had a comparatively higher accuracy of
about 99%. While training, to address the class imbalance in the data set (see Table 1), class weights [17] were
generated and passed as an input parameter to the models. To improve the overall performance
of the models, GridSearchCV<sup>3</sup> was used to search for the best parameters, but
the parameters it yielded gave no improvement in accuracy.
Hence, the ANN model was chosen to perform the classification task. The structure of the selected
ANN model is depicted in Figure 2.</p>
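        <p>The paper does not state how the class weights were generated. As one plausible illustration, the sketch below assumes the common "balanced" heuristic, in which the weight of class c is N / (K × n_c) for N samples, K classes and n_c samples of class c. The function name balanced_class_weights is hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count).

    Assumption: the "balanced" heuristic; the paper only says that class
    weights were generated and passed to the models.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy label distribution: class 2 is rare, so it receives the largest weight.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = balanced_class_weights(toy_labels)
```
]]></preformat>
        <p>With these weights, rare classes contribute more to the loss per sample, and the weighted sample count still sums to the original number of samples.</p>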
        <p>The artificial neural network used has one hidden layer, which is connected to the input
and output layers. The embedded vectors pass through the input layer and the neurons of the
hidden layer, and the predicted class is then decoded using the inverse transform of the label encoder.</p>
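        <p>A forward pass through such a single-hidden-layer network, followed by decoding of the predicted index, might look like the sketch below. It is only an illustration under stated assumptions: the ReLU and softmax activations, the random untrained weights, the toy layer sizes (8 inputs, 16 hidden units) and the ordering of the seven role names are not taken from the paper, and all function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases (untrained; for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_role(embedding, hidden, output, classes):
    """Input -> hidden (ReLU) -> output (softmax), then decode the class index."""
    h = relu(dense(embedding, *hidden))
    probs = softmax(dense(h, *output))
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs  # classes[best] mimics the inverse transform


rng = random.Random(0)
# Assumed ordering of the seven roles named in the paper.
ROLES = ["Argument", "Facts", "Precedent", "Ratio of the decision",
         "Ruling by lower court", "Ruling by Present Court", "Statute"]
hidden_layer = init_layer(16, 8, rng)   # toy sizes; the paper uses 1024 x 2 neurons
output_layer = init_layer(len(ROLES), 16, rng)
toy_embedding = [rng.uniform(-1.0, 1.0) for _ in range(8)]
role, probs = predict_role(toy_embedding, hidden_layer, output_layer, ROLES)
```
]]></preformat>
        <p>In practice the weights would be learned from the sentence embeddings and encoded labels; the sketch only shows how data flows through the layers.</p>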
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussions</title>
      <p>The accuracies recorded for different numbers of layers and neurons after training for 32 epochs
are compared in Table 2.</p>
      <p><sup>3</sup> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</p>
      <p>As observed, for one layer without dropout, the training and validation accuracies are
substantially high. However, with 1024 neurons the model was overfitting, and hence
1024 × 2 neurons were used. The ANN model was run using these optimal parameters for two
cases.</p>
      <p>In the first run, both the train and test data were embedded together whereas in the second
run they were embedded separately.</p>
      <p>From Table 3, it is observed that Run 2 performed better, with a precision of 30.9%, than
Run 1, which gave a precision of just 17.9%. Hence, embedding the training and testing data
separately achieved better results.</p>
      <p>Table 4 shows the category-wise comparison of the precision, recall and F-score metrics
for both runs. The single layered ANN architecture that was used predicted the “Ruling
by lower court” and “Statute” labels incorrectly in both runs. Run 2 was able to predict
“Argument”, “Facts” and “Ratio of the decision” better than Run 1. It is also observed that Run 2
was able to correctly predict “Ruling by Present Court”. Unlike the trend shown by the other labels,
the rhetorical role “Precedent” was predicted better by Run 1 than by Run 2.</p>
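      <p>For reference, category-wise precision, recall and F-score of the kind reported in Table 4 can be computed as in the following sketch. The macro average (an unweighted mean over classes) is an assumption, since the paper does not say how the overall scores were aggregated, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def per_class_scores(y_true, y_pred, classes):
    """Return {class: (precision, recall, f_score)} from true and predicted labels."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f_score)
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, f_score) over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))
```
]]></preformat>
      <p>A class that is never predicted correctly, as reported for “Statute” in both runs, scores zero on all three metrics and pulls the macro average down.</p>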
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper describes the systematic approach undertaken to predict the rhetorical
roles of sentences in legal documents using multiple machine learning and deep learning techniques. A basic
single layered ANN trained on embeddings from the pre-trained sentence transformer
distilroberta-base achieved a testing precision of 30.9%. The work can be further extended
by using alternate methods such as the BM25 ranking algorithm and other embedding methods such as
TFIDF or fastText to improve the overall prediction accuracy.</p>
      <p>[12] S. Mandal, S. D. Das, Unsupervised identification of relevant cases &amp; statutes using word
embeddings, in: FIRE, 2019.</p>
      <p>[13] J. Du, E. Grave, B. Gunel, V. Chaudhary, O. Çelebi, M. Auli, V. Stoyanov, A. Conneau,
Self-training improves pre-training for natural language understanding, in: NAACL, 2021.</p>
      <p>[14] A. Barua, S. Thara, B. Premjith, K. Soman, Analysis of contextual and non-contextual word
embedding models for hindi ner with web application for data collection, in: International
Advanced Computing Conference, Springer, 2020, pp. 183–202.</p>
      <p>[15] B. Premjith, K. Soman, Deep learning approach for the morphological synthesis in
malayalam and tamil at the character level, Transactions on Asian and Low-Resource Language
Information Processing 20 (2021) 1–17.</p>
      <p>[16] T. T. Sasidhar, B. Premjith, K. Soman, Emotion detection in hinglish (hindi+ english)
code-mixed social media text, Procedia Computer Science 171 (2020) 1346–1352.</p>
      <p>[17] B. Premjith, K. P. Soman, P. Poornachandran, Amrita_cen@ fact: Factuality identification
in spanish text., in: IberLEF@ SEPLN, 2019, pp. 111–118.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rathnayake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rupasinghe</surname>
          </string-name>
          , N. de Silva,
          <string-name>
            <given-names>M.</given-names>
            <surname>Warushavithana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gamage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perera</surname>
          </string-name>
          ,
          <article-title>Classifying sentences in court case transcripts using discourse and argumentative properties</article-title>
          ,
          <source>International Journal on Advances in ICT for Emerging Regions (ICTer)</source>
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          .
          <pub-id pub-id-type="doi">10.4038/icter.v12i1.7200</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wyner</surname>
          </string-name>
          ,
          <article-title>Identification of rhetorical roles of sentences in indian legal judgments</article-title>
          ,
          <source>in: Legal Knowledge and Information Systems: JURIX 2019: The Thirty-second Annual Conference</source>
          , volume
          <volume>322</volume>
          , IOS Press,
          <year>2019</year>
          , p.
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Parikh</surname>
          </string-name>
          , U. Bhattacharya,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Overview of the third shared task on artificial intelligence for legal assistance at fire 2021</article-title>
          , in: FIRE (Working Notes),
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Parikh</surname>
          </string-name>
          , U. Bhattacharya,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Fire 2021 aila track: Artificial intelligence for legal assistance</article-title>
          ,
          <source>in: Proceedings of the 13th Forum for Information Retrieval Evaluation</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Almuslim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Inkpen</surname>
          </string-name>
          ,
          <article-title>Document level embeddings for identifying similar legal cases and laws</article-title>
          , in: FIRE,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Overview of the fire 2020 aila track: Artificial intelligence for legal assistance</article-title>
          ,
          <source>in: FIRE (working notes)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Šavelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Segmenting U.S. court decisions into functional and issue specific parts</article-title>
          ,
          <source>in: JURIX</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Nejadgholi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Bougueng</given-names>
            <surname>Tchemeube</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Witherspoon</surname>
          </string-name>
          ,
          <article-title>A semi-supervised training method for semantic search of legal facts in canadian immigration cases</article-title>
          ,
          <year>2017</year>
          .
          <pub-id pub-id-type="doi">10.3233/978-1-61499-838-9-125</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          , A. De,
          <string-name>
            <given-names>T.</given-names>
            <surname>Saikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <article-title>Iitp at aila 2019: System report for artificial intelligence for legal assistance shared task</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kayalvizhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aravindan</surname>
          </string-name>
          ,
          <article-title>Legal assistance using word embeddings</article-title>
          ,
          <source>in: FIRE</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wyner</surname>
          </string-name>
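          ,
          <article-title>Identification of rhetorical roles of sentences in indian legal judgments</article-title>
          ,
          <source>in: Proc. International Conference on Legal Knowledge and Information Systems (JURIX)</source>
          ,
          <year>2019</year>
          .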
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>