<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Labeling for Legal Judgments using Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Racchit Jain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhishek Agarwal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <email>yash@pilani.bits-pilani.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani</institution>
          ,
          <addr-line>Pilani</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the methodologies implemented while performing multi-class text classification for rhetorical role labeling of legal judgments for task 2 of the track 'Artificial Intelligence for Legal Assistance' proposed by the Forum for Information Retrieval Evaluation in 2020. A transformer-based language model (RoBERTa), pretrained over an extensive English language corpus and fine-tuned on legal judgments from the Supreme Court, is presented along with its evaluation on the standard classification metrics: precision, recall, and F-score.</p>
      </abstract>
      <kwd-group>
        <kwd>text classification</kwd>
        <kwd>transformers</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>rhetorical role labeling</kwd>
        <kwd>AILA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the legal context, the phrase "rhetorical role labeling" refers to identifying the
semantic function that a sentence serves in a legal document. For example, legal case documents
contain facts, precedents, rulings by the court, etc., which together form the document's
semantic structure. With the rapid increase in the digitization of legal documents, automating
the detection of the rhetorical roles of sentences in a legal case document can facilitate
summarization, organization of case statements, case analysis, etc. Due to the high specificity
and technical jargon of legal documents, rhetorical role labeling is an extremely challenging
NLP task. Previous work has focused on handcrafted features and linguistic approaches to this
classification problem. Bhattacharya et al.[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced a neural model (Hierarchical BiLSTM
CRF Classifier) which outperformed the previous approaches by a large margin by making use
of pretrained legal embeddings.
      </p>
      <p>
        The ’Artificial Intelligence for Legal Assistance’ track proposed by FIRE 2020[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] comprised
two tasks. This paper discusses task 2 of this track, 'Rhetorical Role Labeling for Legal
Judgments'. Each team was provided with an annotated dataset of 50 Supreme Court legal
documents. Every sentence in a document was assigned one of seven labels: Facts, Ruling by
Lower Court, Argument, Statute, Precedent, Ratio of the decision, and Ruling by Present Court.
The presented approach achieved an overall 9th rank among all the submitted runs according to
the evaluated macro F-score, with a score of 0.442, compared to an F-score of 0.468 for the
1st-ranked run.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Bhattacharya et al.[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] concluded in their paper that deep learning-based models perform much
better at this task than approaches using only handcrafted linguistic features; their approach made
use of a hierarchical BiLSTM model. Attempts prior to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] involved handcrafted feature extraction.
For example, J. Savelka et al.[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used CRF[
        <xref ref-type="bibr" rid="ref4">4</xref>
         ] and machine learning techniques on handcrafted
features to automatically segment judgments into functional and issue-specific parts.
Nejadgholi et al.[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a skip-gram model[
        <xref ref-type="bibr" rid="ref6">6</xref>
         ] and a semi-supervised approach for the semantic search
of legal facts in case documents using a classifier model from the fastText[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] library.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The training dataset provided by AILA 2020 contained 50 annotated legal text documents,
wherein every sentence was assigned a particular rhetorical role, as explained below:
1. Facts (FAC): sentences that contain information related to the timeline of events that
resulted in the case filing.
2. Ruling by Lower Court (RLC): the cases in the dataset were given a preliminary ruling by
the lower courts (Tribunal, High Court, etc.); these sentences correspond to the ruling/decision
given by those lower courts.
3. Argument (ARG): sentences that convey the arguments of the opposing parties.
4. Statute (STA): relevant statutes cited.
5. Precedent (PRE): relevant precedents cited.
6. Ratio of the decision (Ratio): sentences that give the rationale/reasoning behind the
Supreme Court's final judgment.
7. Ruling by Present Court (RPC): sentences that state the final decision given by the
Supreme Court for that case document.</p>
      <p>The training data contained 9380 samples. The training data, however, was not balanced in
terms of the number of samples per label, as shown in Table 1, so the metric scores of labels with more
samples are expected to be better than those of the others. The test data contained 10 legal text documents
without annotation, with a total of 1905 samples.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Technique</title>
      <p>
        This method utilizes a Transformer-based model for the classification task. The
proposed model uses a pretrained RoBERTa[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (a Robustly Optimized BERT Pretraining
Approach) encoder with an extra linear layer added on top of the pretrained base model.
RoBERTa was designed as an improvement over BERT[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] by refining the masked language
modeling procedure and significantly increasing the amount of training data.
      </p>
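      <p>
        A minimal sketch of this architecture, assuming the Hugging Face transformers library and
PyTorch (the names and details below are illustrative, not the exact code of the submitted run):
      </p>
      <preformat>
# Sketch: a pretrained RoBERTa base encoder with one linear classification layer
# on top, producing one of the seven rhetorical-role labels per sentence.
import torch.nn as nn
from transformers import RobertaModel

class RobertaSentenceClassifier(nn.Module):
    def __init__(self, num_labels=7):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the representation of the first (&lt;s&gt;) token as the sentence embedding.
        cls_repr = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_repr)
      </preformat>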
      <sec id="sec-4-1">
        <title>4.1. Pretraining</title>
        <p>The base model was pretrained in a self-supervised manner with a masked language
modeling (MLM) objective. The model takes a sequence of words as input and masks 15%
of them at random. The masked sequence is then fed into the model, which tries to predict the
masked words. This pretraining approach lets the model learn an internal, bidirectional
representation of the English language; such "language modeling" is beneficial for extracting
features for downstream tasks, which in our scenario is text classification. The datasets on which the
base model was pretrained are as follows:
• BookCorpus
• Stories
• CC-News
• OpenWebText
• English Wikipedia
The above datasets consist of large amounts of unfiltered internet data. Therefore
the training data is not entirely neutral, and there are bound to be some biases in the model's
representation of the English language.</p>
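        <p>
          As an illustration of the MLM objective, the sketch below uses the Hugging Face
transformers fill-mask pipeline with the pretrained roberta-base checkpoint; the example sentence
is illustrative and not drawn from the dataset:
        </p>
        <preformat>
# Sketch: masked language modeling with pretrained RoBERTa.
# The model predicts the token hidden behind &lt;mask&gt;.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")
predictions = fill_mask("The court dismissed the &lt;mask&gt; filed by the appellant.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
        </preformat>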
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Tokenization</title>
        <p>
          A pretrained RoBERTa tokenizer has been used to obtain the input ids and corresponding attention
masks for each sentence. The tokenizer uses a byte-level variant of Byte-Pair Encoding (BPE)[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
The tokenizer adds some special tokens to the input sequence. For example, a tokenized
sentence always starts with the &lt;s&gt; token and ends with the &lt;/s&gt; token.
During pretraining, 15% of the tokens in the sequence are masked. The masked tokens are then processed as follows:
• 80% of them are replaced with a &lt;mask&gt; token
• 10% are replaced with a random token different from the original one.
        </p>
        <p>• The rest 10% are left as is.</p>
        <p>As opposed to BERT, the model implements a dynamic masking approach, so the masked
tokens change across training epochs, making the learned representations more robust for downstream tasks.</p>
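        <p>
          A brief sketch of this tokenization step, assuming the Hugging Face RobertaTokenizer
(the padding length and example sentence are illustrative assumptions):
        </p>
        <preformat>
# Sketch: obtaining input ids and attention masks with a pretrained RoBERTa tokenizer.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoded = tokenizer(
    "The appeal was dismissed by the High Court.",
    padding="max_length",  # pad to a fixed length
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
print(encoded["input_ids"])       # begins with the &lt;s&gt; id (0) and ends with &lt;/s&gt; (2) before padding
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
        </preformat>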
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Fine-Tuning over a Legal Corpus</title>
        <p>
          In the submitted run, the original pretrained base RoBERTa model in[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was modified by adding
a single linear layer on top for classification, which was used as a sentence classifier. Training
data was fed into the model in batches of 16, and the entire pretrained RoBERTa model and the
additional untrained classification layer were fine-tuned for the particular downstream task of
classifying the sentences of the legal documents into one of the seven rhetorical roles. The Adam optimizer[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
was used during training with a learning rate of 2e-5 and an epsilon value of 1e-8. The seed
value was set to 42, and the model was fine-tuned for 4 epochs. The norms of the gradients
were clipped to 1.0 to help prevent the exploding gradient problem.
        </p>
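        <p>
          A condensed sketch of this fine-tuning procedure with the hyperparameters listed above,
assuming PyTorch and the Hugging Face transformers library (train_loader is a hypothetical
DataLoader yielding tokenized batches of 16, not the authors' code):
        </p>
        <preformat>
# Sketch: fine-tuning RoBERTa-base plus a classification layer on the AILA sentences.
# Hyperparameters follow the text: batch size 16, lr 2e-5, eps 1e-8, seed 42,
# 4 epochs, gradient-norm clipping at 1.0.
import torch
from transformers import RobertaForSequenceClassification

torch.manual_seed(42)
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=7)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, eps=1e-8)

model.train()
for epoch in range(4):
    for batch in train_loader:  # assumed DataLoader with batch_size=16
        optimizer.zero_grad()
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])
        outputs.loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # avoid exploding gradients
        optimizer.step()
        </preformat>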
        <p>Although RoBERTa is trained on a dataset significantly larger than BERT's, its computation
time can be reduced by applying techniques such as distillation and pruning, which lead to a smaller
network.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Evaluation</title>
      <p>The submitted model achieved an overall rank of 9 among all the submitted runs based on the
macro F-score. The model was evaluated on the basis of classic classification metrics: macro-averaged
precision, recall and F-score. The metrics were calculated for each label for every document and
then averaged over all the documents to get the overall results: precision 0.485, recall 0.483
and F-score 0.442. The document-wise metrics can be seen in Table 2. It can be observed that
documents d1, d3 and d5 perform better than the rest of the documents; these documents contain
fewer test samples for the labels with little training data, such as RLC, RPC and STA.</p>
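      <p>
        A sketch of how such document-wise macro scores can be computed, assuming scikit-learn;
per_document is a hypothetical mapping from document ids to gold and predicted label lists:
      </p>
      <preformat>
# Sketch: macro precision/recall/F-score per document, averaged over documents.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def overall_scores(per_document):
    """per_document: dict mapping doc id -> (gold_labels, predicted_labels)."""
    scores = []
    for doc_id, (gold, pred) in per_document.items():
        p, r, f, _ = precision_recall_fscore_support(
            gold, pred, average="macro", zero_division=0)
        scores.append((p, r, f))
    return np.mean(scores, axis=0)  # (precision, recall, F-score) averaged over documents
      </preformat>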
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>
        The Transformer approach was chosen over models that process data sequentially
(like LSTMs[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or BiLSTMs) to allow parallelism. The proposed model captures the English
language context well since it was pretrained over a large corpus. However, the model is limited
by the specificity and the jargon encountered while fine-tuning it on the legal judgments
dataset. The model performs well on documents 1, 3 and 5, which contained fewer test samples
for the labels that had little training data; therefore, given a well-balanced dataset to
fine-tune on, the model could achieve better results than those reported above. Future work on this
approach could be to pretrain the language model on an extensive legal corpus.
The pretrained tokenizer was also trained on generic English language data; pretraining the
tokenizer on a legal dataset would improve the quality of the embeddings given to the
transformer model.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wyner</surname>
          </string-name>
          ,
          <article-title>Identification of rhetorical roles of sentences in Indian legal judgments</article-title>
          ,
          <year>2019</year>
          .
          arXiv:1911.05405.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Overview of the FIRE 2020 AILA track: Artificial Intelligence for Legal Assistance</article-title>
          ,
          <source>in: Proceedings of FIRE 2020 - Forum for Information Retrieval Evaluation</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Segmenting U.S. court decisions into functional and issue specific parts</article-title>
          ,
          <source>Frontiers in Artificial Intelligence and Applications</source>
          <volume>313</volume>
          (
          <year>2018</year>
          )
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          . URL: https://doi.org/10.3233/978-1-61499-935-5-111. doi:10.3233/978-1-61499-935-5-111.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          , in: ICML,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Nejadgholi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bougueng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Witherspoon</surname>
          </string-name>
          ,
          <article-title>A semi-supervised training method for semantic search of legal facts in canadian immigration cases</article-title>
          , in: JURIX,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <year>2013</year>
          .
          arXiv:1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Bag of tricks for efficient text classification</article-title>
          ,
          <source>ArXiv abs/1607.01759</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <source>ArXiv abs/1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: NAACL-HLT</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fukamachi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Takeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shinohara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shinohara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arikawa</surname>
          </string-name>
          ,
          <article-title>Byte pair encoding: a text compression scheme that accelerates pattern matching</article-title>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <year>2014</year>
          .
          arXiv:1412.6980.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . URL: https://doi.org/10.1162%2Fneco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>