The Language Model for Legal Retrieval and BERT-based Model for Rhetorical Role Labeling for Legal Judgments

Yujie Xu (b), Tang Li (b), Zhongyuan Han (a,*)

(a) Foshan University, Foshan, China
(b) Heilongjiang Institute of Technology, Harbin, China

Abstract
This paper describes our solutions to the two tasks of the AILA track at FIRE 2020 (Forum for Information Retrieval Evaluation). Task 1 (statute and precedent retrieval) asks, for a given query (the description of a legal situation), to identify relevant statutes and prior cases. It consists of two subtasks: Task 1a (identifying relevant prior cases) and Task 1b (identifying relevant statutes). For both subtasks, we use language models to score the candidate documents for each query and rank them by score. Task 2 (rhetorical role labeling for legal judgments) requires classifying sentences; we treat it as a multi-class classification problem and use BERT to perform the classification. In the final evaluation, our best runs achieve a MAP of 0.125 on Task 1a, a MAP of 0.2003 on Task 1b, and an accuracy of 0.549 on Task 2. The results and experiments show that the language model is an effective way to address Task 1 and that BERT works well for Task 2.

Keywords
Legal Retrieval, Rhetorical Role Labeling, Language Model, BERT

Forum for Information Retrieval Evaluation 2020, December 16–20, 2020, Hyderabad, India
EMAIL: 1520207872xyj@gmail.com (Y. Xu); itangkk@gmail.com (T. Li); hanzhongyuan@gmail.com (Z. Han, *corresponding author)
ORCID: 0000-0001-8960-9872 (Z. Han)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

With the gradual maturation of the legal system, laws and regulations have become more detailed and standardized, and people's demand for legal aid is steadily increasing. Compared with the low efficiency of manual legal aid, artificial-intelligence-based legal assistance offers advantages such as higher efficiency and higher accuracy. To promote such assistance, FIRE 2020 organized the AILA 2020 track (Artificial Intelligence for Legal Assistance). For the two subtasks of Task 1, the organizers provide 10 short descriptions of legal situations (queries), 3,000 judgments delivered by the Supreme Court of India, and 197 statutes (Sections of Acts) from Indian law; the goal is to retrieve the most relevant case documents or statutes for a given query. For Task 2, they provide 8,096 sentences as training data and 1,905 sentences as test data. Each training sentence is labeled with one of seven semantic segments / rhetorical roles: Fact, Ruling by Lower Court, Argument, Statute, Precedent, Ratio of the Decision, and Ruling by Present Court. We are required to assign each of the 1,905 test sentences to one of these seven categories.

2. Methods

2.1 Methods for Task1a

Fig. 1 describes our language-model-based approach to Task 1.

Fig. 1. Method of Task 1.

For Task 1a, we tried many models to solve the problem. After many experiments, we finally adopted the Two-Stage Language Model. After obtaining the data, we remove punctuation marks such as periods, commas, and semicolons from all queries and candidate documents. We also remove some common words (stop words) to reduce their influence on the results.
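As an illustration of this preprocessing step, the sketch below strips punctuation and a small set of common words from a query before it is handed to the retrieval system. It is a minimal sketch only: the stop-word list shown here is an example, not the list actually used in our experiments.

```python
import re

# Illustrative stop-word list; the list used in the actual experiments is not reproduced here.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "was", "were"}

def preprocess(text: str) -> str:
    """Remove punctuation (periods, commas, semicolons, ...) and common words."""
    # Replace every non-alphanumeric character with a space and lower-case the text.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Drop the common (stop) words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

# Example: clean a situation description before indexing/retrieval.
print(preprocess("The appellant was convicted under Section 302; an appeal was then filed."))
```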
In the Two-Stage Language Model, the document language model is smoothed in two steps. In the first stage, the document language model is smoothed using a Dirichlet prior with the collection language model as the reference model. In the second stage, the smoothed document language model is further interpolated with a query background language model [1]. We used the Indri toolkit (http://www.lemurproject.org/) to index the documents. During retrieval, we used the Two-Stage Language Model to compute the similarity between each query and each document, as given by Eq. (1) [1]:

p(q \mid \hat{\theta}_D, \lambda, \mu) = \prod_{i=1}^{m} \big[ (1-\lambda)\, p(q_i \mid \hat{\theta}_D) + \lambda\, p(q_i \mid U) \big] = \prod_{i=1}^{m} \Big[ (1-\lambda)\, \frac{c(q_i, d) + \mu\, p(q_i \mid C)}{|d| + \mu} + \lambda\, p(q_i \mid U) \Big]    (1)

where q = q_1 ... q_m is the query, c(q_i, d) is the count of q_i in document d, |d| is the document length, p(q_i | C) is the collection language model, and p(q_i | U) is the query background language model. Once the similarity between each query and each document is computed, we rank the documents by score: the higher the score, the higher the ranking and the more similar the query and the document. After many experiments, we found that μ = 2500 and λ = 0.8 give the best retrieval results for Task 1a.

2.2 Methods for Task1b

For Task 1b, in addition to the method used for Task 1a, we also used the Jelinek-Mercer language model [2] to compute the similarity between query and document, as given by Eq. (2). Before retrieval, we also processed the given data with word-based n-grams and character-based n-grams. We found that character-based n-grams perform much better than word-based n-grams, and that character n-grams with n ranging from 2 to 7 achieve the best result.

p(w \mid \hat{\theta}_D) = \lambda_D\, p_{ML}(w \mid \hat{\theta}_D) + (1 - \lambda_D)\, p(w \mid \hat{\theta}_C)    (2)

where p_{ML}(w | \hat{\theta}_D) is the maximum-likelihood estimate of w in the document and p(w | \hat{\theta}_C) is the collection language model.

2.3 Methods for Task2

For Task 2, we treat the problem as a multi-class classification problem. We use a Logistic Regression model and a lighter implementation of BERT (bert4keras, https://github.com/bojone/bert4keras). The BERT weights are initialized with the uncased_L-12_H-768_A-12 checkpoint. The 8,096 training sentences, without any further preprocessing, are used to fine-tune the BERT model with the parameters max_len = 124, batch_size = 24, units = 7, and epochs = 2.

3. Experimental Setting

3.1 Parameter Selection

For Task 1a, we tried different values of μ and λ in the Two-Stage Language Model to observe their effect. Fig. 2 shows the results for different values of λ when μ = 1500 and μ = 2500.

Fig. 2. Experimental results of different parameter combinations in Task 1a.

For Task 1b, we likewise tried different values of μ and λ. Fig. 3 shows the results for different values of λ when μ = 1500 and μ = 2500.

Fig. 3. Experimental results of different parameter combinations in Task 1b.

For Task 1b, we also tried different n-gram preprocessing schemes; the experimental results are shown in Table 1. In conclusion, μ = 2500 and λ = 0.8 achieve the best results, and the character-level 2+3+4+5+6+7-gram scheme yields a higher MAP than the other settings.

Table 1
Experimental results of different n-gram processing combinations in Task1b

No.  N-gram processing                        MAP
1    Char-2gram                               0.0728
2    Char-2+3gram                             0.1197
3    Char-2+3+4gram                           0.1329
4    Char-2+3+4+5gram                         0.1360
5    Char-2+3+4+5+6gram                       0.1368
6    Char-2+3+4+5+6+7gram                     0.1473
7    Char-2+3+4+5+6+7+8gram                   0.1436
8    Char-2+3+4+5+6+7+8+9gram                 0.1443
9    Char-2+3+4+5+6+7+8+9+10gram              0.1434
10   Char-2+3+4+5+6+7+8+9+10+11gram           0.1343
11   Char-2+3+4+5+6+7+8+9+10+11+12gram        0.1325
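To make the scoring functions of Eqs. (1) and (2) and the character n-gram preprocessing concrete, the sketch below implements both smoothing schemes over a small in-memory corpus. This is only an illustration of the formulas under the assumption that the query background model p(w | U) falls back to the collection model; in our experiments the indexing and scoring were done with the Indri toolkit, and the function names (two_stage_score, jm_score, char_ngrams) are ours, not part of Indri.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=7):
    """Character n-gram 'tokens' (n = 2..7), the preprocessing that worked best for Task 1b."""
    s = text.replace(" ", "_")
    return [s[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(s) - n + 1)]

def lm_stats(docs):
    """Per-document term counts and the collection language model p(w | C)."""
    doc_counts = [Counter(d.split()) for d in docs]
    collection = Counter()
    for c in doc_counts:
        collection.update(c)
    total = sum(collection.values())
    p_coll = {w: n / total for w, n in collection.items()}
    return doc_counts, p_coll

def two_stage_score(query, counts, p_coll, mu=2500, lam=0.8):
    """Log of Eq. (1): Dirichlet-smoothed document model (stage 1) interpolated with a
    query background model (stage 2); here the background model falls back to p(w | C)."""
    dlen = sum(counts.values())
    score = 0.0
    for w in query.split():
        p_c = p_coll.get(w, 1e-9)               # small floor for unseen terms, avoids log(0)
        p_dirichlet = (counts[w] + mu * p_c) / (dlen + mu)
        score += math.log((1 - lam) * p_dirichlet + lam * p_c)
    return score

def jm_score(query, counts, p_coll, lam=0.8):
    """Log of Eq. (2): Jelinek-Mercer interpolation of the document ML model with p(w | C)."""
    dlen = sum(counts.values())
    score = 0.0
    for w in query.split():
        p_ml = counts[w] / dlen if dlen else 0.0
        score += math.log(lam * p_ml + (1 - lam) * p_coll.get(w, 1e-9))
    return score

# Toy example: rank three "documents" against a query with the two-stage model.
docs = ["murder appeal supreme court", "land dispute civil appeal", "tax evasion penalty"]
doc_counts, p_coll = lm_stats(docs)
query = "murder appeal"
ranking = sorted(range(len(docs)),
                 key=lambda i: two_stage_score(query, doc_counts[i], p_coll), reverse=True)
print(ranking)                                   # document indices, most similar first
print(char_ngrams("section 302")[:5])            # first few character 2-grams
```

In the actual runs, μ and λ were set to the values selected in Section 3.1 (μ = 2500, λ = 0.8).

For Task 2, the paper specifies only the checkpoint (uncased_L-12_H-768_A-12) and the fine-tuning parameters (max_len = 124, batch_size = 24, units = 7, epochs = 2). The sketch below shows one plausible way to set up such a run with bert4keras; the file paths, the learning rate, the data-loading step, and the exact tokenizer/model API (which differs slightly across bert4keras versions) are assumptions rather than our exact training script.

```python
import numpy as np
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.snippets import sequence_padding
from bert4keras.backend import keras

# Hypothetical paths to the uncased_L-12_H-768_A-12 checkpoint files.
config_path = "uncased_L-12_H-768_A-12/bert_config.json"
checkpoint_path = "uncased_L-12_H-768_A-12/bert_model.ckpt"
dict_path = "uncased_L-12_H-768_A-12/vocab.txt"

MAX_LEN, BATCH_SIZE, NUM_CLASSES, EPOCHS = 124, 24, 7, 2  # parameters reported in Section 2.3

tokenizer = Tokenizer(dict_path, do_lower_case=True)

def encode(sentences):
    """Convert raw sentences into padded token-id / segment-id arrays."""
    token_ids, segment_ids = [], []
    for s in sentences:
        t, seg = tokenizer.encode(s, maxlen=MAX_LEN)
        token_ids.append(t)
        segment_ids.append(seg)
    return [sequence_padding(token_ids), sequence_padding(segment_ids)]

# Build BERT and put a 7-way softmax on top of the [CLS] vector.
bert = build_transformer_model(config_path=config_path, checkpoint_path=checkpoint_path)
cls_vector = keras.layers.Lambda(lambda x: x[:, 0])(bert.output)
output = keras.layers.Dense(NUM_CLASSES, activation="softmax")(cls_vector)
model = keras.models.Model(bert.input, output)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(2e-5),  # learning rate is an assumption
              metrics=["accuracy"])

# train_sentences / train_labels: the 8,096 labeled sentences, labels encoded as integers 0..6
# (loading code omitted).
# model.fit(encode(train_sentences), np.array(train_labels), batch_size=BATCH_SIZE, epochs=EPOCHS)
```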
3.2 Experimental Results

For Task 1, we submitted three runs. Tables 2 and 3 show the results of our submitted runs on the test data [3].

Table 2
The performance of our submitted results for Task1a

Run_ID               MAP      BPREF    Recip_rank   P@10
fs_hit_2_task1a_01   0.125    0.0724   0.1906       0.07
fs_hit_2_task1a_02   0.0126   0        0.041        0.02
fs_hit_2_task1a_03   0.0123   0        0.0395       0.02

Table 3
The performance of our submitted results for Task1b

Run_ID               MAP      BPREF    Recip_rank   P@10
fs_hit_2_task1b_01   0.2003   0.1587   0.3452       0.1
fs_hit_2_task1b_02   0.1777   0.1247   0.2546       0.12
fs_hit_2_task1b_03   0.1886   0.132    0.279        0.1

For Task 2, we submitted two runs. Table 4 shows their results on the test data.

Table 4
The performance of our submitted results for Task2

Run_ID      Precision   Recall   F-Score   Accuracy
fs_hit2_1   0.411       0.465    0.405     0.535
fs_hit2_2   0.455       0.427    0.398     0.549

4. Conclusions

This paper introduces the methods we used in the FIRE 2020 AILA track. Compared with the results of other participants, our methods still show many deficiencies. For the task of identifying relevant prior cases, the final evaluation results show that BM25- and TF-IDF-based methods outperform ours, while for the multi-class classification task, BERT shows good results.

5. Acknowledgements

This work is supported by the National Social Science Fund of China (No. 18BYY125).

6. References

[1] ChengXiang Zhai, John Lafferty. Two-Stage Language Models for Information Retrieval. In: Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002).
[2] Guodong Ding, Bin Wang. GJM-2: A Special Case of General Jelinek-Mercer Smoothing Method for Language Modeling Approach to Ad Hoc IR. In: Information Retrieval Technology, Second Asia Information Retrieval Symposium (AIRS 2005), Jeju Island, Korea, October 13-15, 2005, Proceedings.
[3] Paheli Bhattacharya, Parth Mehta, Kripabandhu Ghosh, Saptarshi Ghosh, Arindam Pal, Arnab Bhattacharya, Prasenjit Majumder. Overview of the FIRE 2020 AILA Track: Artificial Intelligence for Legal Assistance. In: Proceedings of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 2020.