=Paper=
{{Paper
|id=Vol-2036/T3-7
|storemode=property
|title=HLJIT2017@IRLed-FIRE2017: Information Retrieval From Legal Documents
|pdfUrl=https://ceur-ws.org/Vol-2036/T3-7.pdf
|volume=Vol-2036
|authors=Liuyang Tian,Hui Ning,Leilei Kong,Zhongyuan Han,Ruiming Xiao,Haoliang Qi
|dblpUrl=https://dblp.org/rec/conf/fire/TianNKHXQ17
}}
==HLJIT2017@IRLed-FIRE2017: Information Retrieval From Legal Documents==
Liuyang Tian (School of Computer Science and Technology, Harbin Engineering University, Harbin, China; tianliuyang2016@outlook.com)
Hui Ning (School of Computer Science and Technology, Harbin Engineering University, Harbin, China; ninghui@hrbeu.edu.cn)
Leilei Kong, corresponding author (School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; kongleilei1979@gmail.com)
Zhongyuan Han (School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; hanzhongyuan@gmail.com)
Ruiming Xiao (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China; xiaoruiming11@outlook.com)
Haoliang Qi (School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; haoliang.qi@gmail.com)
Additional affiliation: State Key Laboratory of Digital Publishing Technology

ABSTRACT

This paper details our approach to the Catchphrase Extraction and Precedence Retrieval tasks of the Information Retrieval from Legal Documents track at the Forum for Information Retrieval Evaluation 2017 (FIRE 2017 IRLeD). For the task of Catchphrase Extraction, classification-based and ranking-based methods were exploited, and various types of features were attempted. For the task of Precedence Retrieval, the language model, BM25 and the vector space model were employed. Comparisons to other submissions for the same tasks show the presented methods to be among the top performers.

KEYWORDS

Information Retrieval from Legal Documents, Catchphrase Extraction, Precedence Retrieval

1 Introduction

With the recent developments in information technology, the number of digitally available legal documents has rapidly increased. In general, legal texts are long and complex in structure, which makes their thorough reading time-consuming and strenuous [1]. The Information Retrieval from Legal Documents track (https://sites.google.com/view/fire2017irled/track-description) is devoted to this problem. The track is divided into two tasks by the Forum for Information Retrieval Evaluation (FIRE): Catchphrase Extraction and Precedence Retrieval.

The task of Catchphrase Extraction focuses on extracting catchphrases (short relevant phrases) from legal documents. We first formalized catchphrase extraction as a classification problem and used bagging classification methods to identify the catchphrases. We then tried a learning-to-rank method to rank the words in a document and select the catchphrases.

The task of Precedence Retrieval can be viewed as an information retrieval problem. Its purpose is to retrieve the relevant prior cases for a given current case. We used three classical information retrieval models, the language model, BM25 and the vector space model, to retrieve the relevant documents for a given current case.

The rest of this paper is organized as follows. Section 2 introduces the methods and features used for Catchphrase Extraction. Section 3 describes the retrieval models used for Precedence Retrieval. Section 4 reports the experimental settings and results. The last section concludes the study.

2 Method of Catchphrase Extraction

Let d_i be a legal document. A training corpus can be defined as:

  D = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_n, y_n)}    (1)

where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, i = 1, 2, ..., n, is the feature vector of the word w_i, and y_i is the label denoting whether the word w_i is a catchphrase of d_i or not. Our goal is to learn a model that decides whether a word is a catchphrase when a new document d_i is given. Classification-based and ranking-based methods are exploited to learn this model.
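As a concrete illustration of Eq. (1), the sketch below (not the authors' code; the names build_instances and extract_features are hypothetical) shows how word-level training instances could be assembled: each candidate word of a document becomes one example, labeled 1 if it appears among the document's gold catchphrases.

```python
# Minimal sketch of assembling the training corpus of Eq. (1).
# `extract_features` stands in for the feature functions of Section 2.3.
def build_instances(doc_words, gold_catchphrases, extract_features):
    """doc_words: token list of one legal document d_i.
    gold_catchphrases: gold-standard catchphrase words of d_i.
    extract_features: callable (word, doc_words) -> feature vector x_i."""
    gold = {w.lower() for w in gold_catchphrases}
    X, y = [], []
    for word in sorted(set(doc_words)):          # one instance per distinct word
        X.append(extract_features(word, doc_words))
        y.append(1 if word.lower() in gold else 0)
    return X, y
```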
2.1 Classification Method: Bagging

Bagging is based on bootstrap sampling. First, we use M rounds of bootstrap sampling to obtain M samples, each containing N training examples. Then a base learner is trained on each of these sample sets. Finally, the M base learners are combined, and the classification decision is made by a simple voting method [2]. Decision trees and random forests [3] are adopted as our base classifiers, denoted Bagging(DTC) and Bagging(RFC) respectively.

2.2 Ranking Method: RankSVM

RankSVM is a pair-wise learning-to-rank method that uses the SVM model to solve the ranking problem on document pairs. To rank the candidate catchphrases, we trained a ranking function on the training corpus using Ranking SVM (http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html).
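The following sketch (assuming scikit-learn, which the paper does not name; it is only an illustration of the setup) shows how the two bagging runs could be configured with the parameter values later reported in Table 1.

```python
# Minimal sketch of the Bagging(DTC) and Bagging(RFC) configurations.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging(DTC): bagged decision trees (Table 1 parameters).
bagging_dtc = BaggingClassifier(
    estimator=DecisionTreeClassifier(min_samples_split=2),  # `base_estimator` in older scikit-learn
    n_estimators=66, max_samples=0.5, max_features=0.5)

# Bagging(RFC): bagged random forests (Table 1 parameters).
bagging_rfc = BaggingClassifier(
    estimator=RandomForestClassifier(min_samples_split=2),
    n_estimators=78, max_samples=0.5, max_features=0.5)

# X_train, y_train are the word-level feature vectors and labels of Eq. (1):
# bagging_dtc.fit(X_train, y_train)
# scores = bagging_dtc.predict_proba(X_test)[:, 1]
```

For the RankSVM run, the candidate words of each document can be written in the SVM-light input format expected by svm_rank (one line per candidate, `<target> qid:<document id> <feature>:<value> ...`), with the gold catchphrases given a higher target value than the other candidates.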
2.3 Features

We construct features from five aspects: statistical features, position features, syntactic (POS) features, mutual information and prior probabilities.

Statistical Features. We use the term frequency (TF), inverse document frequency (IDF), TF*IDF and BM25 score of a word in document d_i as statistical features.

Position Features.

1) The position of the first occurrence of the word:

  FirstOccur_score(w) = (1 / num) \sum_{i=1}^{num} FirstOccur(w, d_i)    (2)
  FirstOccur(w, d) = precede(w, d) / len(d)    (3)

where num is the number of legal documents, precede(w, d) is the number of words before the first occurrence of the word w in the legal document d, and len(d) is the number of words of d.

2) Sentence-initial or sentence-final position:

  InFirstLast_score(w) = (1 / num) \sum_{i=1}^{num} InFirstLast(w, d_i)    (4)
  InFirstLast(w, d) = countFL(w, d) / len(d)    (5)

where countFL(w, d) is the number of times the word w occurs at the beginning or at the end of a sentence in the legal document d.

POS Features. We also use the part-of-speech of a word as a feature. For each sentence, we obtain the POS of each word using the Stanford POS Tagger [6], and we use a subset of POS tags (NNS, NNPS, NNP, NN, VBZ, VBP, VBN, VBG, VBD, VB, TO, JJ, RB) as features.

Mutual Information. High-quality keywords should be semantically related: if the relevance between a word w and a keyword k is low, then w may not be suitable as a keyword [7]. The degree of correlation between words is measured by the average mutual information:

  MI(w_1, w_2, ..., w_k) = (1 / n) \sum_{i,j=1,2,...,k; i != j} I(w_i, w_j)    (6)
  I(w_1, w_2) = p(w_1, w_2) log( p(w_1, w_2) / (p(w_1) p(w_2)) )    (7)

where n is the number of word pairs (w_i, w_j) and I(w_i, w_j) is the mutual information of w_i and w_j.

Prior Probabilities. Using the gold standard, we build a prior keyword set P, which contains the keywords whose IDF is greater than two:

  Prior_score(w, P) = 1 if w is in P, 0 otherwise    (8)
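A minimal sketch of some of these features for one document is given below (the helper names are hypothetical; sentence segmentation and the corpus-level averaging over the num training documents in Eqs. (2) and (4) are assumed to happen elsewhere).

```python
# Minimal sketch of the per-document position features of Eqs. (3) and (5)
# and the prior-probability feature of Eq. (8).
def first_occur(word, doc_words):
    """Eq. (3): precede(w, d) / len(d), relative position of the first occurrence."""
    return doc_words.index(word) / len(doc_words)  # assumes `word` occurs in the document

def in_first_last(word, sentences, doc_len):
    """Eq. (5): countFL(w, d) / len(d), how often w opens or closes a sentence."""
    count_fl = sum(1 for s in sentences if s and (s[0] == word or s[-1] == word))
    return count_fl / doc_len

def prior_score(word, prior_set):
    """Eq. (8): 1 if the word is in the prior keyword set P, else 0."""
    return 1.0 if word in prior_set else 0.0
```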
3 Methods of Precedence Retrieval

The task of Precedence Retrieval is to retrieve the prior cases for a given current case. We view the current case as the query and the prior cases as the documents, and use three classical information retrieval models to solve the precedence retrieval problem.

3.1 Language Model Method

For the query model and the document model, we use the language model with Dirichlet prior smoothing [8]. The relevance between query and document is computed as follows:

  score(q, d) = log p(q | d) = \sum_{w_i \in q \cap d} c(w_i, q) log[1 + c(w_i, d) / (\mu p(w_i | C))] + n log(\mu / (\mu + len(d)))    (9)

where c(w_i, q) and c(w_i, d) are the counts of w_i in the query and in the document, p(w_i | C) is the collection language model, \mu is the Dirichlet smoothing parameter, n is the query length and len(d) is the document length.

3.2 Probability Model

The second retrieval model we chose is the BM25 model [9]. The relevance score is computed as follows:

  BM25(q, d) = \sum_{w_i \in q} log( idf(w_i) tf(w_i, d) (k_1 + 1) / ( tf(w_i, d) + k_1 (1 - b + b len(d) / avdl) ) )    (10)

where q is the query, d is the candidate document, tf(w_i, d) is the term frequency of w_i in d, idf(w_i) is its inverse document frequency, avdl is the average document length, and k_1 and b are the adjustment parameters.

3.3 Vector Space Model

We also used Lucene [10], which implements the vector space model, to estimate the relevance of query and document. The scoring formula is:

  score(q, d) = coord(q, d) queryNorm(q) \sum_{t \in q} ( tf(t in d) idf(t)^2 t.getBoost() norm(t, d) )    (11)

Here, t is a query term; coord(q, d) rewards documents that contain more of the query terms; tf(t in d) is the frequency of t in document d; idf(t) is the inverse document frequency of t; t.getBoost() is the boost of term t; and norm(t, d) is the normalization factor.
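The sketch below (not the submitted system; the dictionaries p_wC and idf and the tokenized inputs are assumed to be built beforehand) illustrates the scoring functions of Eqs. (9) and (10) with the parameter values reported in Section 4.2.2.

```python
import math

# Minimal sketch of the Dirichlet-smoothed query-likelihood score of Eq. (9).
def lm_dirichlet_score(query, doc, p_wC, mu=10000.0):
    score = 0.0
    for w in set(query):
        c_wd = doc.count(w)
        if c_wd > 0:  # sum only over query terms that occur in the document
            score += query.count(w) * math.log(1.0 + c_wd / (mu * p_wC[w]))
    score += len(query) * math.log(mu / (mu + len(doc)))  # document-length term
    return score

# Minimal sketch of the BM25 score of Eq. (10).
def bm25_score(query, doc, idf, avdl, k1=1.8, b=0.7):
    score = 0.0
    for w in set(query):
        tf = doc.count(w)
        if tf == 0:
            continue
        norm = tf + k1 * (1.0 - b + b * len(doc) / avdl)
        score += math.log(idf[w] * tf * (k1 + 1.0) / norm)
    return score
```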
4 Experimental Results

4.1 Results of Catchphrase Extraction

4.1.1 Dataset. In this task, a set of legal documents (Indian Supreme Court decisions) is provided. For some of these documents (the training set), the catchphrases (gold standard) are provided. Catchphrases are short phrases from within the text of the document. They have been obtained from a well-known legal search system, Manupatra (www.manupatra.co.in), which employs legal experts to annotate case documents with catchphrases. The remaining documents are used as the test set. The Catchphrase Extraction dataset contains 100 training documents and 300 test documents.

4.1.2 Experimental Settings. First, we construct a candidate catchphrase set. Statistics on the training corpus of catchphrase extraction show that nouns account for 68.85%, verbs for 9.13%, adjectives for 12.49%, prepositions for 5.90%, adverbs for 1.43% and other types for 2.20%. According to this distribution, we choose nouns, verbs, adjectives and adverbs as candidates. On the candidate set, we use different combinations of features in the classification and ranking methods: the bagging models use TF*IDF, BM25, the position features, POS, mutual information and prior probabilities as the feature set, while RankSVM uses TF*IDF, POS and prior probabilities. The parameter settings of the models are shown in Table 1.

Table 1: Parameter settings of the models in Task 1

  Method        Parameters
  Bagging(DTC)  min_samples_split=2, n_estimators=66, max_samples=0.5, max_features=0.5
  Bagging(RFC)  n_estimators=78, min_samples_split=2, max_samples=0.5, max_features=0.5
  RankSVM       c=16.0

4.1.3 Results. We submitted three runs: bagging(DTC) (HLJIT2017_IRLeD_Task1_1), bagging(RFC) (HLJIT2017_IRLeD_Task1_2) and RankSVM (HLJIT2017_IRLeD_Task1_3). The experimental results are shown in Table 2.

Table 2: FIRE 2017 IRLeD run evaluation for Catchphrase Extraction

  Metric              Bagging(DTC)  Bagging(RFC)  RankSVM
  Mean R-Precision    0.0297        0.0335        0.0864
  Mean Precision@10   0.0576        0.0600        0.1220
  Mean Recall@100     0.0328        0.0440        0.1514
  Overall Recall      0.0328        0.0440        0.1519
  MAP                 0.1401        0.1241        0.1649

Note that our three submitted runs are close on MAP but differ considerably on the other evaluation metrics. We surmise that this is mainly due to the ordering of the submitted catchphrases: the results of the two classification methods are sorted alphabetically, while the results of the ranking method are submitted in descending order of score. In addition, we only selected single words, not phrases, as catchphrases. We tried some rule-based methods to construct phrases from the words extracted from the legal documents, but did not achieve an improvement in performance. The low scores on Mean R-Precision, Mean Precision@10, Mean Recall@100 and Overall Recall may be caused by this.

4.2 Results of Precedence Retrieval

4.2.1 Dataset. Task 2 provides two data sets: 200 current cases (Query_docs), formed by removing the citation links from the case texts, and 2000 prior cases, consisting of the cases cited by the cases in Query_docs along with some random cases (not among Query_docs).

4.2.2 Experimental Settings. We perform the same pre-processing for query extraction and index building. In order to discard interfering information in the documents, we filter them by lexical information, keeping only nouns, verbs and adjectives; Porter stemming, lower-casing and stop-word removal are also applied. For the language model, we set mu=10000 and lambda=0.5; for BM25, we set k1=1.8 and b=0.7.

4.2.3 Results. In Task 2, we submitted three runs: Language Model (HLJIT2017_IRLeD_Task2_1), Vector Space Model (HLJIT2017_IRLeD_Task2_2) and BM25 (HLJIT2017_IRLeD_Task2_3). The results are shown in Table 3.

Table 3: FIRE 2017 IRLeD run evaluation for Precedence Retrieval

  Metric                 Language Model  BM25    Vector Space Model
  MAP                    0.3291          0.1784  0.2479
  Mean Reciprocal Rank   0.6325          0.4074  0.5246
  Precision@10           0.2180          0.1290  0.1665
  Recall@100             0.6810          0.5950  0.6710

From the experimental results, the language model is much better than the other two models: its MAP is 0.3291, the MAP of the vector space model is 0.2479, and the probability model (BM25) has the lowest MAP at 0.1784. Some methods that could improve the performance of the language model, such as query expansion and document expansion, have not yet been applied in this evaluation; they may be part of our future work. In addition, for smoothing the document model we only used the given set of legal documents; such a small collection leads to sparse word statistics, and the document and query models were not tuned to their optimum. We believe these methods would improve the performance of the retrieval models and will try them in future research.

5 Conclusions

We described our approach to the Catchphrase Extraction and Precedence Retrieval problems of the FIRE 2017 IRLeD track.

For the task of Catchphrase Extraction, we tried classification-based and ranking-based methods, integrating various types of features into the models. The experiments show that the ranking-based model achieved better performance.

For the task of Precedence Retrieval, we only tried several basic retrieval models: the language model, the probability model (BM25) and the vector space model. The experiments show that the language model performs better than the vector space model and BM25.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61772177) and the Special Subject of the State Key Laboratory of Digital Publishing Technology (The Research on Plagiarism Detection: From Heuristic to Machine Learning).

References

[1] A. Mandal, K. Ghosh, A. Bhattacharya, A. Pal and S. Ghosh. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.
[2] Du P, Xia J, Zhang W, et al. Multiple classifier system for remote sensing image classification: A review. Sensors, 2012, 12(4): 4764-4792.
[3] Liaw A, Wiener M. Classification and regression by randomForest. R News, 2002, 2(3): 18-22.
[4] Joachims T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.
[5] Hulth A. Improved automatic keyword extraction given more linguistic knowledge. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2003: 216-223.
[6] Toutanova K, Klein D, Manning C D, Singer Y. Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2003: 173-180.
[7] Turney P D. Coherent keyphrase extraction via web mining. arXiv preprint cs/0308033, 2003.
[8] Song F, Croft W B. A general language model for information retrieval. Proceedings of the Eighth International Conference on Information and Knowledge Management. ACM, 1999: 316-321.
[9] Robertson S, Zaragoza H. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 2009, 3(4): 333-389.
[10] McCandless M, Hatcher E, Gospodnetic O. Lucene in Action: Covers Apache Lucene 3.0. Manning Publications Co., 2010.