
HLJIT2017@IRLed-FIRE2017: Information Retrieval From Legal Documents

Liuyang Tian

tianliuyang2016@outlook.com 3

Hui Ning

ninghui@hrbeu.edu.cn 3

Zhongyuan Han

Hanzhongyuan@gmail.com 5

Ruiming Xiao

xiaoruiming11@outlook.com 4

Leilei Kong*

kongleilei1979@gmail.com 5

Haoliang Qi

haoliang.qi@gmail.com 0

1 School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China
2 State Key Laboratory of Digital Publishing Technology
3 School of Computer Science and Technology, Harbin Engineering University, Harbin, China
4 School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China
5 School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China

Keywords: Information Retrieval from Legal Documents, Catchphrase Extraction, Precedence Retrieval

This paper details our approach to the Catchphrase Extraction and Precedence Retrieval tasks of the Information Retrieval from Legal Documents track at the Forum for Information Retrieval Evaluation 2017 (FIRE 2017 IRLeD). For the Catchphrase Extraction task, classification-based and ranking-based methods were exploited, and various types of features were attempted. For the Precedence Retrieval task, the language model, BM25 and the vector space model were employed. Comparisons with other submissions for the same tasks show the presented methods to be among the top performers.

(1) where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, $i = 1, 2, \ldots, n$, and $y_i$ is the label denoting whether the word $w_i$ is a catchphrase of $d_i$ or not. Our goal is then to learn a model that decides whether a word is a catchphrase when given a new $d_i$. Classification-based and ranking-based methods are each exploited to learn this model.

2.1 Classification Method: Bagging

Bagging is based on bootstrap sampling. First, we run M rounds of bootstrap sampling to obtain M sampling sets, each containing N training samples. Then, a base learner is trained on each sampling set. Finally, the M base learners are combined. Multi-class classification is resolved by a simple voting method[2]. Decision tree and random forest[3] are adopted as our base classifiers, denoted as Bagging(DTC) and Bagging(RFC) respectively.
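The bootstrap-then-vote procedure can be sketched in a few lines of Python. This is a minimal illustration, not the configuration used in our runs: the base learner here is a hypothetical one-feature threshold rule (a decision stump) standing in for the decision tree and random forest classifiers, and the function names and toy data are assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement (one bootstrap round)."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Hypothetical base learner: a one-feature threshold rule.

    Tries every (feature, threshold, direction) and keeps the first rule
    with the fewest training errors on the bootstrap sample.
    """
    best = None
    for f in range(len(sample[0][0])):
        for x, _ in sample:
            t = x[f]
            for above in (0, 1):  # label predicted when feature value > t
                errors = sum(
                    (above if xi[f] > t else 1 - above) != yi
                    for xi, yi in sample
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda x: above if x[f] > t else 1 - above


def bagging_fit(data, m_rounds, seed=0):
    """Train one base learner per bootstrap sampling set."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(m_rounds)]


def bagging_predict(learners, x):
    """Combine the base learners by simple majority voting."""
    votes = Counter(h(x) for h in learners)
    return votes.most_common(1)[0][0]


# Toy data: (feature vector, label); label 1 marks a catchphrase word.
train = [
    ((0.1, 3.0), 0), ((0.2, 1.0), 0), ((0.3, 2.0), 0),
    ((0.7, 2.5), 1), ((0.8, 1.5), 1), ((0.9, 3.5), 1),
]
model = bagging_fit(train, m_rounds=25)
print(bagging_predict(model, (0.95, 2.0)), bagging_predict(model, (0.05, 2.0)))
```

Replacing `train_stump` with a stronger learner changes only the training step; the sampling and voting steps stay the same.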

2.2 Ranking Method: RankSVM

RankSVM is a pairwise learning-to-rank method that uses the SVM model to solve the ranking problem on document pairs. To rank the candidate catchphrases, we trained a ranking function on the training corpus using Ranking SVM (see footnote 2).
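The pairwise idea behind RankSVM can be illustrated as follows. This sketch is not the Ranking SVM tool itself: it builds difference vectors for every pair of candidates with different labels inside one document and fits a linear scorer with plain subgradient descent on a regularised hinge loss. The function names, hyperparameters and toy features are assumptions chosen for the example.

```python
import random


def pairwise_examples(groups):
    """Build difference vectors x_preferred - x_other within each document.

    A linear scorer w orders a pair correctly when its dot product with the
    difference vector is positive.
    """
    diffs = []
    for items in groups:
        for xi, yi in items:
            for xj, yj in items:
                if yi > yj:
                    diffs.append([a - b for a, b in zip(xi, xj)])
    return diffs


def train_linear_ranker(diffs, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Subgradient descent on the regularised pairwise hinge loss."""
    rng = random.Random(seed)
    diffs = list(diffs)  # do not reorder the caller's list
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        rng.shuffle(diffs)
        for d in diffs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            for k in range(len(w)):
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w


def score(w, x):
    """Ranking score of one candidate word."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Toy data: one list per document of ((tfidf, prior_score), label).
groups = [
    [((0.9, 1.0), 1), ((0.2, 0.0), 0), ((0.4, 0.0), 0)],
    [((0.7, 1.0), 1), ((0.8, 0.0), 0), ((0.1, 0.0), 0)],
]
w = train_linear_ranker(pairwise_examples(groups))
ranked = sorted(groups[1], key=lambda item: score(w, item[0]), reverse=True)
print([label for _, label in ranked])
```

Candidates are then submitted in descending order of `score`, which is what distinguishes the ranking run from the classification runs.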

2.3 Features

We construct the features from five aspects: statistical features, position features, syntactic features, mutual information and prior probabilities.

Statistical Features We use the term frequency (TF), inverse document frequency (IDF), TF*IDF, and BM25 score of a word in document di as the statistical features.
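As a small illustration of the first three statistical features, the sketch below computes TF, IDF and TF*IDF for one word. The exact weighting variants are not specified above, so the raw-count TF and the IDF form log(N / df) used here are assumptions, as are the function name and the toy corpus.

```python
import math
from collections import Counter


def statistical_features(word, doc, corpus):
    """Return (tf, idf, tf * idf) of `word` for one tokenised document.

    Assumptions: tf is the raw count in `doc`; idf = log(N / df), where N is
    the number of documents and df the number of documents containing `word`.
    """
    tf = Counter(doc)[word]
    df = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf, idf, tf * idf


corpus = [
    ["appeal", "court", "appeal", "tax"],
    ["court", "tax", "income"],
    ["court", "contract"],
]
print(statistical_features("appeal", corpus[0], corpus))
print(statistical_features("court", corpus[0], corpus))
```

A word that occurs in every document, such as "court" in the toy corpus, receives an IDF of zero and therefore a TF*IDF of zero.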

Position Features

1) First occurrence of the word.

$$\mathrm{FirstOccur\_score}(w) = \frac{\sum_{i=1}^{num} \mathrm{FirstOccur}(w, d_i)}{num} \tag{2}$$

$$\mathrm{FirstOccur}(w, d) = \frac{\mathrm{precede}(w, d)}{\mathrm{len}(d)} \tag{3}$$

where num is the number of legal documents, precede(w,d) represents the number of words in front of the first occurrence of the word w in the legal document d, and len(d) is the number of words of d.

2) Sentence-initial or sentence-final position.

$$\mathrm{InFirstLast\_score}(w) = \frac{\sum_{i=1}^{num} \mathrm{InFirstLast}(w, d_i)}{num} \tag{4}$$

$$\mathrm{InFirstLast}(w, d) = \frac{\mathrm{countFL}(w, d)}{\mathrm{len}(d)} \tag{5}$$

where num is the number of legal documents, and countFL(w,d) represents the number of occurrences of the word w at the beginning or the end of a sentence in the legal document d.

POS features We also choose the part-of-speech of a word as a feature. For each sentence, we get the POS of each word using the Stanford POS Tagger[6]. We choose a subset of POS tags (i.e. NNS, NNPS, NNP, NN, VBZ, VBP, VBN, VBG, VBD, VB, TO, JJ, RB) as our features.
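The position features can be sketched as below. The code covers the first-occurrence feature of Equations (2) and (3) and the raw count countFL(w,d); the function names are hypothetical. Two points are assumptions because the text does not define them: the value used when a word does not occur in a document (`missing`), and counting a sentence once even if the word is both its first and its last token.

```python
def first_occur(word, doc, missing=1.0):
    """Equation (3): words before the first occurrence, divided by len(doc).

    `missing` is returned when the word is absent from the document; this
    value is an assumption, since that case is not defined in the text.
    """
    if word not in doc:
        return missing
    return doc.index(word) / len(doc)


def first_occur_score(word, docs):
    """Equation (2): average of first_occur over all num documents."""
    return sum(first_occur(word, d) for d in docs) / len(docs)


def count_first_last(word, sentences):
    """countFL(w, d): sentences of d where `word` is the first or last token."""
    return sum(1 for s in sentences if s and word in (s[0], s[-1]))


docs = [
    ["the", "court", "held"],
    ["court", "of", "appeal", "court"],
]
sentences = [["court", "held"], ["the", "court"], ["the", "court", "held"]]
print(first_occur_score("court", docs), count_first_last("court", sentences))
```

Small values of the first-occurrence score indicate words that tend to appear early in the documents.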

Mutual Information High-quality keywords should be semantically related. If the relevance between a word w and a keyword k is low, then w may not be suitable as a keyword[7]. The degree of correlation between words is measured by the average mutual information.

$$MI(w_1, w_2, \ldots, w_k) = \frac{1}{N} \sum_{i,j = 1,2,\ldots,k;\; i \neq j} MI(w_i, w_j) \tag{6}$$

$$MI(w_1, w_2) = \sum \sum p(w_1, w_2) \log\left(\frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}\right) \tag{7}$$

where N is the number of word pairs (wi,wj), and MI(wi, wj) represents the mutual information of wi and wj.

2 http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html
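A minimal sketch of the average mutual information follows. How the probabilities are estimated is not stated above, so this example assumes document-level events (a word is present in or absent from a document) and sums over the four presence/absence combinations; the function names and toy documents are likewise illustrative. Because mutual information is symmetric, averaging over unordered pairs gives the same value as averaging over ordered pairs with i not equal to j.

```python
import math
from itertools import combinations


def mutual_information(w1, w2, docs):
    """Equation (7) over the binary events "word present / absent in a doc".

    Estimating probabilities from document-level presence is an assumption.
    """
    n = len(docs)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            joint = sum(1 for d in docs if (w1 in d) == a and (w2 in d) == b) / n
            p1 = sum(1 for d in docs if (w1 in d) == a) / n
            p2 = sum(1 for d in docs if (w2 in d) == b) / n
            if joint > 0:  # a zero-probability cell contributes nothing
                mi += joint * math.log(joint / (p1 * p2))
    return mi


def average_mi(words, docs):
    """Equation (6): mean MI over all pairs of distinct words (needs >= 2 words)."""
    pairs = list(combinations(words, 2))
    return sum(mutual_information(a, b, docs) for a, b in pairs) / len(pairs)


dependent = [{"tax", "income"}, {"tax", "income"}, {"contract"}, {"contract"}]
independent = [{"a", "b"}, {"a"}, {"b"}, set()]
print(mutual_information("tax", "income", dependent))
print(mutual_information("a", "b", independent))
```

Words that always co-occur reach the maximum value for this toy setting, while statistically independent words score zero.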

Prior Probabilities Using the gold standard, we build a prior keyword set P. P contains the gold-standard keywords whose IDF is greater than two.

$$\mathrm{Prior\_score}(w, P) = \begin{cases} 1, & w \in P \\ 0, & w \notin P \end{cases} \tag{8}$$

3 Methods of Precedence Retrieval

The task of Precedence Retrieval is to retrieve the prior cases for a given current case. We view the current case as the query and the prior cases as the documents, and use three classical information retrieval models to address precedence retrieval.

3.1 Language Model method

For the query model and the document model, we use the language model based on Dirichlet Prior Smoothing[8]. The relevance between query and document is computed as follows.

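$$\mathrm{score}(q, d) = \log p(q \mid d) \overset{rank}{=} \sum_{w_i \in q,\; w_i \in d} c(w_i, q) \log\left[1 + \frac{c(w_i, d)}{\mu\, p(w_i \mid C)}\right] + n \log\frac{\mu}{|d| + \mu} \tag{9}$$

where c(w, q) and c(w, d) are the counts of word w in the query q and the document d, p(w | C) is the collection language model, $\mu$ is the Dirichlet prior parameter, |d| is the length of d, and n is the length of the query.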

3.2 Probability Model

The second retrieval model we chose is the BM25 model[9]. The relevance score is computed as follows.

$$\mathrm{BM25} = \sum_{w_i \in q} \log\left(\frac{tf(w_i, d) \cdot idf(w_i) \cdot (k_1 + 1)}{tf(w_i, d) + k_1\left(1 - b + b \cdot \frac{len(d)}{avdl}\right)}\right) \tag{10}$$

where q is the query, d is the candidate document, len(d) is the length of d, avdl is the average document length, and k1 and b are the adjustment parameters.
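For illustration, a BM25 scorer can be written as below. This sketch uses the standard per-term form described in [9] (an idf weight multiplied by a saturated, length-normalised term frequency) and does not apply the logarithm written in Equation (10). The idf variant is an assumption, since the text does not state which one is used; the function name and toy documents are also illustrative. The defaults `k1=1.8` and `b=0.7` are the values reported in the experimental settings.

```python
import math
from collections import Counter


def bm25_score(query, doc, docs, k1=1.8, b=0.7):
    """Standard BM25 score of `doc` for `query` over the collection `docs`.

    Assumption: idf(w) = log(1 + (N - df + 0.5) / (df + 0.5)), a common
    non-negative variant.
    """
    n_docs = len(docs)
    avdl = sum(len(d) for d in docs) / n_docs
    d_tf = Counter(doc)
    score = 0.0
    for w in set(query):  # each distinct query word is counted once
        tf = d_tf[w]
        if tf == 0:
            continue
        df = sum(1 for d in docs if w in d)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1.0 - b + b * len(doc) / avdl)
        score += idf * tf * (k1 + 1.0) / (tf + length_norm)
    return score


docs = [
    ["tax", "appeal", "tax", "court"],
    ["tax", "court", "order"],
    ["contract", "lease", "court"],
]
query = ["tax", "appeal"]
print([round(bm25_score(query, d, docs), 4) for d in docs])
```

Larger `b` penalises long documents more strongly, and larger `k1` lets repeated occurrences of a term keep adding to the score for longer before saturating.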

3.3 Vector Space Model

We also used Lucene[10], which implements the vector space model, to estimate the relevance of query and document. The formula is as follows:

$$\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \text{ in } q} \left( tf(t \text{ in } d) \cdot idf(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \right) \tag{11}$$

Here, t represents a term together with its field information; coord(q, d) rewards documents that contain more of the query terms, so the more query terms a document contains, the higher its score; tf(t in d) represents the frequency of term t in document d; idf(t) is the inverse document frequency of t; norm(t, d) represents the normalization factor.
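To make the roles of the factors in Equation (11) concrete, the following Python sketch evaluates the formula directly. It does not call Lucene. The component definitions (square-root tf, idf = 1 + ln(N / (df + 1)), norm = 1 / sqrt(document length), coord = matched terms / query terms, queryNorm = 1 / sqrt(sum of squared idf times boost)) are assumptions modeled on Lucene's classic default similarity, and a single boost value is used for all terms.

```python
import math
from collections import Counter


def classic_vsm_score(query, doc, docs, boost=1.0):
    """Evaluate Equation (11) for one non-empty document.

    All component definitions are assumptions modeled on Lucene's classic
    default similarity; see the lead-in paragraph.
    """
    n_docs = len(docs)
    d_tf = Counter(doc)
    terms = sorted(set(query))
    idf = {}
    for t in terms:
        df = sum(1 for d in docs if t in d)
        idf[t] = 1.0 + math.log(n_docs / (df + 1))
    query_norm = 1.0 / math.sqrt(sum((idf[t] * boost) ** 2 for t in terms))
    matched = [t for t in terms if d_tf[t] > 0]
    coord = len(matched) / len(terms)
    norm = 1.0 / math.sqrt(len(doc))
    total = sum(math.sqrt(d_tf[t]) * idf[t] ** 2 * boost * norm for t in matched)
    return coord * query_norm * total


docs = [
    ["tax", "appeal", "tax", "court"],
    ["tax", "court", "order"],
    ["contract", "lease", "court"],
]
query = ["tax", "appeal"]
print([round(classic_vsm_score(query, d, docs), 4) for d in docs])
```

The coord factor is what separates a document matching both query terms from one matching a single term, even when the per-term weights are similar.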

4 Experimental Results

4.1 Results of Catchphrase Extraction

4.1.1 Dataset

In this task, a set of legal documents (Indian Supreme Court decisions) is provided. For a few of these documents (the training set), the catchphrases (gold standard) are provided. Catchphrases are short phrases from within the text of the document. These catchphrases were obtained from Manupatra (www.manupatra.co.in), a well-known legal search system that employs legal experts to annotate case documents with catchphrases. The rest of the documents are used as the test set. The Catchphrase Extraction dataset contains 100 training documents and 300 test documents.

4.1.2 Experimental Settings

First, we build a candidate catchphrase set. Statistics on the training corpus of catchphrase extraction show that nouns account for 68.85%, verbs for 9.13%, adjectives for 12.49%, prepositions for 5.90%, adverbs for 1.43%, and other types for 2.20%. Based on this distribution, we choose nouns, verbs, adjectives and adverbs as candidates.

On the candidate set, we use different combinations of features for the classification and ranking methods. The bagging models use TF*IDF, BM25, Position Features, POS, Mutual Information and Prior Probabilities as the feature set. RankSVM uses TF*IDF, POS and Prior Probabilities as the feature set.

A detailed description of the method parameter settings is shown in the following Table 1:

4.1.3 Results

We submitted three runs: Bagging(DTC) (denoted as HLJIT2017_IRLeD_Task1_1), Bagging(RFC) (denoted as HLJIT2017_IRLeD_Task1_2) and RankSVM (denoted as HLJIT2017_IRLeD_Task1_3). The experimental results are shown in Table 2.

Note that our three submitted runs are close on MAP but differ on the other evaluation metrics. We surmise that this is mainly due to the order of the submitted catchphrases: the results of the two classification methods are sorted alphabetically, while the ranking results are submitted in descending order of score.

In addition, we only choose single words, not phrases, as catchphrases. We tried some rule-based methods to construct phrases from the words we extracted from the legal documents, but they did not improve performance. The low scores on MRP, MP@10, MR@100 and Overall Recall may be caused by this.

4.2 Results of Precedence Retrieval

4.2.1 Dataset

Task 2 provides two data sets: 200 current cases (Query_docs), formed by removing the links to the prior cases, and 2000 prior cases, which comprise the cases cited by the cases in Query_docs along with some random cases (not among Query_docs).

4.2.2 Experimental Settings

We apply the same pre-processing for query extraction and index building. To discard some of the interfering information in the documents, we filter each document by part of speech, keeping only nouns, verbs and adjectives; Porter stemming, lowercasing and stop-word removal are also applied. For the language model, we set the parameters mu=10000 and lambda=0.5, and for BM25, we set k1=1.8 and b=0.7.

4.2.3 Results

In Task 2, we submitted three runs: Language Model (HLJIT2017_IRLeD_Task2_1), Vector Space Model (HLJIT2017_IRLeD_Task2_2) and BM25 (HLJIT2017_IRLeD_Task2_3). The results are shown in Table 3.

Table 3
MAP: 0.1784
Mean Reciprocal Rank: 0.4074
Precision@10: 0.2180, 0.1290, 0.1665
Recall@100: 0.6810, 0.5950, 0.6710

The experimental results show that the language model performs much better than the other two models. The MAP of the language model is 0.3291, the MAP of the vector space model is 0.2479, and the probability model (BM25) has the lowest MAP, 0.1784.
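For reference, the per-query quantities behind the reported metrics can be computed as in the sketch below; MAP and mean reciprocal rank are the averages of the first two functions over all queries. These are the textbook definitions, and whether the track's evaluation script matches them in every detail is an assumption. The function names and the toy ranking are illustrative.

```python
def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant items appear."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at(ranked, relevant, k):
    """Fraction of the relevant items found in the top k results."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


# Toy example: one current case whose true prior cases are c1 and c2.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
print(average_precision(ranked, relevant), reciprocal_rank(ranked, relevant))
print(precision_at(ranked, relevant, 2), recall_at(ranked, relevant, 4))
```

Average precision and reciprocal rank depend on the order of the returned list, whereas recall at a fixed cutoff only depends on which items fall inside the cutoff.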

Some methods that can improve the performance of the language model, such as query expansion and document expansion, have not yet been applied in this evaluation; they are left for future work. In addition, for smoothing the document model we only used the given set of legal documents; such a small collection leads to sparse word statistics, and the parameters of the document and query models were not tuned to their optimum. We believe that addressing these points will improve the performance of the retrieval model, and we will try them in future research.

We described the approach to resolve the problems of Catchphrase Extraction and Precedence Retrieval in Fire2017 IRLeD task.

For the task of Catchphrase Extraction, we tried classification-based and ranking-based methods. Various types of features are integrated into our models. The experiments show that the ranking-based model achieved better performance.

For the task of Precedence Retrieval, we have only tried several basic retrieval models, namely the language model, the probability model (BM25) and the vector space model. Experiments show that the language model outperforms the vector space model and BM25.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No.61772177) and the Special subject of State Key Laboratory of Digital Publishing Technology (The research on Plagiarism Detection-From Heuristic to Machine Learning).

[1] Mandal, Ghosh, Bhattacharya, Pal and Ghosh. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.

[2] Du, Xia, Zhang, et al. Multiple classifier system for remote sensing image classification: A review[J]. Sensors, 2012, 12(4): 4764-4792.

[3] Liaw, Wiener M. Classification and regression by randomForest[J]. R news, 2002, 2(3): 18-22.

[4] Joachims. Learning to classify text using support vector machines: Methods, theory and algorithms[M]. Kluwer Academic Publishers, 2002.

[5] Hulth. Improved automatic keyword extraction given more linguistic knowledge[C]. Proceedings of the 2003 conference on Empirical methods in natural language processing. Association for Computational Linguistics, 2003: 216-223.

[6] Toutanova, Klein, Manning C D, Singer, 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, p. 173-180.

[7] Turney P D. Coherent keyphrase extraction via web mining[J]. arXiv preprint cs/0308033, 2003.

[8] Song, Croft W B. A general language model for information retrieval[C]. Proceedings of the eighth international conference on Information and knowledge management. ACM, 1999: 316-321.

[9] Robertson, Zaragoza. The probabilistic relevance framework: BM25 and beyond[J]. Foundations and Trends in Information Retrieval, 2009, 3(4): 333-389.

[10] McCandless, Hatcher, Gospodnetic. Lucene in Action: Covers Apache Lucene 3.0[M]. Manning Publications Co., 2010.