=Paper=
{{Paper
|id=Vol-2036/T3-7
|storemode=property
|title=HLJIT2017@IRLed-FIRE2017: Information Retrieval From Legal Documents
|pdfUrl=https://ceur-ws.org/Vol-2036/T3-7.pdf
|volume=Vol-2036
|authors=Liuyang Tian,Hui Ning,Leilei Kong,Zhongyuan Han,Ruiming Xiao,Haoliang Qi
|dblpUrl=https://dblp.org/rec/conf/fire/TianNKHXQ17
}}
==HLJIT2017@IRLed-FIRE2017: Information Retrieval From Legal Documents==
Liuyang Tian (School of Computer Science and Technology, Harbin Engineering University, Harbin, China; tianliuyang2016@outlook.com)
Hui Ning (School of Computer Science and Technology, Harbin Engineering University, Harbin, China; ninghui@hrbeu.edu.cn)
Leilei Kong, corresponding author (School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; kongleilei1979@gmail.com)
Zhongyuan Han (School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; hanzhongyuan@gmail.com)
Ruiming Xiao (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China; xiaoruiming11@outlook.com)
Haoliang Qi (School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; haoliang.qi@gmail.com)
Additional affiliation: State Key Laboratory of Digital Publishing Technology

ABSTRACT

This paper details our approach to the Catchphrase Extraction and Precedence Retrieval tasks of the Information Retrieval from Legal Documents track at the Forum for Information Retrieval Evaluation 2017 (FIRE 2017 IRLeD). For the task of Catchphrase Extraction, classification-based and ranking-based methods were exploited, and various types of features were attempted. For the task of Precedence Retrieval, the language model, BM25 and the vector space model were employed. Comparisons to other submissions for the same tasks show the presented methods to be among the top performers.

KEYWORDS

Information Retrieval from Legal Documents, Catchphrase Extraction, Precedence Retrieval

1 Introduction

With the recent developments in information technology, the number of digitally available legal documents has rapidly increased. In general, legal texts are long and complex in structure, which makes their thorough reading time-consuming and strenuous [1]. The Information Retrieval from Legal Documents track (https://sites.google.com/view/fire2017irled/track-description) is devoted to this problem. The track is divided into two tasks by the Forum for Information Retrieval Evaluation (FIRE): Catchphrase Extraction and Precedence Retrieval.

The task of Catchphrase Extraction focuses on extracting catchphrases (short relevant phrases) from legal documents. We first formalized catchphrase extraction as a classification problem and used bagging classification methods to identify the catchphrases. We then tried a learning-to-rank method to rank the words in a document and select the catchphrases.

The task of Precedence Retrieval can be viewed as an information retrieval problem. Its purpose is to retrieve the relevant prior cases for a given current case. We used three classical information retrieval models, the language model, BM25 and the vector space model, to retrieve the relevant documents for a given current case.

The rest of this paper is organized as follows. Section 2 introduces the methods and features used for Catchphrase Extraction. Section 3 describes the retrieval models used for Precedence Retrieval. Section 4 reports the experimental settings and results. The last section concludes the study.

2 Method of Catchphrase Extraction

Let d_i be a legal document. A training corpus can be defined as:

  D = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_n, y_n)}    (1)

where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, i = 1, 2, ..., n, is the feature vector of the word w_i, and y_i is the label denoting whether the word w_i is a catchphrase of d_i or not. Our goal is to learn a model that decides whether a word is a catchphrase when a new document d_i is given. Classification-based and ranking-based methods are exploited to learn this model.
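As a concrete illustration of Eq. (1), the sketch below (not the authors' code; the names build_instances and extract_features are hypothetical) shows how word-level training instances could be assembled: each candidate word of a document becomes one example, labeled 1 if it appears among the document's gold catchphrases.

```python
# Minimal sketch of assembling the training corpus of Eq. (1).
# `extract_features` stands in for the feature functions of Section 2.3.
def build_instances(doc_words, gold_catchphrases, extract_features):
    """doc_words: token list of one legal document d_i.
    gold_catchphrases: gold-standard catchphrase words of d_i.
    extract_features: callable (word, doc_words) -> feature vector x_i."""
    gold = {w.lower() for w in gold_catchphrases}
    X, y = [], []
    for word in sorted(set(doc_words)):          # one instance per distinct word
        X.append(extract_features(word, doc_words))
        y.append(1 if word.lower() in gold else 0)
    return X, y
```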
2.1 Classification Method: Bagging

Bagging is based on bootstrap sampling. First, we use M rounds of bootstrap sampling to obtain M samples, each containing N training examples. Then a base learner is trained on each of these sample sets. Finally, the M base learners are combined, and the classification decision is made by a simple voting method [2]. Decision trees and random forests [3] are adopted as our base classifiers, denoted Bagging(DTC) and Bagging(RFC) respectively.

2.2 Ranking Method: RankSVM

RankSVM is a pair-wise learning-to-rank method that uses the SVM model to solve the ranking problem on document pairs. To rank the candidate catchphrases, we trained a ranking function on the training corpus using Ranking SVM (http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html).
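The following sketch (assuming scikit-learn, which the paper does not name; it is only an illustration of the setup) shows how the two bagging runs could be configured with the parameter values later reported in Table 1.

```python
# Minimal sketch of the Bagging(DTC) and Bagging(RFC) configurations.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging(DTC): bagged decision trees (Table 1 parameters).
bagging_dtc = BaggingClassifier(
    estimator=DecisionTreeClassifier(min_samples_split=2),  # `base_estimator` in older scikit-learn
    n_estimators=66, max_samples=0.5, max_features=0.5)

# Bagging(RFC): bagged random forests (Table 1 parameters).
bagging_rfc = BaggingClassifier(
    estimator=RandomForestClassifier(min_samples_split=2),
    n_estimators=78, max_samples=0.5, max_features=0.5)

# X_train, y_train are the word-level feature vectors and labels of Eq. (1):
# bagging_dtc.fit(X_train, y_train)
# scores = bagging_dtc.predict_proba(X_test)[:, 1]
```

For the RankSVM run, the candidate words of each document can be written in the SVM-light input format expected by svm_rank (one line per candidate, `<target> qid:<document id> <feature>:<value> ...`), with the gold catchphrases given a higher target value than the other candidates.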
2.3 Features

We construct features from five aspects: statistical features, position features, syntactic (POS) features, mutual information and prior probabilities.

Statistical Features. We use the term frequency (TF), inverse document frequency (IDF), TF*IDF and BM25 score of a word in document d_i as statistical features.

Position Features.

1) The position of the first occurrence of the word:

  FirstOccur_score(w) = (1 / num) \sum_{i=1}^{num} FirstOccur(w, d_i)    (2)
  FirstOccur(w, d) = precede(w, d) / len(d)    (3)

where num is the number of legal documents, precede(w, d) is the number of words before the first occurrence of the word w in the legal document d, and len(d) is the number of words of d.

2) Sentence-initial or sentence-final position:

  InFirstLast_score(w) = (1 / num) \sum_{i=1}^{num} InFirstLast(w, d_i)    (4)
  InFirstLast(w, d) = countFL(w, d) / len(d)    (5)

where countFL(w, d) is the number of times the word w occurs at the beginning or at the end of a sentence in the legal document d.

POS Features. We also use the part-of-speech of a word as a feature. For each sentence, we obtain the POS of each word using the Stanford POS Tagger [6], and we use a subset of POS tags (NNS, NNPS, NNP, NN, VBZ, VBP, VBN, VBG, VBD, VB, TO, JJ, RB) as features.

Mutual Information. High-quality keywords should be semantically related: if the relevance between a word w and a keyword k is low, then w may not be suitable as a keyword [7]. The degree of correlation between words is measured by the average mutual information:

  MI(w_1, w_2, ..., w_k) = (1 / n) \sum_{i,j=1,2,...,k; i != j} I(w_i, w_j)    (6)
  I(w_1, w_2) = p(w_1, w_2) log( p(w_1, w_2) / (p(w_1) p(w_2)) )    (7)

where n is the number of word pairs (w_i, w_j) and I(w_i, w_j) is the mutual information of w_i and w_j.

Prior Probabilities. Using the gold standard, we build a prior keyword set P, which contains the keywords whose IDF is greater than two:

  Prior_score(w, P) = 1 if w is in P, 0 otherwise    (8)
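A minimal sketch of some of these features for one document is given below (the helper names are hypothetical; sentence segmentation and the corpus-level averaging over the num training documents in Eqs. (2) and (4) are assumed to happen elsewhere).

```python
# Minimal sketch of the per-document position features of Eqs. (3) and (5)
# and the prior-probability feature of Eq. (8).
def first_occur(word, doc_words):
    """Eq. (3): precede(w, d) / len(d), relative position of the first occurrence."""
    return doc_words.index(word) / len(doc_words)  # assumes `word` occurs in the document

def in_first_last(word, sentences, doc_len):
    """Eq. (5): countFL(w, d) / len(d), how often w opens or closes a sentence."""
    count_fl = sum(1 for s in sentences if s and (s[0] == word or s[-1] == word))
    return count_fl / doc_len

def prior_score(word, prior_set):
    """Eq. (8): 1 if the word is in the prior keyword set P, else 0."""
    return 1.0 if word in prior_set else 0.0
```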
3 Methods of Precedence Retrieval

The task of Precedence Retrieval is to retrieve the prior cases for a given current case. We view the current case as the query and the prior cases as the documents, and use three classical information retrieval models to solve the precedence retrieval problem.

3.1 Language Model Method

For the query model and the document model, we use the language model with Dirichlet prior smoothing [8]. The relevance between query and document is computed as follows:

  score(q, d) = log p(q | d) = \sum_{w_i \in q \cap d} c(w_i, q) log[1 + c(w_i, d) / (\mu p(w_i | C))] + n log(\mu / (\mu + len(d)))    (9)

where c(w_i, q) and c(w_i, d) are the counts of w_i in the query and in the document, p(w_i | C) is the collection language model, \mu is the Dirichlet smoothing parameter, n is the query length and len(d) is the document length.

3.2 Probability Model

The second retrieval model we chose is the BM25 model [9]. The relevance score is computed as follows:

  BM25(q, d) = \sum_{w_i \in q} log( idf(w_i) tf(w_i, d) (k_1 + 1) / ( tf(w_i, d) + k_1 (1 - b + b len(d) / avdl) ) )    (10)

where q is the query, d is the candidate document, tf(w_i, d) is the term frequency of w_i in d, idf(w_i) is its inverse document frequency, avdl is the average document length, and k_1 and b are the adjustment parameters.

3.3 Vector Space Model

We also used Lucene [10], which implements the vector space model, to estimate the relevance of query and document. The scoring formula is:

  score(q, d) = coord(q, d) queryNorm(q) \sum_{t \in q} ( tf(t in d) idf(t)^2 t.getBoost() norm(t, d) )    (11)

Here, t is a query term; coord(q, d) rewards documents that contain more of the query terms; tf(t in d) is the frequency of t in document d; idf(t) is the inverse document frequency of t; t.getBoost() is the boost of term t; and norm(t, d) is the normalization factor.
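The sketch below (not the submitted system; the dictionaries p_wC and idf and the tokenized inputs are assumed to be built beforehand) illustrates the scoring functions of Eqs. (9) and (10) with the parameter values reported in Section 4.2.2.

```python
import math

# Minimal sketch of the Dirichlet-smoothed query-likelihood score of Eq. (9).
def lm_dirichlet_score(query, doc, p_wC, mu=10000.0):
    score = 0.0
    for w in set(query):
        c_wd = doc.count(w)
        if c_wd > 0:  # sum only over query terms that occur in the document
            score += query.count(w) * math.log(1.0 + c_wd / (mu * p_wC[w]))
    score += len(query) * math.log(mu / (mu + len(doc)))  # document-length term
    return score

# Minimal sketch of the BM25 score of Eq. (10).
def bm25_score(query, doc, idf, avdl, k1=1.8, b=0.7):
    score = 0.0
    for w in set(query):
        tf = doc.count(w)
        if tf == 0:
            continue
        norm = tf + k1 * (1.0 - b + b * len(doc) / avdl)
        score += math.log(idf[w] * tf * (k1 + 1.0) / norm)
    return score
```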
4 Experimental Results

4.1 Results of Catchphrase Extraction

4.1.1 Dataset. In this task, a set of legal documents (Indian Supreme Court decisions) is provided. For some of these documents (the training set), the catchphrases (gold standard) are provided. Catchphrases are short phrases from within the text of the document. They have been obtained from a well-known legal search system, Manupatra (www.manupatra.co.in), which employs legal experts to annotate case documents with catchphrases. The remaining documents are used as the test set. The Catchphrase Extraction dataset contains 100 training documents and 300 test documents.

4.1.2 Experimental Settings. First, we construct a candidate catchphrase set. Statistics on the training corpus of catchphrase extraction show that nouns account for 68.85%, verbs for 9.13%, adjectives for 12.49%, prepositions for 5.90%, adverbs for 1.43% and other types for 2.20%. According to this distribution, we choose nouns, verbs, adjectives and adverbs as candidates. On the candidate set, we use different combinations of features in the classification and ranking methods: the bagging models use TF*IDF, BM25, the position features, POS, mutual information and prior probabilities as the feature set, while RankSVM uses TF*IDF, POS and prior probabilities. The parameter settings of the models are shown in Table 1.

Table 1: Parameter settings of the models in Task 1

  Method        Parameters
  Bagging(DTC)  min_samples_split=2, n_estimators=66, max_samples=0.5, max_features=0.5
  Bagging(RFC)  n_estimators=78, min_samples_split=2, max_samples=0.5, max_features=0.5
  RankSVM       c=16.0

4.1.3 Results. We submitted three runs: bagging(DTC) (HLJIT2017_IRLeD_Task1_1), bagging(RFC) (HLJIT2017_IRLeD_Task1_2) and RankSVM (HLJIT2017_IRLeD_Task1_3). The experimental results are shown in Table 2.

Table 2: FIRE 2017 IRLeD run evaluation for Catchphrase Extraction

  Metric              Bagging(DTC)  Bagging(RFC)  RankSVM
  Mean R-Precision    0.0297        0.0335        0.0864
  Mean Precision@10   0.0576        0.0600        0.1220
  Mean Recall@100     0.0328        0.0440        0.1514
  Overall Recall      0.0328        0.0440        0.1519
  MAP                 0.1401        0.1241        0.1649

Note that our three submitted runs are close on MAP but differ considerably on the other evaluation metrics. We surmise that this is mainly due to the ordering of the submitted catchphrases: the results of the two classification methods are sorted alphabetically, while the results of the ranking method are submitted in descending order of score. In addition, we only selected single words, not phrases, as catchphrases. We tried some rule-based methods to construct phrases from the words extracted from the legal documents, but did not achieve an improvement in performance. The low scores on Mean R-Precision, Mean Precision@10, Mean Recall@100 and Overall Recall may be caused by this.

4.2 Results of Precedence Retrieval

4.2.1 Dataset. Task 2 provides two data sets: 200 current cases (Query_docs), formed by removing the citation links from the case texts, and 2000 prior cases, consisting of the cases cited by the cases in Query_docs along with some random cases (not among Query_docs).

4.2.2 Experimental Settings. We perform the same pre-processing for query extraction and index building. In order to discard interfering information in the documents, we filter them by lexical information, keeping only nouns, verbs and adjectives; Porter stemming, lower-casing and stop-word removal are also applied. For the language model, we set mu=10000 and lambda=0.5; for BM25, we set k1=1.8 and b=0.7.

4.2.3 Results. In Task 2, we submitted three runs: Language Model (HLJIT2017_IRLeD_Task2_1), Vector Space Model (HLJIT2017_IRLeD_Task2_2) and BM25 (HLJIT2017_IRLeD_Task2_3). The results are shown in Table 3.

Table 3: FIRE 2017 IRLeD run evaluation for Precedence Retrieval

  Metric                 Language Model  BM25    Vector Space Model
  MAP                    0.3291          0.1784  0.2479
  Mean Reciprocal Rank   0.6325          0.4074  0.5246
  Precision@10           0.2180          0.1290  0.1665
  Recall@100             0.6810          0.5950  0.6710

From the experimental results, the language model is much better than the other two models: its MAP is 0.3291, the MAP of the vector space model is 0.2479, and the probability model (BM25) has the lowest MAP at 0.1784. Some methods that could improve the performance of the language model, such as query expansion and document expansion, have not yet been applied in this evaluation; they may be part of our future work. In addition, for smoothing the document model we only used the given set of legal documents; such a small collection leads to sparse word statistics, and the document and query models were not tuned to their optimum. We believe these methods would improve the performance of the retrieval models and will try them in future research.

5 Conclusions

We described our approach to the Catchphrase Extraction and Precedence Retrieval problems of the FIRE 2017 IRLeD track.

For the task of Catchphrase Extraction, we tried classification-based and ranking-based methods, integrating various types of features into the models. The experiments show that the ranking-based model achieved better performance.

For the task of Precedence Retrieval, we only tried several basic retrieval models: the language model, the probability model (BM25) and the vector space model. The experiments show that the language model performs better than the vector space model and BM25.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61772177) and the Special Subject of the State Key Laboratory of Digital Publishing Technology (The Research on Plagiarism Detection: From Heuristic to Machine Learning).

References

[1] A. Mandal, K. Ghosh, A. Bhattacharya, A. Pal and S. Ghosh. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.
[2] Du P, Xia J, Zhang W, et al. Multiple classifier system for remote sensing image classification: A review. Sensors, 2012, 12(4): 4764-4792.
[3] Liaw A, Wiener M. Classification and regression by randomForest. R News, 2002, 2(3): 18-22.
[4] Joachims T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.
[5] Hulth A. Improved automatic keyword extraction given more linguistic knowledge. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2003: 216-223.
[6] Toutanova K, Klein D, Manning C D, Singer Y. Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2003: 173-180.
[7] Turney P D. Coherent keyphrase extraction via web mining. arXiv preprint cs/0308033, 2003.
[8] Song F, Croft W B. A general language model for information retrieval. Proceedings of the Eighth International Conference on Information and Knowledge Management. ACM, 1999: 316-321.
[9] Robertson S, Zaragoza H. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 2009, 3(4): 333-389.
[10] McCandless M, Hatcher E, Gospodnetic O. Lucene in Action: Covers Apache Lucene 3.0. Manning Publications Co., 2010.