Machine Learning Approaches for Catchphrase Extraction in Legal Documents

Tshepho Koboyatshwene, Moemedi Lefoane, Lakshmi Narasimhan
University of Botswana, Gaborone, Botswana
tshepho.koboyatshwene@mopipi.ub.bw, moemedi.lefoane@mopipi.ub.bw, lakshmi.narasimhan@mopipi.ub.bw

ABSTRACT
The purpose of this research was to automatically extract catchphrases from a set of legal documents. Our focus was mainly on machine learning approaches: a comparison was made between an unsupervised and a supervised approach to see which of the two is better for automatic catchphrase extraction on a dataset of legal documents. To perform this comparison, two open-source text mining tools were used, one for the unsupervised approach and another for the supervised approach. We fine-tuned the parameters of each tool before extracting catchphrases; the training dataset was used during tuning to find optimal parameters, which were then used for generating the final catchphrases. Different metrics were used to evaluate the results. We used the most common measures in Information Extraction, including Precision and Recall, and compared the results of the two machine learning approaches. Overall, our results showed that the supervised approach performed far better than the unsupervised approach.

KEYWORDS
Catchphrase extraction, Legal domain, IRLeD

1 INTRODUCTION
Automatic keyword or catchphrase extraction is an area of research that has not yet been exploited much. Determining catchphrases manually can be time consuming and expensive, and usually requires expertise [1]; this has motivated research towards automatic keyword extraction. Different terminologies are used for the terms that represent the most relevant or useful information contained in a document, such as key phrases, key segments, key terms and keywords [1]. In the FIRE 2017 Information Retrieval from Legal Documents (IRLeD) task [2], the word "catchphrase" is used instead of keyword or key phrase for the legal domain.

Keyword extraction involves automatically searching for and identifying the keywords within a document that best describe its subject [1, 6]. Methods for automatic keyword extraction can be classified into different approaches. According to Beliga et al. and Lima et al. [1, 6], these include simple statistical approaches, linguistic approaches and machine learning approaches, among others. As the name suggests, simple statistical approaches are very simple; they need no training and are language and domain independent. Keywords are identified using statistics of the words, such as word frequency, word co-occurrences, term frequency-inverse document frequency (TF-IDF) and N-gram statistics. The disadvantage of this approach is that in some domains, such as health and medicine, the most important keyword may appear only once [1, 6]. Linguistic approaches look at linguistic features of words, sentences and documents, such as lexical and syntactic structure and semantic analysis [1, 6]. Machine learning approaches comprise both unsupervised and supervised methods (see Section 2). Other approaches combine the methods described above and may also incorporate heuristic knowledge such as the position, length and layout features of terms [1, 6].
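As an aside on the simple statistical family, the following minimal sketch shows TF-IDF scoring of candidate terms. It is purely illustrative and is not one of the tools compared in this paper; the tokenizer and the two-document corpus are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer; a real system would also remove stop words and apply stemming.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_scores(documents):
    """Return one {term: tf-idf score} dictionary per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

# Hypothetical mini-corpus: the highest-scoring terms in each document
# would be proposed as keywords.
docs = ["The court dismissed the appeal with costs.",
        "The appeal against the interim order was allowed."]
for per_doc in tfidf_scores(docs):
    print(sorted(per_doc.items(), key=lambda kv: -kv[1])[:3])
```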
This paper is organized as follows. We first present related work on the unsupervised and supervised machine learning approaches, focusing mainly on Rapid Automatic Keyword Extraction (RAKE) [5] and Multi-purpose Automatic Topic Indexing (MAUI) [4]. This is followed by the approach we propose, including all the experimental setups performed. We then give a brief overview of the measures used for evaluating the results, present and discuss the results, and finally conclude and briefly outline possible future work.

2 RELATED WORK
According to Lima et al. and Rose et al. [5, 6], RAKE is an unsupervised machine learning approach that does not require any training and works by first selecting candidate keywords. Lima et al. and Rose et al. [5, 6] describe RAKE's input parameters as a stop list, a set of phrase delimiters and a set of word delimiters. The document is first partitioned into candidate keywords using the phrase and word delimiters. After the selection of candidate keywords, a graph of word co-occurrences is created and each candidate keyword is assigned a score. Several metrics are used to calculate the score, namely the word frequency freq(w), the word degree deg(w), and the ratio of word degree to word frequency, ratio = deg(w) / freq(w). Candidate keywords are then ranked starting with the highest score.
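To illustrate the scoring just described, here is a minimal sketch of RAKE-style candidate selection and degree-to-frequency scoring, written from the description in [5, 6] rather than taken from the RAKE library used in our experiments; the stop list and delimiters are simplified placeholders.

```python
import re
from collections import defaultdict

# Simplified stop list; a real RAKE run uses a full stop word list.
STOP_WORDS = {"the", "of", "and", "to", "in", "was", "is", "a", "an", "with", "against"}

def candidate_phrases(text):
    """Split on phrase delimiters and stop words to obtain candidate keywords."""
    phrases = []
    for fragment in re.split(r"[.,;:()\n]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text):
    """Score each candidate phrase as the sum of its words' deg(w)/freq(w) ratios."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase (including the word itself).
            degree[word] += len(phrase)
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(rake_scores("The appellant filed an appeal against the judgment of the High Court."))
```

In the actual experiments, the RAKE library exposes this behaviour through the parameters described in Section 3.3 (minimum characters per word, maximum words per phrase, minimum keyword frequency) rather than through the simplified stop list above.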
According to Medelyan [4], MAUI was built from four open-source software components: the Keyphrase Extraction Algorithm (KEA), used for phrase filtering and computing n-gram extractions; Weka, used for creating topic indexing models and applying them to new documents; Jena, used for incorporating controlled vocabularies from external sources; and Wikipedia Miner, used for accessing Wikipedia data. These four components are used together with other classes to form a single topic indexing algorithm that generates candidate topics, computes their features, builds the topic indexing model and applies the model to new documents [4]. To create a model, a training dataset with known keyphrases is required; the only keyphrases that will then be classified are the ones that have already been incorporated in the training data. Candidate phrases are selected in three steps: cleaning of the input, phrase identification, and finally case-folding and stemming [7]. MAUI has a parameter that can be varied in order to control the size of the training set: some candidate catchphrases are discarded based on their frequency of occurrence before the model is created, which reduces the size of the model [4].

3 PROPOSED APPROACH
A keyword extraction library, RAKE [5], was used for the unsupervised approach, while MAUI [4] was used for the supervised approach. Both RAKE [5] and MAUI [4] have parameters that were fine-tuned before generating catchphrases. The approach used in this research was to set the RAKE and MAUI parameters to different values and then use part of the training dataset, with its known catchphrases, for evaluation. The results of each approach were evaluated individually in order to determine the optimal parameters to be used for extracting catchphrases from the test data. We then generated the final catchphrases using the test data provided and the optimal parameters that yielded the best results for each approach.

3.1 Experimental Setup
3.2 Dataset
For the IRLeD task, the dataset provided contained the following:
(1) Train docs - 100 case statements.
(2) Train catches - the gold standard catchphrases for each of the 100 case statements provided in the Train docs.
(3) Test docs - 300 test case statements. For each of these 300 statements, a set of catchphrases was to be generated.
The training dataset was randomly divided into two groups of 90 documents and 10 documents. The 90-document set was used only for training the supervised machine learning approach, while the remaining 10 documents were used for testing both the unsupervised and the supervised methodologies.

3.3 Experiment 1 - RAKE parameter tuning on training dataset
RAKE has the following parameters, which were fine-tuned across different experiments in order to find the values that yielded the best performance on the training set provided. Table 1 gives details of the parameters experimented with as well as the performance results.
(1) The minimum number of characters per word can be varied in order to select keywords whose words have a certain length, represented as No of Char/word in Table 1.
(2) The maximum number of words per phrase can be tuned, represented as No of words/phrase in Table 1.
(3) The minimum number of times a keyword must appear in a given text can be set, represented as keyword frequency in Table 1.

3.4 Experiment 2 - MAUI parameter tuning on training dataset
As in Section 3.3, parameter tuning experiments were performed in order to find the optimal parameters for MAUI. The only parameter tuned for MAUI was the frequency of occurrence of each keyword, with candidate keywords discarded on that basis. The default MAUI setting discards any candidate phrase that appears fewer than two times. See Table 2.

3.5 Final Run 1: Using RAKE
RAKE was used to generate catchphrases for the test documents provided, with parameters tuned to 3 3 1: each word had at least 3 characters, each phrase had at most 3 words, and each keyword appeared in the text at least once.
UBIRLeD_1 - Catchphrases were generated for each document together with the corresponding score for each catchphrase.

3.6 Final Run 2: Using MAUI
For the supervised machine learning approach (MAUI), a classifier was trained using all the training documents provided together with the known catchphrases in the training set. No candidates were discarded prior to training the model. We then used the trained model to generate catchphrases for the test documents.
UBIRLeD_2 - 150 catchphrases were generated for each test document, with the highest ranked catchphrases appearing first.

4 EVALUATION
Several measures were used to evaluate the results of the two approaches. In these experiments we looked at Recall, Precision and Mean Average Precision, among others.

4.1 Recall Measure
According to Manning et al [3], Recall is defined as the fraction of relevant documents that are retrieved. In this task, we were interested in the fraction of relevant catchphrases retrieved for each document. The formula for Recall is given in Figure 1, where tp represents true positives, the relevant catchphrases that were retrieved, and fn represents false negatives, the relevant catchphrases that were not retrieved.

Recall = tp / (tp + fn)

Figure 1: Recall equation as described by Manning et al [3]

Recall@K is the proportion of relevant catchphrases that have been retrieved in the top K.
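As a concrete illustration of how these cutoff-based measures were applied to ranked catchphrase lists, the sketch below computes Recall@K and the Precision@K used in Section 4.2. It assumes gold-standard and retrieved catchphrases are compared as lowercased exact-match strings; the matching criterion used by the official task evaluation may differ.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold-standard catchphrases found in the top-k retrieved list."""
    top_k = {phrase.lower() for phrase in retrieved[:k]}
    gold = {phrase.lower() for phrase in relevant}
    return len(top_k & gold) / len(gold) if gold else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved catchphrases that are in the gold standard."""
    top_k = [phrase.lower() for phrase in retrieved[:k]]
    gold = {phrase.lower() for phrase in relevant}
    return sum(1 for phrase in top_k if phrase in gold) / k if k else 0.0

# Hypothetical example for a single test document.
retrieved = ["natural justice", "appeal", "interim order", "writ petition"]
relevant = ["natural justice", "writ petition", "fundamental rights"]
print(recall_at_k(retrieved, relevant, 4))     # 2 of the 3 gold catchphrases are found
print(precision_at_k(retrieved, relevant, 4))  # 2 of the top 4 are relevant
```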
Table 1: Results for RAKE parameter tuning

Test Number   No of Char/word   No of words/phrase   Keyword frequency   Recall (%)
1             5                 3                    4                   5.65
2             3                 3                    1                   25.78
3             3                 3                    2                   19.64
4             3                 3                    3                   13.04
5             3                 3                    4                   8.62

Table 2: Results for MAUI parameter tuning

Test Number   Frequency of phrases to keep   Recall (%)
1             1                              68.27
2             2                              48.62
3             3                              31.24
4             10                             6.03

4.2 Precision Measure
Precision is described by Manning et al [3] as the fraction of retrieved documents that are relevant. In this task, Precision is the fraction of retrieved catchphrases that are relevant. The formula for Precision is given in Figure 2, where tp represents true positives and fp represents false positives, that is, non-relevant catchphrases that have been retrieved as if they were relevant.

Precision = tp / (tp + fp)

Figure 2: Precision equation as described by Manning et al [3]

Precision@K is the proportion of the top-K retrieved catchphrases that are relevant. Mean Precision@K is then the mean of the Precision@K values over all test documents in the collection.

We followed Manning et al [3] when computing the Mean R precision. Computing Mean R precision requires knowing all the catchphrases that are relevant for each test document, where R is the total number of expected relevant catchphrases for a particular test document. R is then used as the cutoff for calculating precision, and precision is equal to recall at the R-th position. Suppose that R relevant catchphrases were expected for test document Td1 and only r relevant catchphrases were retrieved within the top R positions; we would then calculate the precision of the top R retrieved catchphrases using the formula given in Figure 3. The Mean R precision is the mean of the R precision values over all the test documents (queries).

RPrecision = r / R

Figure 3: R Precision equation as described by Manning et al [3]

4.3 Mean Average Precision Measure
The Mean Average Precision (MAP) value is defined by Manning et al [3] as "the arithmetic mean of average precision values for individual information needs". The formula for MAP is given in Figure 4, where Q is the set of queries, here the test documents, m_j is the number of relevant catchphrases for test document j, and Precision(R_jk) is the precision of the ranked retrieved catchphrases from the top of the list down to the k-th relevant catchphrase for test document j. For each of the test documents, a set of ranked catchphrases was produced, which was then used to compute precision and average precision (AP). Average precision is the mean of the precision scores obtained after each relevant catchphrase is retrieved.

MAP(Q) = (1 / |Q|) * sum_{j=1}^{|Q|} (1 / m_j) * sum_{k=1}^{m_j} Precision(R_jk)

Figure 4: MAP equation as given by Manning et al [3]
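The following minimal sketch shows how average precision and MAP can be computed over ranked catchphrase lists, again assuming lowercased exact-match comparison between retrieved and gold-standard catchphrases; it is an illustration of the formula above, not the official IRLeD evaluation script.

```python
def average_precision(retrieved, relevant):
    """Mean of the precision values taken at each rank where a relevant catchphrase appears."""
    gold = {phrase.lower() for phrase in relevant}
    hits, precisions = 0, []
    for rank, phrase in enumerate(retrieved, start=1):
        if phrase.lower() in gold:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    # Relevant catchphrases that are never retrieved contribute zero, as in the MAP formula.
    return sum(precisions) / len(gold) if gold else 0.0

def mean_average_precision(runs):
    """MAP over a collection: runs is a list of (retrieved, relevant) pairs, one per test document."""
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)

# Hypothetical two-document example.
runs = [
    (["natural justice", "appeal", "writ petition"], ["natural justice", "writ petition"]),
    (["interim order", "bail"], ["bail"]),
]
print(mean_average_precision(runs))
```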
5 RESULTS
Consider the results displayed in Table 3. The UBIRLeD_1 and UBIRLeD_2 rows contain the performance measures obtained using the catchphrases generated by RAKE and MAUI respectively, as described in Sections 3.5 and 3.6. Using the performance measures described in Section 4, we observed that MAUI, the supervised approach, performed far better than RAKE, the unsupervised approach. Comparing the results on Mean Precision@10, the proportion of relevant catchphrases in the top 10 was higher for MAUI: the MAUI result was 0.254 while the RAKE result was 0.013. On Mean Recall@100, MAUI again outperformed RAKE by retrieving more of the relevant catchphrases in the top 100. For MAP, the assumption was that we were interested in finding more relevant catchphrases for each test document, and hence we computed the mean of the average precision values over the test documents; the MAP value obtained for MAUI was higher than the value computed from the RAKE results. The Mean R precision value for MAUI also showed a far better proportion of relevant catchphrases among those retrieved, using a cutoff equal to the number of relevant catchphrases expected for each document in the test dataset. On overall Recall, RAKE was better, although that was the only measure on which it compared well against MAUI.

Table 3: Final Results for RAKE and MAUI using Test documents

Run         Mean R precision   Mean Precision@10   Mean Recall@100   MAP             Overall Recall
UBIRLeD_1   0.02316392684      0.01366666667       0.1723154757      0.04634794783   0.4992190452
UBIRLeD_2   0.1901020309       0.2543333333        0.3050612978      0.3703664676    0.3259790763

6 CONCLUSION AND FUTURE WORK
In this paper we proposed and compared two machine learning approaches, RAKE and MAUI, as the unsupervised and supervised approaches respectively. In the proposed approach, fine-tuning parameters before generating candidate catchphrases allowed the optimal parameters for each method to be found. Using these optimal parameters to generate the final catchphrases, MAUI achieved higher performance than RAKE overall, and the difference was observed in most measures: RAKE achieved the highest overall recall, but its precision was very low compared to MAUI. We strongly believe that the legal domain is an area that still requires a lot of work on Information Extraction. For future work, we plan to experiment with different techniques for the supervised machine learning approach and evaluate the performance after applying them.

REFERENCES
[1] Slobodan Beliga, Ana Meštrović, and Sanda Martinčić-Ipšić. 2015. An Overview of Graph-Based Keyword Extraction Methods and Approaches. Journal of Information and Organizational Sciences 39, 1 (June 2015), 1-20. http://hrcak.srce.hr/140857
[2] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Saptarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
[3] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK. http://nlp.stanford.edu/IR-book/information-retrieval-book.html
[4] Olena Medelyan. 2009. Human-competitive Automatic Topic Indexing. PhD thesis, presented July 2009. http://cds.cern.ch/record/1198029
[5] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic Keyword Extraction from Individual Documents. (March 2010), 1-20.
[6] Lima Subramanian and R. S. Karthik. 2017. Keyword Extraction: A Comparative Study Using Graph Based Model and RAKE. International Journal of Advanced Research 5, 3 (2017), 1133-1137. https://doi.org/10.21474/IJAR01/3616
[7] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical Automatic Keyphrase Extraction. (5 Feb. 1999). arXiv:cs/9902007. http://arxiv.org/abs/cs/9902007