Machine Learning Approaches for Catchphrase Extraction in Legal Documents

Tshepho Koboyatshwene, Moemedi Lefoane, Lakshmi Narasimhan
University of Botswana, Gaborone, Botswana
tshepho.koboyatshwene@mopipi.ub.bw, moemedi.lefoane@mopipi.ub.bw, lakshmi.narasimhan@mopipi.ub.bw

ABSTRACT
The purpose of this research was to automatically extract catchphrases from a set of legal documents. Our focus was mainly on machine learning approaches: a comparison was made between an unsupervised and a supervised approach to see which of the two is better for automatic catchphrase extraction on a dataset of legal documents. To perform this comparison, two open-source text mining tools were used, one for the unsupervised approach and another for the supervised approach. We fine-tuned the parameters of each tool before extracting catchphrases; the training dataset was used during tuning to find optimal parameters, which were then used for generating the final catchphrases. Different metrics were used to evaluate the results. We used the most common measures in Information Extraction, including Precision and Recall, and compared the results of the two machine learning approaches. Overall, our results showed that the supervised approach performed far better than the unsupervised approach.

KEYWORDS
Catchphrase extraction, Legal domain, IRLeD

1 INTRODUCTION
Automatic keyword or catchphrase extraction is an area of research that has not yet been exploited much. Determining catchphrases manually can be time consuming and expensive, and usually requires expertise [1]; this has motivated research towards automatic keyword extraction. Different terminologies are used for the terms that represent the most relevant or useful information contained in a document, such as key phrases, key segments, key terms and keywords [1]. In the FIRE 2017 Information Retrieval from Legal Documents (IRLeD) task [2], the word "catchphrase" is used instead of keyword or key phrase for the legal domain.

Keyword extraction involves automatically searching for and identifying the keywords within a document that best describe its subject [1, 6]. Methods for automatic keyword extraction can be classified into different approaches. According to Beliga et al. and Lima et al. [1, 6], these include simple statistical approaches, linguistic approaches and machine learning approaches, among others. As the name suggests, simple statistical approaches are very simple; they need no training and are language and domain independent. Keywords are identified using statistics of the words, such as word frequency, word co-occurrences, term frequency-inverse document frequency (TF-IDF) and N-gram statistics. The disadvantage of this approach is that in some domains, such as health and medicine, the most important keyword may appear only once [1, 6]. Linguistic approaches look at linguistic features of words, sentences and documents, such as lexical and syntactic structure and semantic analysis [1, 6]. Machine learning approaches comprise both unsupervised and supervised methods (see Section 2). Other approaches combine the methods described above and may also incorporate heuristic knowledge such as the position, length and layout features of terms [1, 6].
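As an aside on the simple statistical family, the following minimal sketch shows TF-IDF scoring of candidate terms. It is purely illustrative and is not one of the tools compared in this paper; the tokenizer and the two-document corpus are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer; a real system would also remove stop words and apply stemming.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_scores(documents):
    """Return one {term: tf-idf score} dictionary per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

# Hypothetical mini-corpus: the highest-scoring terms in each document
# would be proposed as keywords.
docs = ["The court dismissed the appeal with costs.",
        "The appeal against the interim order was allowed."]
for per_doc in tfidf_scores(docs):
    print(sorted(per_doc.items(), key=lambda kv: -kv[1])[:3])
```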
This paper is organized as follows. We first present related work on the unsupervised and supervised machine learning approaches, focusing mainly on Rapid Automatic Keyword Extraction (RAKE) [5] and Multi-purpose Automatic Topic Indexing (MAUI) [4]. This is followed by the approach we propose, including all the experimental setups performed. We then give a brief overview of the measures used for evaluating the results, present and discuss the results, and finally conclude and briefly outline possible future work.

2 RELATED WORK
According to Lima et al. and Rose et al. [5, 6], RAKE is an unsupervised machine learning approach that does not require any training and works by first selecting candidate keywords. Lima et al. and Rose et al. [5, 6] describe RAKE's input parameters as a stop list, a set of phrase delimiters and a set of word delimiters. The document is first partitioned into candidate keywords using the phrase and word delimiters. After the selection of candidate keywords, a graph of word co-occurrences is created and each candidate keyword is assigned a score. Several metrics are used to calculate the score, namely the word frequency freq(w), the word degree deg(w), and the ratio of word degree to word frequency, ratio = deg(w) / freq(w). Candidate keywords are then ranked starting with the highest score.
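To illustrate the scoring just described, here is a minimal sketch of RAKE-style candidate selection and degree-to-frequency scoring, written from the description in [5, 6] rather than taken from the RAKE library used in our experiments; the stop list and delimiters are simplified placeholders.

```python
import re
from collections import defaultdict

# Simplified stop list; a real RAKE run uses a full stop word list.
STOP_WORDS = {"the", "of", "and", "to", "in", "was", "is", "a", "an", "with", "against"}

def candidate_phrases(text):
    """Split on phrase delimiters and stop words to obtain candidate keywords."""
    phrases = []
    for fragment in re.split(r"[.,;:()\n]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text):
    """Score each candidate phrase as the sum of its words' deg(w)/freq(w) ratios."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase (including the word itself).
            degree[word] += len(phrase)
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(rake_scores("The appellant filed an appeal against the judgment of the High Court."))
```

In the actual experiments, the RAKE library exposes this behaviour through the parameters described in Section 3.3 (minimum characters per word, maximum words per phrase, minimum keyword frequency) rather than through the simplified stop list above.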
According to Medelyan [4], MAUI was built from four open-source software components: the Keyphrase Extraction Algorithm (KEA), used for phrase filtering and computing n-gram extractions; Weka, used for creating topic indexing models and applying them to new documents; Jena, used for incorporating controlled vocabularies from external sources; and Wikipedia Miner, used for accessing Wikipedia data. These four components are used together with other classes to form a single topic indexing algorithm that generates candidate topics, computes their features, builds the topic indexing model and applies the model to new documents [4]. To create a model, a training dataset with known keyphrases is required; the only keyphrases that will then be classified are the ones that have already been incorporated in the training data. Candidate phrases are selected in three steps: cleaning of the input, phrase identification, and finally case-folding and stemming [7]. MAUI has a parameter that can be varied in order to control the size of the training set: some candidate catchphrases are discarded based on their frequency of occurrence before the model is created, which reduces the size of the model [4].

3 PROPOSED APPROACH
A keyword extraction library, RAKE [5], was used for the unsupervised approach, while MAUI [4] was used for the supervised approach. Both RAKE [5] and MAUI [4] have parameters that were fine-tuned before generating catchphrases. The approach used in this research was to set the RAKE and MAUI parameters to different values and then use part of the training dataset, with its known catchphrases, for evaluation. The results of each approach were evaluated individually in order to determine the optimal parameters to be used for extracting catchphrases from the test data. We then generated the final catchphrases using the test data provided and the optimal parameters that yielded the best results for each approach.

3.1 Experimental Setup
3.2 Dataset
For the IRLeD task, the dataset provided contained the following:
(1) Train docs - 100 case statements.
(2) Train catches - the gold standard catchphrases for each of the 100 case statements provided in the Train docs.
(3) Test docs - 300 test case statements. For each of these 300 statements, a set of catchphrases was to be generated.
The training dataset was randomly divided into two groups of 90 documents and 10 documents. The 90-document set was used only for training the supervised machine learning approach, while the remaining 10 documents were used for testing both the unsupervised and the supervised methodologies.

3.3 Experiment 1 - RAKE parameter tuning on training dataset
RAKE has the following parameters, which were fine-tuned across different experiments in order to find the values that yielded the best performance on the training set provided. Table 1 gives details of the parameters experimented with as well as the performance results.
(1) The minimum number of characters per word can be varied in order to select keywords whose words have a certain length, represented as No of Char/word in Table 1.
(2) The maximum number of words per phrase can be tuned, represented as No of words/phrase in Table 1.
(3) The minimum number of times a keyword must appear in a given text can be set, represented as keyword frequency in Table 1.

3.4 Experiment 2 - MAUI parameter tuning on training dataset
As in Section 3.3, parameter tuning experiments were performed in order to find the optimal parameters for MAUI. The only parameter tuned for MAUI was the frequency of occurrence of each keyword, with candidate keywords discarded on that basis. The default MAUI setting discards any candidate phrase that appears fewer than two times. See Table 2.

3.5 Final Run 1: Using RAKE
RAKE was used to generate catchphrases for the test documents provided, with parameters tuned to 3 3 1: each word had at least 3 characters, each phrase had at most 3 words, and each keyword appeared in the text at least once.
UBIRLeD_1 - Catchphrases were generated for each document together with the corresponding score for each catchphrase.

3.6 Final Run 2: Using MAUI
For the supervised machine learning approach (MAUI), a classifier was trained using all the training documents provided together with the known catchphrases in the training set. No candidates were discarded prior to training the model. We then used the trained model to generate catchphrases for the test documents.
UBIRLeD_2 - 150 catchphrases were generated for each test document, with the highest ranked catchphrases appearing first.

4 EVALUATION
Several measures were used to evaluate the results of the two approaches. In these experiments we looked at Recall, Precision and Mean Average Precision, among others.

4.1 Recall Measure
According to Manning et al [3], Recall is defined as the fraction of relevant documents that are retrieved. In this task, we were interested in the fraction of relevant catchphrases retrieved for each document. The formula for Recall is given in Figure 1, where tp represents true positives, the relevant catchphrases that were retrieved, and fn represents false negatives, the relevant catchphrases that were not retrieved.

Recall = tp / (tp + fn)

Figure 1: Recall equation as described by Manning et al [3]

Recall@K is the proportion of relevant catchphrases that have been retrieved in the top K.
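As a concrete illustration of how these cutoff-based measures were applied to ranked catchphrase lists, the sketch below computes Recall@K and the Precision@K used in Section 4.2. It assumes gold-standard and retrieved catchphrases are compared as lowercased exact-match strings; the matching criterion used by the official task evaluation may differ.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold-standard catchphrases found in the top-k retrieved list."""
    top_k = {phrase.lower() for phrase in retrieved[:k]}
    gold = {phrase.lower() for phrase in relevant}
    return len(top_k & gold) / len(gold) if gold else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved catchphrases that are in the gold standard."""
    top_k = [phrase.lower() for phrase in retrieved[:k]]
    gold = {phrase.lower() for phrase in relevant}
    return sum(1 for phrase in top_k if phrase in gold) / k if k else 0.0

# Hypothetical example for a single test document.
retrieved = ["natural justice", "appeal", "interim order", "writ petition"]
relevant = ["natural justice", "writ petition", "fundamental rights"]
print(recall_at_k(retrieved, relevant, 4))     # 2 of the 3 gold catchphrases are found
print(precision_at_k(retrieved, relevant, 4))  # 2 of the top 4 are relevant
```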
Table 1: Results for RAKE parameter tuning

Test Number   No of Char/word   No of words/phrase   Keyword frequency   Recall (%)
1             5                 3                    4                   5.65
2             3                 3                    1                   25.78
3             3                 3                    2                   19.64
4             3                 3                    3                   13.04
5             3                 3                    4                   8.62

Table 2: Results for MAUI parameter tuning

Test Number   Frequency of phrases to keep   Recall (%)
1             1                              68.27
2             2                              48.62
3             3                              31.24
4             10                             6.03

4.2 Precision Measure
Precision is described by Manning et al [3] as the fraction of retrieved documents that are relevant. In this task, Precision is the fraction of retrieved catchphrases that are relevant. The formula for Precision is given in Figure 2, where tp represents true positives and fp represents false positives, that is, non-relevant catchphrases that have been retrieved as if they were relevant.

Precision = tp / (tp + fp)

Figure 2: Precision equation as described by Manning et al [3]

Precision@K is the proportion of the top-K retrieved catchphrases that are relevant. Mean Precision@K is then the mean of the Precision@K values over all test documents in the collection.

We followed Manning et al [3] when computing the Mean R precision. Computing Mean R precision requires knowing all the catchphrases that are relevant for each test document, where R is the total number of expected relevant catchphrases for a particular test document. R is then used as the cutoff for calculating precision, and precision is equal to recall at the R-th position. Suppose that R relevant catchphrases were expected for test document Td1 and only r relevant catchphrases were retrieved within the top R positions; we would then calculate the precision of the top R retrieved catchphrases using the formula given in Figure 3. The Mean R precision is the mean of the R precision values over all the test documents (queries).

RPrecision = r / R

Figure 3: R Precision equation as described by Manning et al [3]

4.3 Mean Average Precision Measure
The Mean Average Precision (MAP) value is defined by Manning et al [3] as "the arithmetic mean of average precision values for individual information needs". The formula for MAP is given in Figure 4, where Q is the set of queries, here the test documents, m_j is the number of relevant catchphrases for test document j, and Precision(R_jk) is the precision of the ranked retrieved catchphrases from the top of the list down to the k-th relevant catchphrase for test document j. For each of the test documents, a set of ranked catchphrases was produced, which was then used to compute precision and average precision (AP). Average precision is the mean of the precision scores obtained after each relevant catchphrase is retrieved.

MAP(Q) = (1 / |Q|) * sum_{j=1}^{|Q|} (1 / m_j) * sum_{k=1}^{m_j} Precision(R_jk)

Figure 4: MAP equation as given by Manning et al [3]
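The following minimal sketch shows how average precision and MAP can be computed over ranked catchphrase lists, again assuming lowercased exact-match comparison between retrieved and gold-standard catchphrases; it is an illustration of the formula above, not the official IRLeD evaluation script.

```python
def average_precision(retrieved, relevant):
    """Mean of the precision values taken at each rank where a relevant catchphrase appears."""
    gold = {phrase.lower() for phrase in relevant}
    hits, precisions = 0, []
    for rank, phrase in enumerate(retrieved, start=1):
        if phrase.lower() in gold:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    # Relevant catchphrases that are never retrieved contribute zero, as in the MAP formula.
    return sum(precisions) / len(gold) if gold else 0.0

def mean_average_precision(runs):
    """MAP over a collection: runs is a list of (retrieved, relevant) pairs, one per test document."""
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)

# Hypothetical two-document example.
runs = [
    (["natural justice", "appeal", "writ petition"], ["natural justice", "writ petition"]),
    (["interim order", "bail"], ["bail"]),
]
print(mean_average_precision(runs))
```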
5 RESULTS
Consider the results displayed in Table 3. The UBIRLeD_1 and UBIRLeD_2 rows contain the performance measures obtained using the catchphrases generated by RAKE and MAUI respectively, as described in Sections 3.5 and 3.6. Using the performance measures described in Section 4, we observed that MAUI, the supervised approach, performed far better than RAKE, the unsupervised approach. Comparing the results on Mean Precision@10, the proportion of relevant catchphrases in the top 10 was higher for MAUI: the MAUI result was 0.254 while the RAKE result was 0.013. On Mean Recall@100, MAUI again outperformed RAKE by retrieving more of the relevant catchphrases in the top 100. For MAP, the assumption was that we were interested in finding more relevant catchphrases for each test document, and hence we computed the mean of the average precision values over the test documents; the MAP value obtained for MAUI was higher than the value computed from the RAKE results. The Mean R precision value for MAUI also showed a far better proportion of relevant catchphrases among those retrieved, using a cutoff equal to the number of relevant catchphrases expected for each document in the test dataset. On overall Recall, RAKE was better, although that was the only measure on which it compared well against MAUI.

Table 3: Final Results for RAKE and MAUI using Test documents

Run         Mean R precision   Mean Precision@10   Mean Recall@100   MAP             Overall Recall
UBIRLeD_1   0.02316392684      0.01366666667       0.1723154757      0.04634794783   0.4992190452
UBIRLeD_2   0.1901020309       0.2543333333        0.3050612978      0.3703664676    0.3259790763

6 CONCLUSION AND FUTURE WORK
In this paper we proposed and compared two machine learning approaches, RAKE and MAUI, as the unsupervised and supervised approaches respectively. In the proposed approach, fine-tuning parameters before generating candidate catchphrases allowed the optimal parameters for each method to be found. Using these optimal parameters to generate the final catchphrases, MAUI achieved higher performance than RAKE overall, and the difference was observed in most measures: RAKE achieved the highest overall recall, but its precision was very low compared to MAUI. We strongly believe that the legal domain is an area that still requires a lot of work on Information Extraction. For future work, we plan to experiment with different techniques for the supervised machine learning approach and evaluate the performance after applying them.

REFERENCES
[1] Slobodan Beliga, Ana Meštrović, and Sanda Martinčić-Ipšić. 2015. An Overview of Graph-Based Keyword Extraction Methods and Approaches. Journal of Information and Organizational Sciences 39, 1 (June 2015), 1-20. http://hrcak.srce.hr/140857
[2] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Saptarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
[3] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK. http://nlp.stanford.edu/IR-book/information-retrieval-book.html
[4] Olena Medelyan. 2009. Human-competitive Automatic Topic Indexing. PhD thesis, presented July 2009. http://cds.cern.ch/record/1198029
[5] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic Keyword Extraction from Individual Documents. (March 2010), 1-20.
[6] Lima Subramanian and R. S. Karthik. 2017. Keyword Extraction: A Comparative Study Using Graph Based Model and RAKE. International Journal of Advanced Research 5, 3 (2017), 1133-1137. https://doi.org/10.21474/IJAR01/3616
[7] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical Automatic Keyphrase Extraction. (5 Feb. 1999). arXiv:cs/9902007. http://arxiv.org/abs/cs/9902007