<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine Learning Approaches for Catchphrase Extraction in Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tshepho Koboyatshwene</string-name>
          <email>tshepho.koboyatshwene@mopipi.ub.bw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moemedi Lefoane</string-name>
          <email>moemedi.lefoane@mopipi.ub.bw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lakshmi Narasimhan</string-name>
          <email>lakshmi.narasimhan@mopipi.ub.bw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Botswana</institution>
          ,
          <addr-line>Gaborone</addr-line>
          ,
          <country country="BW">Botswana</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The purpose of this research was to automatically extract catchphrases from a set of legal documents. Our focus was mainly on machine learning approaches: a comparative study between unsupervised and supervised methods. The idea was to compare the two approaches to determine which was better suited for automatic catchphrase extraction from a dataset of legal documents. To this end, two open-source text mining tools were used, one for the unsupervised approach and another for the supervised approach. We fine-tuned parameters for each tool before extracting catchphrases. The training dataset was used during parameter tuning in order to find optimal parameters, which were then used for generating the final catchphrases. Different metrics were used to evaluate the results. We used the most common measures in information extraction, including precision and recall, and compared the results of the two machine learning approaches. In general, our results showed that the supervised approach performed far better than the unsupervised approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Automatic keyword or catchphrase extraction is an area of
research that has not been widely explored. Determining
catchphrases manually can be time-consuming, expensive, and
usually requires expertise [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which has
motivated research towards automatic keyword extraction. Different
terminologies are used for the terms that represent the
most relevant or useful information contained in a document, such
as key phrases, key segments, key terms and keywords [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In
the FIRE 2017 Information Retrieval from Legal Documents (IRLeD)
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the word "catchphrase" is used instead of keyword or key
phrase in the legal domain.
      </p>
      <p>
        Keyword extraction involves automatically searching for and
identifying the keywords within a document that best describe the
subject of the document [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ]. Methods used for automatic keyword
extraction can be classified into different approaches. According
to Beliga et al. and Lima et al. [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ], the methods can be based on Simple
statistical approaches, Linguistic approaches, or Machine Learning
approaches, among others.
      </p>
      <p>
        As the name suggests, Simple statistical approaches are
straightforward: they need no training and are language- and
domain-independent. Keywords can be identified using statistics
of a word, such as word frequency, word co-occurrence, term
frequency-inverse document frequency (TF-IDF), or N-gram statistics.
The disadvantage of this approach is that in some
domains, such as health and medicine, the most important keyword
may appear only once [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ]. Linguistic approaches look at
linguistic features of words, sentences and documents, such as
lexical and syntactic structure and semantic analysis [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ]. Machine Learning
approaches may be unsupervised or supervised (see
Section 2). Other approaches combine the methods
described above and may also incorporate heuristic knowledge
such as the position, length, or layout features of terms [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ].
      </p>
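      <p>As a concrete illustration of the statistical family, the following minimal Python sketch (our own illustration, not code from the cited works) scores each word of each document by TF-IDF; the toy documents are hypothetical.</p>
      <preformat><![CDATA[
import math
from collections import Counter

def tfidf_scores(documents):
    """Score every word of every document by TF-IDF.

    documents: list of pre-tokenised documents (lists of lowercase words).
    Returns one {word: score} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

docs = [["the", "court", "held", "the", "appeal"],
        ["the", "appeal", "was", "dismissed"]]
for ranking in tfidf_scores(docs):
    print(sorted(ranking.items(), key=lambda kv: -kv[1])[:3])
]]></preformat>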
      <p>
        This paper is organized as follows. We first present related
work on the supervised and unsupervised Machine Learning
approaches, focusing mainly on Rapid Automatic Keyword
Extraction (RAKE) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Multi-purpose Automatic Topic Indexing (MAUI) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
We then describe our proposed approach, including all the
experimental setups performed. Thirdly, we give a brief overview
of the measures used for evaluating the results. We then present and
discuss the results. Lastly, we conclude and briefly discuss
possible future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        According to Lima et al. and Rose et al. [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], RAKE is an
unsupervised Machine Learning approach which requires no
training and works by first selecting candidate keywords. Lima
et al. and Rose et al. [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] outlined RAKE’s input parameters,
consisting of a stop list, a set of phrase delimiters, and a set of word
delimiters. Firstly, the document is partitioned into candidate
keywords using the phrase and word delimiters. After the selection of
candidate keywords, a graph of word co-occurrences is
created and each candidate keyword is assigned a score. Several
metrics are used to calculate the score, namely: word frequency
freq(w), word degree deg(w), and the ratio of word degree to word
frequency, ratio = deg(w) / freq(w). Candidate keywords are
then ranked starting with the highest score.
      </p>
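      <p>The Python sketch below illustrates the candidate-selection and scoring steps just described. It is a simplified re-implementation under our own assumptions (a toy stoplist, with stopwords and non-letter characters as the only delimiters), not the RAKE library itself.</p>
      <preformat><![CDATA[
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "a", "an", "and", "or", "to", "in", "was", "is"}  # toy stoplist

def rake_scores(text):
    """Score candidate phrases with RAKE's deg(w)/freq(w) metric."""
    # Partition the text into candidate phrases at stopwords.
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # Word frequency, and word degree from co-occurrences within phrases.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # w co-occurs with every word of its phrase
    # A phrase scores the sum of its member words' deg(w)/freq(w) ratios.
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(rake_scores("The court dismissed the appeal and upheld the original sentence."))
]]></preformat>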
      <p>
        According to Medelyan [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], MAUI was built from four
open-source software components: the Keyphrase Extraction Algorithm
(Kea), used for phrase filtering and computing n-gram extractions;
Weka, used for creating topic indexing models and applying them
to new documents; Jena, used for incorporating controlled
vocabularies from external sources; and Wikipedia Miner, used
for accessing Wikipedia data. The four open-source components are
used together with other classes to form a single topic indexing
algorithm that generates candidate topics, computes their
features, builds the topic indexing model, and applies the model
to new documents [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To create a model, a training dataset with
known keyphrases is required; only keyphrases that have already been
incorporated in the training data can then be classified.
Candidate phrases are selected in three
steps, namely: input cleaning, phrase identification, and lastly
case-folding and stemming [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. MAUI has a parameter that can be
varied in order to control the size of the training set: some
candidate catchphrases are discarded, based on their frequency of
occurrence, before the model is created, which therefore reduces the
size of the model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
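      <p>MAUI itself is a Java tool, so we do not reproduce its API here. The Python sketch below merely illustrates, under our own naming, the two classic Kea features (TF×IDF and first-occurrence position) that supervised extractors of this family compute for each candidate phrase before passing it to a classifier.</p>
      <preformat><![CDATA[
import math

def candidate_features(candidate, doc_words, n_docs, doc_freq):
    """Two Kea-style features for a candidate phrase (illustrative names).

    candidate: tuple of words forming the phrase;
    doc_words: the document as a list of words;
    n_docs:    number of documents in the training corpus;
    doc_freq:  number of training documents containing the phrase.
    """
    n = len(candidate)
    # All positions at which the phrase occurs in the document.
    positions = [i for i in range(len(doc_words) - n + 1)
                 if tuple(doc_words[i:i + n]) == candidate]
    tf = len(positions) / len(doc_words)
    tf_idf = tf * math.log(n_docs / max(doc_freq, 1))
    # Relative position of the first occurrence (1.0 if absent).
    first_occurrence = positions[0] / len(doc_words) if positions else 1.0
    return tf_idf, first_occurrence

doc = "the court considered judicial review of the tribunal decision".split()
print(candidate_features(("judicial", "review"), doc, n_docs=100, doc_freq=7))
]]></preformat>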
    </sec>
    <sec id="sec-3">
      <title>PROPOSED APPROACH</title>
      <p>
        A keyword extraction library called RAKE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was used for the
unsupervised approach while MAUI [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was used for the
supervised approach. RAKE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and MAUI [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] provided parameters
that were fine-tuned before generating catchphrases. The approach
used in this research was to set the RAKE and MAUI parameters to
different values and then use part of the training dataset, with known
catchphrases, for evaluation. The results of each approach were
evaluated individually in order to determine the optimal parameters
to be used for extracting catchphrases from the testing data.
We then generated the final catchphrases using the testing data
provided and the optimal parameters that yielded the best results for
each approach.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
    </sec>
    <sec id="sec-5">
      <title>Dataset</title>
      <p>For the IRLeD task, the dataset provided contained the following:
(1) Train docs - 100 case statements.
(2) Train catches - the gold standard catchphrases for
each of the 100 case statements provided in the Train docs.
(3) Test docs - 300 test case statements. For each of
these 300 statements, a set of catchphrases was to be generated.</p>
      <p>The training dataset was randomly divided into two groups of 90
and 10 documents. The 90-document group was used only for training
the supervised machine learning approach, while the remaining 10
documents were used for testing both the unsupervised and supervised
methodologies.</p>
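      <p>A minimal sketch of how such a split could be reproduced is shown below; the random seed is our own assumption, since it is not reported here.</p>
      <preformat><![CDATA[
import random

documents = [f"doc_{i:03d}" for i in range(100)]  # the 100 training case statements
random.seed(42)  # assumed seed; not reported here
held_out = random.sample(documents, 10)              # 10 documents for validating both tools
train = [d for d in documents if d not in held_out]  # 90 documents for training MAUI
print(len(train), len(held_out))  # 90 10
]]></preformat>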
    </sec>
    <sec id="sec-6">
      <title>Experiment 1 - RAKE parameter tuning on training dataset</title>
      <p>RAKE provides the following parameters, which were fine-tuned
across different experiments in order to find the parameter values
that yielded the best performance on the training set provided.
Table 1 gives more details on the parameters experimented with, as
well as the performance results; a sketch of the three filters
follows the list below.</p>
      <p>(1) The minimum number of characters per word can be varied in
order to select keywords whose words have a certain length,
represented as No of Char/word in Table 1.
(2) The maximum number of words per phrase can be tuned,
represented as No of word/phrase in Table 1.
(3) The minimum number of times a keyword must appear in a given
text can be set, represented as keyword frequency in Table 1.</p>
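      <p>The function below is our own illustration of how the three parameters act as filters on candidate keywords; it is not RAKE's actual code.</p>
      <preformat><![CDATA[
def filter_candidates(candidates, min_chars=3, max_words=3, min_freq=1):
    """Apply the three tuning parameters from Table 1.

    candidates: dict mapping a candidate phrase to its number of
    occurrences in the text. Keeps phrases that are at most `max_words`
    words long, occur at least `min_freq` times, and whose every word
    has at least `min_chars` characters.
    """
    kept = {}
    for phrase, count in candidates.items():
        words = phrase.split()
        if (len(words) <= max_words
                and count >= min_freq
                and all(len(w) >= min_chars for w in words)):
            kept[phrase] = count
    return kept

print(filter_candidates({"habeas corpus": 2, "an appeal": 1, "x": 5}))
# {'habeas corpus': 2} -- "an appeal" and "x" fail the 3-character rule
]]></preformat>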
    </sec>
    <sec id="sec-7">
      <title>Experiment 2 - MAUI parameter tuning on training dataset</title>
      <p>As in Section 3.3, parameter tuning experiments were
performed in order to find the optimal parameters for MAUI. The
only parameter tuned for MAUI was the frequency-of-occurrence
threshold below which candidate keywords are discarded.
By default, MAUI discards any candidate phrase that appears
fewer than two times. See Table 2.</p>
    </sec>
    <sec id="sec-8">
      <title>Final Run 1: Using RAKE</title>
      <p>RAKE was used to generate catchphrases for the test documents
provided, with parameters set to 3, 3, 1: each word had
at least 3 characters, each phrase had at most 3 words, and each
keyword appeared in the text at least once.</p>
      <p>UBIRLeD_1 - Catchphrases were generated for each document
together with the corresponding scores for each catchphrase.
</p>
    </sec>
    <sec id="sec-9">
      <title>Final Run 2: Using MAUI</title>
      <p>For the supervised machine learning approach, a MAUI
classifier was trained using all the training documents
provided, together with their known catchphrases. No
candidates were discarded prior to training the model. We then
used the trained model to generate catchphrases for the test
documents.</p>
      <p>UBIRLeD_2: 150 catchphrases were generated for each test
document. The highest ranked catchphrases appeared first for each test
document.
</p>
    </sec>
    <sec id="sec-10">
      <title>EVALUATION</title>
      <p>Several measures were used to evaluate the results of the two
approaches. In these experiments we looked at Recall, Precision and
Mean Average Precision, among others.</p>
    </sec>
    <sec id="sec-11">
      <title>Recall Measure</title>
      <p>
        According to Manning et al [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Recall is defined as the fraction
of relevant documents that are retrieved. In this task, we were
interested in the fraction of relevant catchphrases retrieved in each
document. The formula for Recall is given below, where tp
represents true positives (relevant catchphrases that were retrieved)
and fn represents false negatives (relevant catchphrases that were
not retrieved).
      </p>
      <p>Recall = tp / (tp + fn)</p>
      <p>
        Recall@K would be the proportion of relevant catchphrases
retrieved in the top K.
      </p>
    </sec>
    <sec id="sec-11a">
      <title>Precision Measure</title>
      <p>
        Precision is described as the fraction of retrieved documents
that are relevant, according to Manning et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this task, precision
is the fraction of retrieved catchphrases that are relevant. The
formula for Precision is given below, where tp represents true
positives and fp represents false positives: cases in which
non-relevant catchphrases have been retrieved as relevant.
      </p>
      <p>Precision = tp / (tp + fp)</p>
      <p>Precision@K would be the proportion of the top K retrieved
catchphrases that are relevant. Mean Precision@K is then the mean of
the Precision@K values over all test documents in the collection.</p>
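      <p>The following sketch shows how these measures can be computed for a single test document; the gold and ranked catchphrases are hypothetical.</p>
      <preformat><![CDATA[
def recall(retrieved, relevant):
    """Fraction of the gold catchphrases that were retrieved: tp / (tp + fn)."""
    tp = len(set(retrieved) & set(relevant))
    return tp / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-K retrieved catchphrases that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

gold = {"natural justice", "judicial review", "bias"}
ranked = ["judicial review", "appeal", "bias", "costs"]
print(recall(ranked, gold))             # 2/3
print(precision_at_k(ranked, gold, 3))  # 2/3
]]></preformat>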
      <p>
        We used Manning et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]’s ideas when finding the Mean
R-Precision. Computing Mean R-Precision required knowledge of all
the catchphrases that were relevant for each test document, where R
represents the total number of expected relevant catchphrases
for a particular test document. R was then used as the cutoff for
calculating precision; precision is equal to recall at the
R-th position. Suppose that R relevant catchphrases were expected
for test document Td1, and only r relevant catchphrases were
retrieved within the top R positions. We would then calculate the
precision of the top R catchphrases retrieved as RPrecision = r / R.
The Mean R-Precision is the mean of the R-Precision values over all
the test documents (queries).
    </sec>
    <sec id="sec-12">
      <title>Mean Average Precision Measure</title>
      <p>
        The Mean Average Precision (MAP) value is defined as "the arithmetic
mean of average precision values for individual information needs"
by Manning et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The formula for MAP
is given below, where MAP(Q) is the mean of the average precision
values across the whole collection of queries, the queries here being
the test documents, and Precision(R_jk) is the precision of the
ranked retrieved catchphrases from the top result down to position k
for test document j. For each of the test documents, a set of ranked
catchphrases was produced, which was then used to compute precision
and average precision (AP). Average precision is the mean of the
precision scores after each relevant catchphrase is retrieved.
      </p>
      <p>MAP(Q) = (1/|Q|) ∑_{j=1}^{|Q|} (1/m_j) ∑_{k=1}^{m_j} Precision(R_jk)</p>
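      <p>A short sketch of Average Precision, MAP and R-Precision as used above; the gold and ranked lists are hypothetical.</p>
      <preformat><![CDATA[
def r_precision(retrieved, relevant):
    """Precision at cutoff R, where R is the number of expected relevant items."""
    r = len(relevant)
    return len(set(retrieved[:r]) & set(relevant)) / r if r else 0.0

def average_precision(retrieved, relevant):
    """Mean of the precision values taken at each rank where a relevant item appears."""
    hits, precisions = 0, []
    for k, phrase in enumerate(retrieved, start=1):
        if phrase in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over the collection: runs is a list of (ranked_list, gold_set) pairs."""
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)

gold = {"natural justice", "bias"}
ranked = ["bias", "appeal", "natural justice"]
print(r_precision(ranked, gold))                 # 0.5
print(mean_average_precision([(ranked, gold)]))  # (1/1 + 2/3) / 2 = 0.833...
]]></preformat>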
    </sec>
    <sec id="sec-12a">
      <title>RESULTS AND DISCUSSION</title>
      <p>Consider the results displayed in Table 3. The UBIRLeD_1 and
UBIRLeD_2 rows contain the performance measures obtained using the
catchphrases generated by RAKE and MAUI respectively, as described
in the two final runs above. Using the performance measures stated
in Section 4, we observed that MAUI, the supervised approach,
performed far better than RAKE, the unsupervised approach. Comparing
the results based on Mean Precision@10, we found that the proportion
of relevant catchphrases in the top 10 was much higher for MAUI: the
MAUI result was 0.254 while the RAKE result was 0.013. Looking at
Mean Recall@100, MAUI again outperformed RAKE by retrieving more
relevant catchphrases in the top 100. When computing MAP, the
assumption was that we were interested in finding more relevant
catchphrases for each test document, and hence we computed the mean
of the average precision values of the test documents. The MAP value
obtained for MAUI was higher than the value computed from the RAKE
results. The Mean R-Precision value for MAUI likewise showed a far
better proportion of retrieved catchphrases that were relevant,
considering the cutoff point, which equals the number of relevant
catchphrases expected for each document in the testing dataset.
On overall recall, RAKE was better, although that was the only
measure on which it compared favourably with MAUI.</p>
    </sec>
    <sec id="sec-13">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we proposed and compared two Machine
Learning approaches, RAKE and MAUI, for the unsupervised
and supervised settings respectively. In the proposed approach,
fine-tuning parameters before generating candidate catchphrases
allowed us to obtain the optimal parameters for each method used.
Based on the optimal parameters used for generating the final
catchphrases, MAUI overall performed better than RAKE;
the difference in performance was observed on most measures.
RAKE achieved the highest recall, but its precision was very low
compared to MAUI. We strongly believe that the legal domain is an
area which still requires a lot of work on Information Extraction.
For future work, we plan to experiment with different
techniques for the supervised Machine Learning approach and
evaluate the performance after applying those techniques.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Slobodan</given-names>
            <surname>Beliga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ana</given-names>
            <surname>Meštrović</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sanda</given-names>
            <surname>Martinčić-Ipšić</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>An Overview of Graph-Based Keyword Extraction Methods and Approaches</article-title>
          .
          <source>Journal of Information and Organizational Sciences</source>
          39,
          <issue>1</issue>
          (
          <year>June 2015</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . http://hrcak.srce.hr/140857
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Arpan</given-names>
            <surname>Mandal</surname>
          </string-name>
          , Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and
          <string-name>
            <given-names>Saptarshi</given-names>
            <surname>Ghosh</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD)</article-title>
          .
          In
          <source>Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings)</source>
          . CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Prabhakar Raghavan, and
          <string-name>
            <given-names>Hinrich</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <source>Introduction to Information Retrieval</source>
          . Cambridge University Press, Cambridge, UK. http://nlp.stanford.edu/IR-book/information-retrieval-book.html
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Olena</given-names>
            <surname>Medelyan</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Human-competitive automatic topic indexing</article-title>
          . Ph.D. thesis, The University of Waikato (
          <year>2009</year>
          ). http://cds.cern.ch/record/1198029
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Rose</surname>
          </string-name>
          , Dave Engel, Nick Cramer, and
          <string-name>
            <given-names>Wendy</given-names>
            <surname>Cowley</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Automatic Keyword Extraction from Individual Documents</article-title>
          . In
          <source>Text Mining: Applications and Theory</source>
          . John Wiley and Sons (
          <year>2010</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Lima</given-names>
            <surname>Subramanian</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.S</given-names>
            <surname>Karthik</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>KEYWORD EXTRACTION: A COMPARATIVE STUDY USING GRAPH BASED MODEL AND RAKE</article-title>
          .
          <source>Int. J. of Adv. Res</source>
          .
          <volume>5</volume>
          (
          <issue>3</issue>
          ) (
          <year>2017</year>
          ),
          <fpage>1133</fpage>
          -
          <lpage>1137</lpage>
          . https://doi.org/10.21474/IJAR01/3616
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning.
          <year>1999</year>
          .
          <article-title>KEA: Practical Automatic Keyphrase Extraction</article-title>
          . (5 Feb. 1999). arXiv:cs/9902007 http://arxiv.org/abs/cs/9902007
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>