<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Catchphrase Extraction from Legal Documents Using LSTM Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rupal Bhargava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukrut Nigwekar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Birla Institute of Technology and Science</institution>
          ,
          <addr-line>Pilani Campus, Pilani-333031</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>WiSoc Lab, Department of Computer Science</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Legal texts usually have a complex structure, and reading through them is a time-consuming and strenuous task. It is therefore essential to provide legal practitioners with a concise representation of the text. Catchphrases are the phrases that state the important issues present in a text, thus effectively characterizing it. This paper proposes an approach for Subtask 1 of the IRLeD (Information Retrieval from Legal Documents) track at FIRE 2017. The proposed algorithm uses a pipelined approach for extracting catchphrases from legal documents.</p>
      </abstract>
      <kwd-group>
        <title>CCS Concepts</title>
        <kwd>Information systems → Retrieval tasks and goals</kwd>
        <kwd>Information systems → Information extraction</kwd>
      </kwd-group>
      <kwd-group>
        <title>Keywords</title>
        <kwd>Keyword Extraction</kwd>
        <kwd>Legal Documents</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>A prior case (also called a precedent) is an older court case
related to the current case, which discusses similar issue(s) and
which can be used as reference in the current case. If an ongoing
case has any related/relevant legal issue(s) that has already been
decided, then the court is expected to follow the interpretations
made in the prior case. For this purpose, it is critical for legal
practitioners to find and study previous court cases, so as to
examine how the ongoing issues were interpreted in the older
cases.</p>
      <p>Generally, legal texts (e.g., court case descriptions) are long and
have complex structures. This makes their thorough reading
time-consuming and strenuous. So, it is essential for legal
practitioners to have a concise representation of the core legal
issues described in a legal text. One way to list the core legal
issues is by keywords or key phrases, which are known as
“catchphrases” in the legal domain.</p>
      <p>To address this issue, FIRE 2017 organized a task to
extract catchphrases from legal documents: given a training set of
documents and their corresponding catchphrases, extract
catchphrases from new documents.
The rest of the paper is organized as follows. Section 2 reviews
related work. Section 3 describes the dataset provided by the
IRLeD 2017 organizers. Section 4 explains the proposed technique.
Section 5 elaborates on the evaluation and error analysis. Section 6
concludes the paper and presents future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Various techniques have been used for the task of keyword
extraction [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. They are broadly divided into supervised, unsupervised,
and heuristic-based approaches. Supervised approaches train a
classifier on documents annotated with keyphrases to determine
whether a candidate phrase is a keyphrase (Witten et al., 1999;
Frank et al., 1999) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Another approach is to build a ranker for keyword
ranking (Jiang et al., 2009) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Unsupervised techniques can be categorized into four
groups. Graph-based ranking builds a graph from the input
document and ranks its nodes by importance using a ranking
method (e.g., Brin and Page (1998)) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Topic-based clustering groups the candidates into
topics such that each topic contains only semantically related
candidates (Grineva et al., 2009) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Simultaneous learning is based on the assumption that
important words occur in important sentences and a sentence is
important if it contains important words (Wan et al. (2007)) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Language modeling scores keywords based on two
features, namely phraseness and informativeness (Tomokiyo and
Hurst (2003)) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Typical heuristics include (1) using a stop word list to
remove stop words (Liu et al., 2009b) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], (2) allowing words with
certain part-of-speech tags (e.g., nouns, adjectives, verbs) to be
candidate keywords (Mihalcea and Tarau, 2004) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], (3) allowing
n-grams that appear in Wikipedia article titles to be candidates
(Grineva et al., 2009) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and (4) extracting n-grams (Witten et
al., 1999) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or noun phrases (Barker and Cornacchia, 2000) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
that satisfy pre-defined lexico-syntactic pattern(s) (Nguyen and
Phan, 2009) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. DATASET DESCRIPTION</title>
      <p>
        The dataset provided by the organizers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] contained two sets of
legal texts: training and test. The training set was
accompanied by the catchphrases corresponding to each text.
The given catchphrases mainly consisted of words present in the
text and rarely included phrases that were not present in the
document.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. PROPOSED TECHNIQUE</title>
      <p>The problem is formulated as a classification task, and the
objective is to learn a classifier using an LSTM network. The
proposed methodology is a pipeline divided into four phases:</p>
      <list list-type="order">
        <list-item><p>Pre-processing</p></list-item>
        <list-item><p>Candidate phrase generation</p></list-item>
        <list-item><p>Creating vector representations for the phrases</p></list-item>
        <list-item><p>Training an LSTM network</p></list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>4.1 Pre-Processing</title>
      <p>The legal texts were pre-processed to ensure uniformity.
Pre-processing included removal of special characters, numbers,
and words not present in the English dictionary, and conversion
of all characters to lower case.</p>
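      <p>A minimal sketch of this pre-processing step, assuming a small in-memory word set that stands in for a full English dictionary (the actual dictionary used is not specified here):</p>
      <preformat>
```python
import re

# Tiny stand-in for a full English dictionary (an assumption of this sketch).
ENGLISH_WORDS = {"the", "court", "held", "that", "appeal", "was", "dismissed"}

def preprocess(text):
    # Lower-case, then replace special characters and numbers with spaces.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    # Keep only words found in the English dictionary.
    return " ".join(w for w in text.split() if w in ENGLISH_WORDS)

print(preprocess("The Court HELD (2017) that the appeal was dismissed!"))
# → the court held that the appeal was dismissed
```
      </preformat>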
    </sec>
    <sec id="sec-6">
      <title>4.2 Candidate Phrase Generation</title>
      <p>To generate candidates, n-grams with n ranging from 1 to 4 were
created from the text. A standard stop list of common English
words was used to reduce the candidates: if a candidate starts or
ends with a stop word, it is removed. To reduce the candidates
further, an assumption was made that words adjacent to a given
catchphrase will not be catchphrases. The assumption is justified
because catchphrases are identified by removing stop words;
conversely, stop words can be generated by removing
catchphrases. This modification of the stop list was done
simultaneously with generating catchphrases. The method
carries an inherent bias: candidates generated from documents
processed early are chosen according to a smaller stop list, while
those processed late are chosen according to a larger one. To
remove this bias, the documents were chosen in random order
when generating candidates.</p>
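      <p>The n-gram generation and stop-word filtering above can be sketched as follows; the adaptive stop-list expansion and the adjacency assumption are omitted for brevity, and the stop list shown is a small illustrative subset:</p>
      <preformat>
```python
# Small illustrative subset of a standard English stop list.
STOP_WORDS = {"the", "of", "in", "a", "an", "to", "was", "and"}

def candidate_phrases(tokens, max_n=4):
    # Generate n-grams (n = 1..max_n) and drop any candidate that
    # starts or ends with a stop word.
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            candidates.add(" ".join(gram))
    return candidates

cands = candidate_phrases("the writ of habeas corpus was granted".split())
# "habeas corpus" and "writ of habeas corpus" survive; "writ of" does not.
```
      </preformat>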
    </sec>
    <sec id="sec-6a">
      <title>4.3 Creating Vector Representations</title>
      <p>Word vector representations were created using the Google News
word2vec model. For phrases containing more than one word, the
constituent word vectors were combined by taking their weighted
average, with the TF-IDF scores of the constituent words as
weights.</p>
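      <p>The phrase-vector construction can be sketched as follows; the random embeddings and TF-IDF scores below are stand-ins for the pretrained Google News word2vec vectors and real corpus statistics:</p>
      <preformat>
```python
import numpy as np

# Stand-in 300-d embeddings; the actual system uses the pretrained
# Google News word2vec model.
rng = np.random.default_rng(0)
EMB = {w: rng.standard_normal(300) for w in ("writ", "habeas", "corpus")}
TFIDF = {"writ": 0.7, "habeas": 2.1, "corpus": 1.9}  # hypothetical scores

def phrase_vector(phrase):
    # Weighted average of the constituent word vectors,
    # using the words' TF-IDF scores as weights.
    words = [w for w in phrase.split() if w in EMB]
    weights = np.array([TFIDF[w] for w in words])
    vecs = np.stack([EMB[w] for w in words])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

v = phrase_vector("habeas corpus")  # a single 300-d vector for the phrase
```
      </preformat>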
    </sec>
    <sec id="sec-7">
      <title>4.4 Training the Model</title>
      <p>Long Short-Term Memory (LSTM) units were used because text is
a sequential input: words used earlier can affect words used later
in the text. The Keras framework on top of a TensorFlow backend
was used to build the model. The model used 100 LSTM units,
dropout was set to 0.5, and a dense layer was added at the end to
combine the outputs of the units into a probability.</p>
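      <p>The forward computation of such a classifier (an LSTM layer with 100 units followed by a dense sigmoid layer) can be sketched in plain NumPy. This illustrates the architecture rather than reproducing the authors' Keras code: the weights here are random, and the 0.5 dropout applies only during training, so it is omitted from this inference sketch:</p>
      <preformat>
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, H = 300, 100  # word-vector size, number of LSTM units
rng = np.random.default_rng(1)
Wx = 0.01 * rng.standard_normal((4 * H, D))  # input weights for i, f, g, o gates
Wh = 0.01 * rng.standard_normal((4 * H, H))  # recurrent weights
b = np.zeros(4 * H)
w_out, b_out = 0.01 * rng.standard_normal(H), 0.0  # dense output layer

def lstm_classify(x_seq):
    # Run the LSTM over the sequence of word vectors, then map the final
    # hidden state through the dense sigmoid layer to a probability.
    h, c = np.zeros(H), np.zeros(H)
    for x in x_seq:
        z = Wx @ x + Wh @ h + b
        i = sigmoid(z[:H])            # input gate
        f = sigmoid(z[H:2 * H])       # forget gate
        g = np.tanh(z[2 * H:3 * H])   # candidate cell state
        o = sigmoid(z[3 * H:])        # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return sigmoid(w_out @ h + b_out)

p = lstm_classify(rng.standard_normal((4, D)))  # probability for a 4-word phrase
```
      </preformat>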
    </sec>
    <sec id="sec-8">
      <title>5. EVALUATION RESULTS</title>
      <p>The proposed method achieved a mean average precision of 0.0931
and an overall recall of 0.0988. The precision could probably be
improved by using a different model. Although the results are
not very good, this does not rule out the possibility of using deep
learning for the task.</p>
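      <p>For reference, one common formulation of the per-document average precision underlying the mean average precision reported above can be sketched as follows (the catchphrases shown are hypothetical):</p>
      <preformat>
```python
def average_precision(ranked, gold):
    # Precision@k, accumulated at each rank k where a gold catchphrase
    # is retrieved, then normalized by the number of gold catchphrases.
    hits, total = 0, 0.0
    for k, phrase in enumerate(ranked, start=1):
        if phrase in gold:
            hits += 1
            total += hits / k
    return total / len(gold) if gold else 0.0

ap = average_precision(["habeas corpus", "court fee", "natural justice"],
                       {"habeas corpus", "natural justice"})
# hits at ranks 1 and 3: (1/1 + 2/3) / 2 = 5/6
```
      </preformat>
      <p>Mean average precision is then the mean of this quantity over all test documents.</p>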
    </sec>
    <sec id="sec-9">
      <title>6. CONCLUSION</title>
      <p>Catchphrases present a summary of a legal text and are very
useful for practitioners. They can be used to implement a
document retrieval system, since they can serve as a
representation of the required document. This working note
presents an extraction system using an LSTM network. The results
are poor, but LSTMs are suited to the task at hand because of the
sequential nature of text, and hence should be explored further.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Mandal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bhattacharya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Pal</surname>
            and
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ghosh</surname>
          </string-name>
          .
          <article-title>Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD)</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India, December 8-10,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org, 2017.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Chau Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tuoi T.</given-names>
            <surname>Phan</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>An ontology-based approach for key phrase extraction</article-title>
          .
          <source>In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing: Short Papers</source>
          , pages
          <fpage>181</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ken</given-names>
            <surname>Barker</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nadia</given-names>
            <surname>Cornacchia</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Using noun phrase heads to extract document keyphrases</article-title>
          .
          <source>In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , Gordon W. Paynter, Eibe Frank, Carl Gutwin, and
          <string-name>
            <surname>Craig</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Nevill-Manning</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>KEA: Practical automatic keyphrase extraction</article-title>
          .
          <source>In Proceedings of the 4th ACM Conference on Digital Libraries</source>
          , pages
          <fpage>254</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Maria</given-names>
            <surname>Grineva</surname>
          </string-name>
          , Maxim Grinev, and
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Lizorkin</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Extracting key terms from noisy and multitheme documents</article-title>
          .
          <source>In Proceedings of the 18th International Conference on World Wide Web</source>
          , pages
          <fpage>661</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Tarau</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>TextRank: Bringing order into texts</article-title>
          .
          <source>In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>404</fpage>
          -
          <lpage>411</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Zhiyuan</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yabin</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maosong</given-names>
            <surname>Sun</surname>
          </string-name>
          . 2009b.
          <article-title>Clustering to find exemplar terms for keyphrase extraction</article-title>
          .
          <source>In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>257</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Takashi</given-names>
            <surname>Tomokiyo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Hurst</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>A language model approach to keyphrase extraction</article-title>
          .
          <source>In Proceedings of the ACL Workshop on Multiword Expressions</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Xiaojun</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianwu</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jianguo</given-names>
            <surname>Xiao</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction</article-title>
          .
          <source>In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics</source>
          , pages
          <fpage>552</fpage>
          -
          <lpage>559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Brin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Page</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>The anatomy of a large-scale hypertextual Web search engine</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>30</volume>
          (
          <issue>1-7</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Yunhua Hu, and
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>A ranking approach to keyphrase extraction</article-title>
          .
          <source>In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>756</fpage>
          -
          <lpage>757</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Kazi Saidul</given-names>
            <surname>Hasan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Automatic Keyphrase Extraction: A Survey of the State of the Art</article-title>
          .
          <source>In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pages
          <fpage>1262</fpage>
          -
          <lpage>1273</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>