Catchphrase Extraction from Legal Documents Using LSTM Networks

Rupal Bhargava, Sukrut Nigwekar, Yashvardhan Sharma
WiSoc Lab, Department of Computer Science
Birla Institute of Technology and Science, Pilani Campus, Pilani-333031
{rupal.bhargava1, f20150292, yash3}@pilani.bits-pilani.ac.in

ABSTRACT
Legal texts usually have a complex structure, and reading through them is a time-consuming and strenuous task. Hence it is essential to provide legal practitioners with a concise representation of the text. Catchphrases are those phrases that state the important issues present in the text, thus effectively characterizing it. This paper proposes an approach for Subtask 1 of the IRLeD (Information Retrieval from Legal Documents) task at FIRE 2017. The proposed algorithm uses a four-phase pipelined approach for extracting catchphrases from legal documents.

CCS Concepts
• Information systems → Retrieval tasks and goals • Information systems → Information extraction

Keywords
Keyword Extraction; Legal Documents; Deep Learning; LSTM; Natural Language Processing; Information Retrieval

1. INTRODUCTION
A prior case (also called a precedent) is an older court case related to the current case, which discusses similar issue(s) and which can be used as a reference in the current case. If an ongoing case has any related or relevant legal issue(s) that has already been decided, then the court is expected to follow the interpretations made in the prior case. For this purpose, it is critical for legal practitioners to find and study previous court cases, so as to examine how the ongoing issues were interpreted in the older cases.

Generally, legal texts (e.g., court case descriptions) are long and have complex structures. This makes their thorough reading time-consuming and strenuous. It is therefore essential for legal practitioners to have a concise representation of the core legal issues described in a legal text. One way to list the core legal issues is by keywords or key phrases, which are known as "catchphrases" in the legal domain.

To address this issue, FIRE 2017 organized a task to extract catchphrases from legal documents. The task was: given a training set of documents and their corresponding catchphrases, extract catchphrases from new documents.

The rest of the paper is organized as follows. Section 2 explains related work done in past years. Section 3 describes the dataset provided by the IRLeD 2017 organizers. Section 4 explains the proposed technique. Section 5 elaborates on the evaluation and error analysis. Section 6 concludes the paper and presents future work.

2. RELATED WORK
Various techniques have been used for the task of keyword extraction [12]. They are broadly divided into supervised, unsupervised, and heuristic-based approaches. Supervised learning approaches train a classifier on documents annotated with keyphrases to determine whether a candidate phrase is a keyphrase (Witten et al., 1999; Frank et al., 1999) [4]. Another approach is to build a ranker for keyword ranking (Jiang et al., 2009) [11].

Unsupervised techniques can be categorized into four groups. Graph-based ranking builds a graph from the input document and ranks its nodes according to their importance using a ranking method (e.g., Brin and Page (1998)) [10]. Topic-based clustering groups the candidates into topics so that each topic is composed only of related candidates (Grineva et al., 2009) [5]. Simultaneous learning is based on the assumption that important words occur in important sentences, and a sentence is important if it contains important words (Wan et al. (2007)) [9]. Language modeling scores keywords based on two features, namely, phraseness and informativeness (Tomokiyo and Hurst (2003)) [8].

Typical heuristics include (1) using a stop word list to remove stop words (Liu et al., 2009b) [7], (2) allowing words with certain part-of-speech tags (e.g., nouns, adjectives, verbs) to be candidate keywords (Mihalcea and Tarau, 2004) [6], (3) allowing n-grams that appear in Wikipedia article titles to be candidates (Grineva et al., 2009) [5], and (4) extracting n-grams (Witten et al., 1999) [4] or noun phrases (Barker and Cornacchia, 2000) [3] that satisfy pre-defined lexico-syntactic pattern(s) (Nguyen and Phan, 2009) [2].

3. DATASET DESCRIPTION
The dataset provided by the organizers [1] contained two sets of legal texts: training and testing. The training set was accompanied by the catchphrases corresponding to each text. The given catchphrases mainly consisted of words present in the text and rarely included phrases that were not present in the document.


4. PROPOSED TECHNIQUE
The problem is formulated as a classification task, and the objective is to learn a classifier using an LSTM network. The proposed methodology involves a pipelined approach and is divided into four phases:

• Pre-processing
• Candidate phrase generation
• Creating vector representations for the phrases
• Training an LSTM network

4.1 Pre-Processing
The legal texts were pre-processed in order to ensure uniformity. Pre-processing included removal of special characters, numbers, and words not present in the English dictionary, as well as converting all characters to lower case.

4.2 Candidate Phrase Generation
To generate candidates, n-grams with n in the range 1 to 4 were created from the text. A standard stop list of common English words was used to reduce the candidates: if a candidate starts or ends with a stop word, it is removed. To reduce the candidates further, an assumption was made that words adjacent to a given catchphrase will not be catchphrases. The assumption is justified because catchphrases are identified by removing stop words; conversely, stop words can be generated by removing catchphrases. This modification to the stop list was done simultaneously with generating catchphrases. The method carries an inherent bias: candidates generated from documents processed early are chosen according to a smaller stop list, while those processed late are chosen according to a larger list. To remove this bias, the documents were chosen randomly when generating candidates.

4.3 Creating Vector Representation
Word vector representations were created using the Google News word2vec model. For phrases containing more than one word, word vectors were combined by taking their weighted average, with the weights being the TF-IDF scores of the constituent words.

4.4 Training the Model
Long Short-Term Memory units were used because text is considered a continuous input: words used earlier can affect words used later in the text. The Keras framework on top of a TensorFlow backend was used to build the model. The number of LSTM units in the model was 100, dropout was set to 0.5, and a dense layer was added at the end to combine the outputs of the units into a probability.

5. EVALUATION RESULTS
The proposed method achieved a mean average precision of 0.0931 and an overall recall of 0.0988. The precision could probably be improved by using a different model. Although the results are not very good, this does not rule out the possibility of using deep learning for the task.

6. CONCLUSION
Catchphrases present a summary of a legal text and are very useful for practitioners. They can be used to implement a document retrieval system, as they can serve as a representation of the document needed. This working note presents an extraction system using an LSTM network. The results are poor, but LSTMs are suited to the task at hand because of the continuous nature of text, and hence should be explored further.

References
[1] Mandal, K. Ghosh, A. Bhattacharya, A. Pal and S. Ghosh. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working Notes of FIRE 2017 – Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.
[2] Chau Q. Nguyen and Tuoi T. Phan. 2009. An ontology-based approach for key phrase extraction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing: Short Papers, pages 181–184.
[3] Ken Barker and Nadia Cornacchia. 2000. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 40–52.
[4] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical automatic keyphrase extraction. In Proceedings of the 4th ACM Conference on Digital Libraries, pages 254–255.
[5] Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. 2009. Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th International Conference on World Wide Web, pages 661–670.
[6] Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.
[7] Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun. 2009b. Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 257–266.
[8] Takashi Tomokiyo and Matthew Hurst. 2003. A language model approach to keyphrase extraction. In Proceedings of the ACL Workshop on Multiword Expressions, pages 33–40.
[9] Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 552–559.
[10] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117.
[11] Xin Jiang, Yunhua Hu, and Hang Li. 2009. A ranking approach to keyphrase extraction. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 756–757.
[12] Kazi Saidul Hasan and Vincent Ng. 2014. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1262–1273.
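The candidate-generation step of Section 4.2 (1- to 4-grams, with candidates discarded if they start or end with a stop word) can be sketched as follows. The function name `generate_candidates` and the tiny stop list are illustrative stand-ins, not the authors' actual code; the paper uses a standard English stop list.

```python
def generate_candidates(text, stop_words, max_n=4):
    """Generate 1- to 4-gram candidates from a text, dropping any
    candidate that starts or ends with a stop word (cf. Section 4.2)."""
    tokens = text.lower().split()
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Discard candidates whose first or last word is a stop word.
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            candidates.add(" ".join(gram))
    return candidates

# Tiny stand-in stop list for illustration only.
stop = {"the", "of", "a", "in"}
print(sorted(generate_candidates("the writ of habeas corpus", stop)))
# → ['corpus', 'habeas', 'habeas corpus', 'writ',
#    'writ of habeas', 'writ of habeas corpus']
```

Note that interior stop words are allowed ("writ of habeas" survives), which matches the boundary-only filtering rule the paper describes.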
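The phrase-vector construction of Section 4.3 (a weighted average of word2vec vectors, weighted by the words' TF-IDF scores) can be sketched like this. The 3-dimensional toy embeddings and hand-picked weights stand in for the 300-dimensional Google News model and real corpus statistics; `phrase_vector` is a hypothetical helper name.

```python
def phrase_vector(phrase, embeddings, tfidf):
    """TF-IDF-weighted average of the constituent word vectors
    (cf. Section 4.3)."""
    words = [w for w in phrase.split() if w in embeddings]
    total = sum(tfidf[w] for w in words)
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for w in words:
        for j, x in enumerate(embeddings[w]):
            vec[j] += tfidf[w] * x
    return [x / total for x in vec]

# Toy 3-d embeddings and TF-IDF weights, for illustration only.
emb = {"habeas": [1.0, 0.0, 0.0], "corpus": [0.0, 1.0, 0.0]}
weights = {"habeas": 3.0, "corpus": 1.0}
print(phrase_vector("habeas corpus", emb, weights))  # → [0.75, 0.25, 0.0]
```

The higher-weighted word ("habeas") dominates the resulting phrase vector, which is the intended effect of TF-IDF weighting: rarer, more informative words contribute more.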
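One way to read the overall recall reported in Section 5 is as the fraction of gold catchphrases recovered, pooled over all documents. The sketch below is an assumption about that metric (the task overview [1] defines the official measures); `overall_recall` and the toy data are purely illustrative.

```python
def overall_recall(extracted, gold):
    """Fraction of gold catchphrases found among the extracted ones,
    pooled over all documents (one list per document)."""
    hits = sum(len(set(e) & set(g)) for e, g in zip(extracted, gold))
    total = sum(len(set(g)) for g in gold)
    return hits / total

# Hypothetical toy data: two documents, three gold catchphrases in total.
gold = [["habeas corpus", "bail"], ["negligence"]]
extracted = [["habeas corpus", "writ"], ["damages"]]
print(overall_recall(extracted, gold))  # 1 of 3 gold phrases found
```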