<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CUSAT NLP@AILA-FIRE2019: Similarity in Legal Texts using Document Level Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sara Renjit</string-name>
          <email>sararenjit.g@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sumam Mary Idicula</string-name>
          <email>sumam@cusat.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Professor, Dept. of Computer Science</institution>
          ,
          <addr-line>CUSAT, Kochi-682022</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Scholar, Dept. of Computer Science</institution>
          ,
          <addr-line>CUSAT, Kochi-682022</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text retrieval now plays a role in almost every domain of knowledge understanding. It has applications in the legal field, where there is an extensive collection of structured and unstructured texts. Artificial Intelligence is now applied in this area to understand and retrieve legal documents. This paper describes a working model developed for the track Artificial Intelligence for Legal Assistance at the Forum for Information Retrieval Evaluation, 2019 (AILA-FIRE2019). We use an embedding model to represent these legal texts in a semantic vector space. The similarity between the resulting document embeddings is computed using cosine similarity. The corpus used for building the embedding models is the dataset provided in AILA-FIRE2019.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal texts</kwd>
        <kwd>Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A large amount of legal content is now available from online sources. The
rapid shift towards handling legal documents and related information in
electronic form has increased the need for artificial intelligence in this
domain. Interfaces that provide a semantic understanding of texts would help
non-specialists as well as legal practitioners to understand statutes and
related prior cases. A layperson can gain basic knowledge about legal proceedings
by looking through the most relevant precedent cases and statutes retrieved for
each query situation.</p>
      <p>The rest of this paper is organized as follows: Section 2 presents related work
in the area of document retrieval. The task description and dataset details are
provided in Section 3. Section 4 explains the methodology used. Section 5 covers
the experimental details and evaluation results. Finally, Section 6 concludes the
work with some future improvements.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        Text retrieval, also called document retrieval, matches a query against a set of
text records, either structured or unstructured. Retrieval methods are largely
based on similarity measures, and various clustering methods have also been
employed. Graph-based approaches have been used to find document similarity
using modified shortest-path graph kernels [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A dual embedding space model
is proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where both input and output word vector embeddings are
used for word similarity estimation in document ranking. Text retrieval in the
legal domain is an emerging research topic, and only a few works have
studied it. A legal document retrieval system was developed for the general public
in China using Normalized Google Distance to find the semantic relatedness
between layman terms and legal terms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Task Description &amp; Dataset Details</title>
      <p>The AILA track consists of two tasks, namely precedent retrieval (Task 1) and
statute retrieval (Task 2). Task 1 aims at finding relevant previous case
documents (precedents) for a situation described in the query document. Task 2
retrieves the most similar statutes for a scenario given in the query.</p>
      <p>
        The table below shows the statistics of the dataset for both evaluations. It
is a collection of Indian legal documents, i.e., statutes in India and prior cases
decided by Indian courts of law [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        We use a document embedding model with cosine similarity to measure
the similarity between documents. Paragraph vector [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an unsupervised
algorithm that learns feature representations from sentences and texts, serves
as the document embedding model in this context.
      </p>
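      <p>Paragraph vectors are inspired by the architecture for learning word vector
representations. Each paragraph is assigned a document or paragraph id and is
mapped to a column vector in a matrix D, and each word in a paragraph is
represented by a column vector in a matrix W. The concatenation or sum of these
vectors serves as the feature set for predicting the next word in the sentence.
For a sequence of words
<inline-formula><tex-math>w_1, w_2, \ldots, w_T</tex-math></inline-formula>,
the word vector model maximizes the average log probability
<disp-formula><label>(1)</label><tex-math>\frac{1}{T}\sum_{t=k}^{T-k}\log p(w_t \mid w_{t-k},\ldots,w_{t+k})</tex-math></disp-formula>
Prediction is done using a softmax classifier:
<disp-formula><label>(2)</label><tex-math>p(w_t \mid w_{t-k},\ldots,w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}</tex-math></disp-formula>
where each
<inline-formula><tex-math>y_i</tex-math></inline-formula>
is the unnormalized log probability for output word i, computed as
<disp-formula><label>(3)</label><tex-math>y = b + U\,h(w_{t-k},\ldots,w_{t+k}; W)</tex-math></disp-formula>
where U and b are the softmax parameters, and h is constructed by a concatenation
or average of word vectors and paragraph vectors extracted from W and D
respectively.</p>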
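      <p>The prediction step described above can be illustrated with a minimal sketch in plain Python. It assumes the averaging variant of h, stores each word vector as a row of a nested list rather than as a matrix column, and uses small made-up parameter values; the names <monospace>predict</monospace>, <monospace>average</monospace> and <monospace>softmax</monospace> are illustrative and are not taken from the system described in this paper.</p>
      <preformat><![CDATA[
```python
import math

def softmax(scores):
    """Turn unnormalized log probabilities y_i into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average(vectors):
    """h(...): element-wise average of the context vectors (assumed variant)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def predict(context_ids, paragraph_vec, W, U, b):
    """Return p(w | context) for every word in the vocabulary."""
    h = average([W[i] for i in context_ids] + [paragraph_vec])
    # y = b + U h, one score per output word
    y = [b_i + sum(u * x for u, x in zip(row, h)) for row, b_i in zip(U, b)]
    return softmax(y)

# Toy parameters (hypothetical values): 3-word vocabulary, 2-dimensional vectors.
W = [[0.1, 0.3], [0.4, -0.2], [-0.5, 0.2]]  # one word vector per vocabulary entry
U = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.1]]  # softmax weights, one row per output word
b = [0.0, 0.1, -0.1]                        # softmax biases
d = [0.05, -0.05]                           # paragraph vector taken from D

probs = predict([0, 1], d, W, U, b)
```
]]></preformat>
      <p>Here <monospace>probs</monospace> holds one probability per vocabulary word, and the entries sum to one, as required by the softmax in Equation (2).</p>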
      <p>The paragraph vectors and word vectors are averaged or concatenated to
predict the next word in a context. Both are
trained using stochastic gradient descent, after which the paragraph vector for a new
paragraph can be inferred. The two stages of the paragraph vector algorithm are:
1) training the word vectors W, the softmax weights U, b, and the paragraph vectors D on
already seen paragraphs; 2) an inference step that derives paragraph vectors for new
paragraphs by adding more columns to D and performing gradient descent on D while keeping
W, U, b fixed.</p>
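      <p>The inference stage can be sketched as follows. This is a simplified illustration, not the gensim implementation: it assumes the averaging variant of h, a full softmax, and hypothetical toy values for W, U and b. Only the new paragraph vector is updated; W, U and b are read but never modified. The function names are illustrative.</p>
      <preformat><![CDATA[
```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(d, context_ids, target, W, U, b):
    """Negative log-likelihood of `target` and its gradient w.r.t. d only."""
    vectors = [W[i] for i in context_ids] + [d]
    n = len(vectors)
    h = [sum(column) / n for column in zip(*vectors)]
    y = [b_i + sum(u * x for u, x in zip(row, h)) for row, b_i in zip(U, b)]
    p = softmax(y)
    err = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]
    # dL/dh = U^T (p - onehot); d contributes to h with weight 1/n.
    grad_h = [sum(err[i] * U[i][j] for i in range(len(U))) for j in range(len(h))]
    grad_d = [g / n for g in grad_h]
    return -math.log(p[target]), grad_d

def infer_paragraph_vector(examples, W, U, b, dim, steps=50, lr=0.5):
    """Gradient descent on a new paragraph vector with W, U, b kept fixed."""
    d = [0.0] * dim
    for _ in range(steps):
        for context_ids, target in examples:
            _, grad = loss_and_grad(d, context_ids, target, W, U, b)
            d = [x - lr * g for x, g in zip(d, grad)]
    return d

# Hypothetical trained parameters and two (context, target) windows
# taken from an unseen paragraph.
W = [[0.1, 0.3], [0.4, -0.2], [-0.5, 0.2]]
U = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.1]]
b = [0.0, 0.1, -0.1]
examples = [([0, 1], 2), ([1, 2], 0)]

d_new = infer_paragraph_vector(examples, W, U, b, dim=2)
```
]]></preformat>
      <p>Because the word vectors and softmax parameters stay fixed, the inferred vector for a new document lives in the same space as the vectors learned during training, which is what makes the later comparison between documents meaningful.</p>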
      <p>There are two variations of this model. PV-DM (Distributed Memory Model
of Paragraph Vectors) considers word order by concatenating the
word vectors and the paragraph vector. The paragraph vector supplies the context
information missing from the window and acts as a memory of the semantic content of the paragraph.
PV-DBOW (Distributed Bag of Words model of Paragraph Vector) neglects word
ordering by predicting words randomly sampled from the paragraph. Since
word ordering is essential to understanding the semantics of a text, the distributed
memory model is preferred when training the paragraph vector model. Vector
representations for documents are inferred from this model and are then compared
using cosine similarity.</p>
      <p>Cosine similarity [8] measures the cosine of the angle between two vectors in a
multidimensional space. The dot product of two vectors A and B satisfies
<inline-formula><tex-math>A \cdot B = \|A\|\,\|B\|\cos\theta</tex-math></inline-formula>,
so the cosine similarity of A and B is defined as:
<disp-formula><label>(4)</label><tex-math>\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}</tex-math></disp-formula>
Because this metric is based on an angle, it captures the orientation of document
vectors rather than their magnitude. Documents with similar orientations can be
considered similar, and if the document vectors are built so that they capture
the semantics of their constituent words, then similar orientations also
indicate semantic similarity.</p>
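      <p>As a small illustration of the cosine similarity definition, the sketch below computes it for plain Python lists. The function name and the example vectors are chosen for illustration only.</p>
      <preformat><![CDATA[
```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # opposite direction
```
]]></preformat>
      <p>Parallel vectors score close to 1 regardless of their length, orthogonal vectors score 0, and opposite vectors score close to -1, which reflects that only orientation matters.</p>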
    </sec>
    <sec id="sec-4">
      <title>Experiments &amp; Results</title>
      <p>A workflow of the proposed system is presented in Figure 2. The collections of
precedent documents, statutes, and query cases are provided as text files.
The first step is to tokenize each document into a collection of tokens
or words; each collection is then labelled with a tag and a number that uniquely
identify the document. This label plays the role of the paragraph id in the paragraph vector model
discussed above.</p>
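      <p>The tokenization and labelling step might look like the sketch below. The exact tokenizer and tag format used in the system are not stated in this paper, so the lowercase alphanumeric tokenizer, the tag prefixes, and the sample sentences are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import re
from collections import namedtuple

# A document is a list of tokens plus a unique tag (the paragraph id).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tag"])

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tag_collection(texts, prefix):
    """Label each document with a prefix and a running number, e.g. C_1, C_2."""
    return [TaggedDoc(tokenize(text), f"{prefix}_{i}")
            for i, text in enumerate(texts, start=1)]

# Hypothetical miniature collections standing in for the text files.
precedents = tag_collection(
    ["The appeal was dismissed.", "Bail was granted by the High Court."], "C")
queries = tag_collection(["The accused seeks bail."], "Q")
corpus = precedents + queries
```
]]></preformat>
      <p>Keeping the tag next to the tokens makes it possible to look up the vector of a specific precedent, statute, or query after the embedding model has been trained.</p>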
      <p>In the next phase, the doc2vec model from the gensim library is trained on
the labelled precedents, queries, and statutes. The doc2vec model is trained on the
precedent and query document collections for five epochs with the embedding
dimension set to 100. This model is then used to obtain the vector representation
of each document for the precedent retrieval task. The statute collection is also added
to the existing doc2vec corpus to obtain vector representations for Task 2. In
the final stage, the vector representations obtained for each case document,
precedent, and statute are compared using cosine similarity.</p>
      <p>For Task 1, the cosine similarity between each present case situation and every precedent
is calculated, and the scores are sorted and ranked. Similarly, for Task 2, the cosine similarity between
each present case situation and every statute is computed, and the similarity scores are
sorted and ranked. The final result is a ranked list of documents for each case
scenario presented in the query document.</p>
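      <p>The scoring and ranking step can be sketched as below, assuming the document vectors are already available as plain lists keyed by their tags. The tags and the two-dimensional vectors are made-up values; the system described here uses 100-dimensional doc2vec vectors.</p>
      <preformat><![CDATA[
```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidates):
    """Score every candidate against the query and sort by descending similarity."""
    scored = [(tag, cosine_similarity(query_vec, vec))
              for tag, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical vectors for one query and three precedents.
query = [1.0, 0.0]
candidates = {"C_1": [0.0, 1.0], "C_2": [2.0, 0.1], "C_3": [1.0, 1.0]}

ranking = rank_candidates(query, candidates)
ranked_tags = [tag for tag, _ in ranking]
```
]]></preformat>
      <p>The same function serves both tasks: for Task 1 the candidates are precedent vectors, and for Task 2 they are statute vectors.</p>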
      <p>
        The proposed system uses 50 cases as query documents, of which ten
queries were used for pre-evaluation and 40 for the final evaluation. TREC
Evaluation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is used to evaluate the system performance. The evaluation measures
used are:
        <list list-type="bullet">
          <list-item><p>Precision@10: the precision (percentage of retrieved documents that are relevant) after ten documents have been retrieved, averaged over all queries.</p></list-item>
          <list-item><p>Mean Average Precision: the mean, over all queries, of the average precision, where the average precision of a query is the average of the precision values P@K taken at each rank K at which a relevant document is retrieved.</p></list-item>
          <list-item><p>Binary Preference: a preference-based measure of whether judged relevant documents are retrieved ahead of judged irrelevant documents.</p></list-item>
          <list-item><p>Reciprocal Rank: the reciprocal of the rank of the first relevant document [7].</p></list-item>
        </list>
      </p>
      <p>The table below shows the results of the proposed system on both evaluations.</p>
      <p>This paper presented a paragraph vector and cosine similarity based document
retrieval approach for legal documents. Legal document retrieval involves
understanding the semantics of legal ontology. Efficient sentence modeling approaches
that can semantically model legal documents can further improve the
performance of this system. Semantic understanding also needs to be improved to
distinguish and understand the similarity between legal terms and layman terms.</p>
      <p>7. Wikipedia contributors: Mean reciprocal rank - Wikipedia, the free
encyclopedia. https://en.wikipedia.org/w/index.php?title=Mean_reciprocal_rank&amp;oldid=872349108 (2018), [Online; accessed 3-September-2019]</p>
      <p>8. Wikipedia contributors: Cosine similarity - Wikipedia, the free
encyclopedia. https://en.wikipedia.org/w/index.php?title=Cosine_similarity&amp;oldid=910391235 (2019), [Online; accessed 3-September-2019]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bhattacharya</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the FIRE 2019 AILA track: Artificial Intelligence for Legal Assistance</article-title>
          .
          <source>In: Proceedings of FIRE 2019 - Forum for Information Retrieval Evaluation</source>
          (December
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>W.L.</given-names>
          </string-name>
          :
          <article-title>A text mining approach to assist the general public in the retrieval of legal documents</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>64</volume>
          (
          <issue>2</issue>
          ),
          <fpage>280</fpage>
          –
          <lpage>290</lpage>
          (
          <year>2013</year>
          ). https://doi.org/10.1002/asi.22767, https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.22767
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>In: International conference on machine learning</source>
          . pp.
          <fpage>1188</fpage>
          –
          <lpage>1196</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Nalisnick</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craswell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Improving document ranking with dual word embeddings</article-title>
          .
          <source>In: Proceedings of the 25th International Conference Companion on World Wide Web</source>
          . pp.
          <fpage>83</fpage>
          –
          <lpage>84</lpage>
          . WWW '16 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (
          <year>2016</year>
          ). https://doi.org/10.1145/2872518.2889361
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nikolentzos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meladianos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rousseau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stavrakas</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vazirgiannis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Shortest-path graph kernels for document similarity</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>1890</fpage>
          –
          <lpage>1900</lpage>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Copenhagen, Denmark (Sep
          <year>2017</year>
          ). https://doi.org/10.18653/v1/D17-1202, https://www.aclweb.org/anthology/D17-1202
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. NIST:
          <article-title>Evaluation scripts for text retrieval conference</article-title>
          , http://trec.nist.gov/trec_eval/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>