<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Two-Step Method based on Embedding and Clustering to Identify Regularities in Legal Case Judgements</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Graziella De Martino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianvito Pio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Ceci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Big Data Laboratory</institution>
          ,
          <addr-line>National Interuniversity Consortium for Informatics, Via Ariosto, 25, 00185, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science - University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via Orabona, 4, 70125, Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Knowledge Technologies, Jožef Stefan Institute</institution>
          ,
          <addr-line>Jamova cesta 39, 1000 Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>In an era characterized by rapid technological progress that introduces new scenarios every day, working in the law field may appear very difficult if not supported by the right tools. In this paper, we discuss a recently submitted work that proposes a novel method, called PRILJ, that identifies paragraph regularities in legal case judgments to support legal experts during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the most similar paragraphs with respect to the paragraphs of a document under preparation. Our extensive experimental evaluation, performed on a real-world dataset, proves the effectiveness and the efficiency of the proposed method even if documents contain noisy data.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal Information Retrieval</kwd>
        <kwd>Embedding</kwd>
        <kwd>Clustering</kwd>
        <kwd>Approximate Nearest Neighbor Search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The legal sector is generally characterized by a slow response to the new scenarios that appear
every day in modern society. In this context, Artificial Intelligence (AI) methods can support
the design of advanced (also automated) solutions to improve the efficiency of the processes
in this field. Among the attempts in this direction, we can mention the work presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
where the authors applied AI techniques to measure the similarity among legal case documents,
which can be useful to speed up the identification and analysis of judicial precedents. Another
relevant example is the work in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where the authors consider the semi-automation of some
legal tasks, such as the prediction of judicial decisions of the European Court of Human Rights.
      </p>
      <p>Following this line of research, in this discussion paper, we describe a novel method, called
PRILJ, that identifies paragraph regularities in legal case judgments, to support legal experts
during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach
that first groups documents into clusters, according to their semantic content, and then identifies
regularities in the paragraphs of each cluster. Embedding-based methods are adopted to
properly represent documents and paragraphs in a semantic numerical feature space, and
an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the most
similar paragraphs. Therefore, given a (possibly incomplete or under preparation) document,
henceforth called the target document, PRILJ supports the retrieval of similar paragraphs appearing
in a set of reference documents related to previously transcribed legal case judgments.</p>
      <p>
        Document clustering has received a lot of attention from the research community but, together
with the design of advanced algorithms [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ], the most critical aspect lies in the design of
a proper representation of the objects/items at hand [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], as well as of suitable similarity measures.
In the literature, we can find several document similarity measures implemented through a)
network-based approaches [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], b) text-based methods [
        <xref ref-type="bibr" rid="ref1 ref11">11, 1</xref>
        ], or c) hybrid approaches [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>In this context, PRILJ has the main advantage of properly combining embedding methods,
to capture the semantics, with a two-step approach, which consists of learning a different
representation for each group of documents rather than one single model. This aspect allows us to
capture the peculiarities of paragraphs according to the specific topic represented by each cluster
of documents.</p>
      <p>Our extensive experimental evaluation, performed on a real-world dataset, proves the
effectiveness and the efficiency of the proposed method. In particular, its ability to model different
topics of legal documents, as well as to capture the semantics of the textual content, appears
very beneficial for the considered task, and makes PRILJ very robust to the possible presence of
noise in the data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>Before describing PRILJ, in the following, we provide some useful definitions:
• Training set T: a collection of legal judgments, represented as textual documents,
adopted to train our models;
• Reference set R: a collection of legal judgments, represented as textual documents,
from which we are interested in identifying paragraph regularities;
• Target document d: a legal judgment (possibly under preparation) for which we are
interested in identifying paragraph regularities from R.</p>
      <p>The training set T and the reference set R may fully (or partially) overlap, i.e., T = R (or
T ∩ R ≠ ∅), namely, the set of documents adopted to train our models may be the same as
(or overlap with) the collection from which we want to identify paragraph regularities with
respect to the target document. Note that PRILJ is fully unsupervised and the target document
d is never contained in either the training set or the reference set (i.e., d ∉ (T ∪ R)).</p>
      <p>The three phases of PRILJ are detailed in the following subsections.</p>
      <sec id="sec-2-1">
        <title>2.1. Training phase</title>
        <p>As shown in Fig. 1, PRILJ starts with the application of some pre-processing steps to the
documents of the training set. Specifically, the pre-processing consists of: i) lowercasing the text, ii)
removing punctuation and digits, iii) applying lemmatization, and iv) removing rare words. The
pre-processed documents are then used to train a document embedding model, which is
subsequently exploited to represent each document of the training set in the latent feature space,
obtaining the set of embedded training documents. Such documents are then partitioned
into k clusters by adopting the k-means clustering algorithm. Each cluster of
documents becomes the input for a further learning step at the paragraph level: documents
falling in the same cluster will contribute to the learning of a specific paragraph embedding
model. Algorithmically, for each of the k document clusters, we extract the paragraphs
(i.e., sentences delimited by a full stop) from the documents falling into it and train a
cluster-specific paragraph embedding model. This approach allows us to learn more specific paragraph embedding
models, according to the topic possibly represented by the identified clusters.</p>
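<p>The training phase described above can be sketched as follows; this is a minimal illustration assuming the documents have already been embedded as vectors (PRILJ obtains these via Word2Vec/Doc2Vec), with a toy k-means standing in for the actual clustering implementation; all names are placeholders, not from the paper.</p>

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means: returns (centroids, cluster assignments)."""
    # deterministic init for the sketch: k points spread across the dataset
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each embedded document to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    return centroids, assign

# toy "document embeddings": two well-separated groups of five documents
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 4)), rng.normal(5.0, 0.1, (5, 4))])
centroids, assign = kmeans(X, k=2)

# second step of the two-step approach: the documents of each cluster
# would feed a separate, cluster-specific paragraph embedding model
clusters = {i: np.where(assign == i)[0].tolist() for i in range(2)}
```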
        <p>
          The embedding models, both at the document level and at the paragraph level, are learned
by PRILJ through neural network architectures based on Word2Vec Continuous-Bag-of-Words
(CBOW) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] or Doc2Vec [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] Distributed Memory (PV-DM). This choice
is motivated by the fact that previous works demonstrated the superiority of Word2Vec and
Doc2Vec over classical counting-based approaches, since they take into account both the syntax
and the semantics of the text [
          <xref ref-type="bibr" rid="ref1 ref12">12, 1</xref>
          ]. In addition, their ability to catch the semantics and the context
of single words and paragraphs allows them to properly represent new (previously unseen)
documents whose features have not been explicitly observed during the training phase.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Paragraph embedding of the reference set</title>
        <p>In Fig. 2, we show the workflow followed by PRILJ to represent the paragraphs of the documents
belonging to the reference set in a latent feature space. Analogously to the training phase,
we pre-process the documents of the reference set. Then, each document of the reference
set is embedded using the previously learned document embedding model. The embedded
representation of the document is then used to identify the closest document cluster, which
corresponds to the optimal paragraph embedding model. We stress the fact that PRILJ
performs this step to identify the most proper paragraph embedding model to represent the
paragraphs of a given document.</p>
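<p>The model-selection step can be sketched as follows; a minimal illustration assuming the cluster centroids and the per-cluster paragraph models are already available (the names <italic>centroids</italic> and <italic>paragraph_models</italic> are placeholders, not from the paper).</p>

```python
import numpy as np

def select_paragraph_model(doc_embedding, centroids, paragraph_models):
    """Pick the paragraph model of the cluster whose centroid is
    closest (Euclidean distance) to the embedded document."""
    idx = int(np.argmin(np.linalg.norm(centroids - doc_embedding, axis=1)))
    return paragraph_models[idx]

# two cluster centroids and their associated (stub) paragraph models
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
models = ["model_cluster_0", "model_cluster_1"]

# a document embedded near the second centroid selects the second model
chosen = select_paragraph_model(np.array([4.8, 5.1]), centroids, models)
```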
        <p>The set of all the embedded paragraphs of the reference set is finally returned. Paragraph
regularities for a given target document will be identified from such a set.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Identification of paragraph regularities</title>
        <p>The final phase, whose workflow is represented in Fig. 3, starts by following the same steps
mentioned in Sec. 2.2 to represent each paragraph of the target document in the paragraph
embedding space. Specifically, the most proper paragraph embedding model, selected by
identifying the closest document cluster with respect to the target document, is adopted to
embed its paragraphs. For each embedded paragraph, we finally identify the top-n most similar
paragraphs from the set of embedded paragraphs belonging to the reference set.</p>
        <p>
          It is noteworthy that their identification could straightforwardly be based on the computation
of vector-based similarity/distance measures (e.g., cosine similarity, Euclidean distance, etc.)
between the embedded paragraphs of the target document and all the embedded paragraphs
of the reference set. Such a pairwise comparison would be computationally intensive and
would lead to inefficiencies during the adoption of the proposed system in a real-world scenario.
To overcome this issue, we adopt a more advanced method for the identification of the top-n
most similar paragraphs, based on random projections. In particular, we propose an approach
based on Annoy [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], where the idea is to perform an approximated nearest neighbour search
(ANNS) consisting of two phases: index construction on the paragraphs of the reference set,
and search, which occurs when we actually need to identify the top-n most similar paragraphs
with respect to a paragraph of the target document. During the index construction, we build
a set of binary trees, where each tree is built by recursively partitioning the input set of vectors:
at each split, two vectors are randomly selected and a hyperplane equidistant from them is
defined. It is noteworthy that, even if based on a random partitioning, vectors that are close to each other in
the feature space are more likely to appear close to each other in the tree. During the search
process, a priority queue is exploited, and each tree is recursively traversed, where the priority
of each split node is defined according to the distance to the query vector (that is, a paragraph of
the target document, in our case). This process leads to the identification of the leaf nodes
where the query vector falls. The distance between the query vector and the set of vectors falling
into the identified leaves is finally exploited to return the top-n most similar paragraphs [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
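<p>The index-construction idea can be sketched by a single random-projection split; this is a didactic numpy sketch of one Annoy-style hyperplane split, not the actual Annoy implementation (in practice, PRILJ relies on the Annoy library itself).</p>

```python
import numpy as np

def random_split(vectors, rng):
    """One random-projection split: pick two random vectors and partition
    all points by the hyperplane equidistant from them."""
    i, j = rng.choice(len(vectors), size=2, replace=False)
    a, b = vectors[i], vectors[j]
    normal = a - b                        # hyperplane normal direction
    midpoint = (a + b) / 2.0              # the hyperplane passes through here
    side = (vectors - midpoint) @ normal  # signed side of each vector
    # True means "same side as a", i.e. closer to a than to b;
    # nearby vectors tend to land on the same side, as noted in the text
    return side >= 0

rng = np.random.default_rng(42)
vectors = rng.normal(size=(20, 8))  # toy embedded paragraphs
mask = random_split(vectors, rng)   # boolean partition of the 20 vectors
```

Recursing this split on each half, down to small leaves, yields one tree of the forest; the search phase then only compares the query against vectors in the leaves it reaches.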
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>All the experiments were performed using a real-world dataset consisting of 4,181 official public
EU legal documents, provided by EUR-Lex (https://eur-lex.europa.eu/homepage.html), in a
10-fold cross-validation setting. All the documents of the testing set were considered as target
documents, while the reference set was built by constructing 20 replicas of each paragraph of
the documents in the testing set, perturbed by introducing a controlled amount of noise. In
particular, noise was introduced by replacing a given percentage of words of each paragraph
with random words selected from the Oxford dictionary (raw.githubusercontent.com/cduica/
Oxford-Dictionary-Json/master/dicts.json). In our experiments, we considered different levels
of noise, namely, 10%, 20%, 30%, 40%, 50% and 60%, in order to evaluate the robustness of the
proposed approach to different amounts of noise.</p>
      <p>In order to assess the specific contribution of the adopted embedding strategies, we compared
the results obtained through Word2Vec and Doc2Vec with those achieved using a baseline
strategy, i.e., the classical TF-IDF. In all cases, we adopted a 50-dimensional feature vector.
Note that we used 50 features, since it is a commonly used dimensionality in other pre-trained
embedding models. For TF-IDF, we selected the top-50 words showing the highest frequency
across the set of legal judgments.</p>
      <p>We specifically evaluated the contribution of the two-step model implemented in PRILJ
with different numbers of clusters, i.e., k ∈ {√(|T|/2), √|T|, √(|T| · 2)}, where |T| denotes the
number of training documents, and compared the observed performance with that obtained
without grouping training documents into clusters (henceforth denoted as one-step model).</p>
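<p>For concreteness, the three candidate cluster counts can be computed as follows; rounding to the nearest integer is our assumption, and the document count is a hypothetical example, not a figure from the paper.</p>

```python
import math

def candidate_ks(n_docs):
    """The three evaluated cluster counts: sqrt(|T|/2), sqrt(|T|), sqrt(2*|T|),
    rounded to the nearest integer (rounding scheme assumed here)."""
    return [round(math.sqrt(n_docs / 2)),
            round(math.sqrt(n_docs)),
            round(math.sqrt(2 * n_docs))]

# e.g., for a hypothetical training set of 2000 documents:
ks = candidate_ks(2000)  # [32, 45, 63]
```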
      <p>Finally, we evaluated the effectiveness and the efficiency of the approach implemented in PRILJ
for the identification of the top-n most similar paragraphs based on ANNS (with n = 100).
Specifically, we performed an additional comparative analysis against a non-approximated
solution based on the cosine similarity, on a subset of 100 documents randomly selected from
the dataset. This analysis was performed considering the best number of clusters k, and also
focused on evaluating the advantages in terms of computational efficiency.</p>
      <p>As evaluation measures, we collected precision@n, recall@n and f1-score@n, averaged over
the paragraphs of target documents and over the 10 folds, with n ∈ {5, 10, 15, 20, 50, 100}.
Specifically, for each paragraph of a target document in the testing set, we considered as True
Positives the number of correctly retrieved (perturbed) replicas from the reference set. Note that,
in this discussion paper, due to space constraints, we only show the results in terms of f1-score@20.</p>
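<p>These measures can be computed as in the following sketch; a minimal illustration of precision@n, recall@n and f1-score@n for a single query paragraph, under the setting above where each paragraph has a fixed set of relevant (perturbed) replicas in the reference set; the identifiers are illustrative only.</p>

```python
def scores_at_n(retrieved, relevant, n):
    """precision@n, recall@n and f1-score@n for one query paragraph.
    `retrieved` is the ranked result list, `relevant` the set of replicas."""
    top = retrieved[:n]
    tp = sum(1 for item in top if item in relevant)  # true positives
    precision = tp / n
    recall = tp / len(relevant)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# a query paragraph with 20 relevant replicas; 15 of the top-20 hits are correct
relevant = {f"r{i}" for i in range(20)}
retrieved = [f"r{i}" for i in range(15)] + [f"x{i}" for i in range(5)]
p, r, f1 = scores_at_n(retrieved, relevant, n=20)
# here precision@20 = recall@20 = 0.75, since n equals the number of replicas
```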
      <sec id="sec-3-1">
        <title>3.1. Results</title>
        <p>
          In Fig. 4, we can observe that, although the baseline based on TF-IDF obtained acceptable
results, the adoption of the embedding methods implemented in PRILJ is significantly beneficial.
Moreover, although Doc2Vec is natively able to work with word sequences, Word2Vec always
obtains better results. This is possibly due to the fact that several paragraphs of different legal
documents may share a similar topic, and the adoption of the unique sequence ID to associate
the context with the document, as done by Doc2Vec (see [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for details), may lead to overfitting
issues.
        </p>
        <p>In Fig. 5, it is possible to clearly observe the contribution of the two-step process we propose.
Indeed, the results show that the proposed two-step model outperforms the one-step model in
all the situations. In particular, the two-step model is much more robust to the presence of noise:
although we can still observe a decrease when the noise amount increases, its impact is much
less evident. We can also observe that, in general, the number of extracted clusters k does not seem
to significantly affect the results, even if the best results are observed with k = √(|T| · 2). This
means that the documents are distributed among several topics and that learning a different
(more specialized) paragraph embedding model for each of them is helpful to retrieve significant
paragraph regularities.</p>
        <p>Finally, the comparison between the adopted ANNS and the exact computation of the cosine
similarity revealed a difference of 0.6% in terms of f1-score@n, which can be considered
negligible. On the other hand, the advantage in terms of efficiency is significant: the exact
search required up to 1000x the time taken by the ANNS implemented in PRILJ (see Table 1).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this work, we discussed PRILJ, a novel approach to identify paragraph regularities in legal
judgments. PRILJ represents documents and paragraphs thereof in a numerical feature space
by exploiting embedding methods able to capture the context and the semantics. Moreover,
PRILJ is based on a two-step model that groups similar documents into clusters and, for each
of them, learns a specific paragraph embedding model. This approach allows us to properly
capture the peculiarities exhibited by paragraphs and documents of similar topics and to handle the
presence of noise in a robust manner. Finally, PRILJ is able to identify paragraph regularities
very efficiently, thanks to an ANNS strategy.</p>
      <p>Our extensive experimental evaluation has shown the accuracy and the efficiency of the
developed approach on real data. This means that PRILJ can be considered a useful tool in
real-world scenarios, also when large collections of legal documents have to be analyzed.</p>
      <p>GP acknowledges the support of the Ministry of Universities and Research through the project “Big
Data Analytics”, AIM 1852414-1 (line 1).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Measuring similarity among legal court case documents</article-title>
          ,
          <source>in: Proc. of the 10th Annual ACM India Compute Conference, Association for Computing Machinery</source>
          ,
          <year>2017</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Medvedeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vols</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wieling</surname>
          </string-name>
          ,
          <article-title>Using machine learning to predict decisions of the european court of human rights</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>28</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Berkhin</surname>
          </string-name>
          ,
          <article-title>A survey of clustering data mining techniques</article-title>
          .
          <source>Grouping Multidimensional Data: Recent Advances in Clustering</source>
          .
          <volume>10</volume>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          ,
          <source>in: Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, KDD'96</source>
          ,
          <year>1996</year>
          , p.
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Loglisci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Elia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          ,
          <article-title>Hierarchical and Overlapping Co-Clustering of mRNA: miRNA Interactions</article-title>
          ,
          <source>in: ECAI</source>
          <year>2012</year>
          , volume
          <volume>242</volume>
          <source>of Frontiers in Artificial Intelligence and Applications</source>
          , IOS Press,
          <year>2012</year>
          , pp.
          <fpage>654</fpage>
          -
          <lpage>659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Corizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          ,
          <article-title>DENCAST: distributed density-based clustering for multi-target regression</article-title>
          ,
          <source>J. Big Data</source>
          <volume>6</volume>
          (
          <year>2019</year>
          )
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Distributed representations of sentences and documents</article-title>
          ,
          <source>31st International Conference on Machine Learning, ICML 2014 4</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. B.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Similarity analysis of legal judgments</article-title>
          ,
          <source>in: Proceedings of the 4th Bangalore Annual Compute Conference, Compute</source>
          <year>2011</year>
          , Bangalore, India, March
          <volume>25</volume>
          -26,
          <year>2011</year>
          , ACM,
          <year>2011</year>
          , p.
          <fpage>17</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Minocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Finding relevant indian judgments using dispersion of citation network</article-title>
          ,
          <source>in: Proceedings of the 24th International Conference on World Wide Web, Association for Computing Machinery</source>
          ,
          <year>2015</year>
          , p.
          <fpage>1085</fpage>
          -
          <lpage>1088</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. B.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suri</surname>
          </string-name>
          ,
          <article-title>Finding similar legal judgements under common law system</article-title>
          ,
          <source>in: Databases in Networked Information Systems</source>
          , Springer Berlin Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec</article-title>
          ,
          <source>Information Sciences 477</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bernhardsson</surname>
          </string-name>
          , Annoy at github, https://github.com/spotify/annoy,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement (v1.0)</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>