<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Two-Step Method based on Embedding and Clustering to Identify Regularities in Legal Case Judgements</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Graziella</forename><surname>De Martino</surname></persName>
							<email>graziella.demartino@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via Orabona, 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gianvito</forename><surname>Pio</surname></persName>
							<email>gianvito.pio@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via Orabona, 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Big Data Laboratory</orgName>
								<orgName type="institution">National Interuniversity Consortium for Informatics</orgName>
								<address>
									<addrLine>Via Ariosto, 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Rome</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michelangelo</forename><surname>Ceci</surname></persName>
							<email>michelangelo.ceci@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via Orabona, 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Big Data Laboratory</orgName>
								<orgName type="institution">National Interuniversity Consortium for Informatics</orgName>
								<address>
									<addrLine>Via Ariosto, 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Rome</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Department of Knowledge Technologies</orgName>
								<orgName type="institution">Jožef Stefan Institute</orgName>
								<address>
									<addrLine>Jamova cesta 39</addrLine>
									<postCode>1000</postCode>
									<settlement>Ljubljana</settlement>
									<country key="SI">Slovenia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Two-Step Method based on Embedding and Clustering to Identify Regularities in Legal Case Judgements</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">B4A6871567F4B0DB87145785FEFD776A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Legal Information Retrieval</term>
					<term>Embedding</term>
					<term>Clustering</term>
					<term>Approximate Nearest Neighbor Search</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In an era characterized by rapid technological progress that introduces new scenarios every day, working in the legal field may appear very difficult without the support of the right tools. In this paper, we discuss a recently submitted work that proposes a novel method, called PRILJ, which identifies paragraph regularities in legal case judgments to support legal experts during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the paragraphs most similar to those of a document under preparation. Our extensive experimental evaluation, performed on a real-world dataset, proves the effectiveness and the efficiency of the proposed method even when documents contain noisy data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The legal sector is generally characterized by a slow response to the new scenarios that appear every day in modern society. In this context, Artificial Intelligence (AI) methods can support the design of advanced (also automated) solutions to improve the efficiency of the processes in this field. Among the attempts in this direction, we can mention the work presented in <ref type="bibr" target="#b0">[1]</ref>, where the authors applied AI techniques to measure the similarity among legal case documents, which can be useful to speed up the identification and analysis of judicial precedents. Another relevant example is the work in <ref type="bibr" target="#b1">[2]</ref>, where the authors consider the semi-automation of some legal tasks, such as the prediction of judicial decisions of the European Court of Human Rights.</p><p>Following this line of research, in this discussion paper, we describe a novel method, called PRILJ, that identifies paragraph regularities in legal case judgements to support legal experts during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the most similar paragraphs. 
Therefore, given a (possibly incomplete or under preparation) document, henceforth called the target document, PRILJ supports the retrieval of similar paragraphs appearing in a set of reference documents related to previously transcribed legal case judgments.</p><p>Document clustering has received a lot of attention from the research community, but, together with the design of advanced algorithms <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>, the most critical aspect lies in the design of a proper representation of the objects/items at hand <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>, as well as of suitable similarity measures. In the literature, we can find several document similarity measures implemented through a) network-based approaches <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>, b) text-based methods <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b0">1]</ref> or c) hybrid approaches <ref type="bibr" target="#b10">[11]</ref>.</p><p>In this context, PRILJ has the main advantage of properly combining embedding methods, to capture the semantics, with a two-step approach, which consists in learning a different representation for each group of documents, rather than a single model. This allows us to capture the peculiarities of paragraphs according to the specific topic represented by each cluster of documents.</p><p>Our extensive experimental evaluation, performed on a real-world dataset, proves the effectiveness and the efficiency of the proposed method. In particular, its ability to model different topics of legal documents, as well as to capture the semantics of the textual content, appears very beneficial for the considered task and makes PRILJ very robust to the possible presence of noise in the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Method</head><p>Before describing PRILJ, in the following, we provide some useful definitions:</p><p>• Training set 𝐷 𝑇 : a collection of legal judgments, represented as textual documents, adopted to train our models; • Reference set 𝐷 𝑅 : a collection of legal judgments, represented as textual documents, from which we are interested in identifying paragraph regularities; • Target document 𝑑: a legal judgment (possibly under preparation) for which we are interested in identifying paragraph regularities from 𝐷 𝑅 .</p><p>The training set and the reference set may fully (or partially) overlap, i.e., 𝐷 𝑇 = 𝐷 𝑅 (or 𝐷 𝑇 ∩ 𝐷 𝑅 ≠ ∅), namely, the set of documents adopted to train our models may be the same as (or overlap with) the collection from which we want to identify paragraph regularities with respect to the target document. Note that PRILJ is fully unsupervised and the target document 𝑑 is never contained in either the training set or the reference set (i.e., 𝑑 ∉ (𝐷 𝑇 ∪ 𝐷 𝑅 )). The three phases of PRILJ are detailed in the following subsections. </p></div>
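These set-level constraints can be sketched with plain set operations; the document identifiers below are purely illustrative:

```python
# Illustrative sketch of the constraints on the training set D_T, the
# reference set D_R, and the target document d (identifiers are toy values).
D_T = {"doc1", "doc2", "doc3"}   # training set: documents used to fit the models
D_R = {"doc2", "doc3", "doc4"}   # reference set: documents mined for regularities
d = "doc5"                       # target document, possibly under preparation

# The two sets may fully or partially overlap ...
assert D_T & D_R                 # non-empty intersection: partial overlap allowed
# ... but the target document never belongs to either set.
assert d not in (D_T | D_R)
```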
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Training phase</head><p>As shown in Fig. <ref type="figure" target="#fig_0">1</ref>, PRILJ starts with the application of some pre-processing steps to the documents in 𝐷 𝑇 . Specifically, the pre-processing consists of: i) lowercasing the text, ii) removing punctuation and digits, iii) applying lemmatization, and iv) removing rare words. The pre-processed documents are then used to train a document embedding model 𝑀 , which is subsequently exploited to represent each document of the training set 𝐷 𝑇 in the latent feature space, obtaining the set of embedded training documents 𝐸 𝑇 . Such documents are then partitioned into 𝑘 clusters [𝐶 1 , 𝐶 2 , ..., 𝐶 𝑘 ] by adopting the 𝑘-means clustering algorithm. Each cluster of documents becomes the input for a further learning step at the paragraph level: documents falling in the same cluster will contribute to the learning of a specific paragraph embedding model. Algorithmically, for each document cluster 𝐶 𝑖 , 1 ≤ 𝑖 ≤ 𝑘, we extract the paragraphs (i.e., sentences delimited by a full stop) from the documents falling into 𝐶 𝑖 and train a paragraph embedding model 𝑃 𝑖 . This approach allows us to learn more specific paragraph embedding models, according to the topic possibly represented by the identified clusters.</p><p>The embedding models, both at the document level and at the paragraph level, are learned by PRILJ through neural network architectures based on Word2Vec Continuous-Bag-of-Words (CBOW) <ref type="bibr" target="#b6">[7]</ref> or Doc2Vec <ref type="bibr" target="#b7">[8]</ref> distributed memory (PV-DM). This choice is motivated by the fact that previous works demonstrated the superiority of Word2Vec and Doc2Vec over classical counting-based approaches, since they take into account both the syntax and the semantics of the text <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b0">1]</ref>. 
In addition, their ability to catch the semantics and the context of single words and paragraphs allows them to properly represent new (previously unseen) documents whose features have not been explicitly observed during the training phase.</p></div>
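Pre-processing steps i)–iv) can be sketched as follows; this is a minimal stand-in for PRILJ's actual pipeline, with lemmatization stubbed out and the rare-word threshold (`min_count`) chosen arbitrarily, since no specific threshold is stated here:

```python
import re
from collections import Counter

def preprocess(docs, min_count=2):
    """Pre-processing sketch: (i) lowercase, (ii) remove punctuation and
    digits, (iii) lemmatize (stubbed here), (iv) drop rare words."""
    tokenized = []
    for doc in docs:
        text = doc.lower()                      # (i) lowercasing
        text = re.sub(r"[^a-z\s]", " ", text)   # (ii) strip punctuation/digits
        # (iii) lemmatization stubbed: a real pipeline would use e.g. spaCy/NLTK
        tokenized.append(text.split())
    counts = Counter(t for doc in tokenized for t in doc)
    # (iv) remove rare words; min_count is an illustrative assumption
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]

docs = ["The Court, in 2020, ruled...", "The court ruled again!", "Unrelated text."]
print(preprocess(docs))  # → [['the', 'court', 'ruled'], ['the', 'court', 'ruled'], []]
```

The embedding and clustering steps that follow (Word2Vec/Doc2Vec training and 𝑘-means over the embedded documents) are omitted here, as they rely on external libraries.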
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Paragraph embedding of the reference set</head><p>In Fig. <ref type="figure" target="#fig_1">2</ref>, we show the workflow followed by PRILJ to represent the paragraphs of the documents belonging to the reference set in a latent feature space. Analogously to the training phase, we pre-process the documents of the reference set 𝐷 𝑅 . Then, each document of the reference set is embedded using the previously learned document embedding model 𝑀 . The embedded representation of the document is then used to identify the closest document cluster, which corresponds to the most suitable paragraph embedding model (i.e., 𝑃 𝑐 ) to embed its paragraphs. The set of all the embedded paragraphs 𝐸 𝑅 is finally returned. Paragraph regularities for a given target document 𝑑 will be identified from such set 𝐸 𝑅 .</p></div>
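The selection of the closest document cluster, and hence of the paragraph embedding model to use, can be sketched as below; the use of Euclidean distance to the 𝑘-means centroids is an assumption consistent with 𝑘-means, and all vectors are toy values:

```python
import math

def closest_cluster(doc_vec, centroids):
    """Return the index of the k-means centroid closest to an embedded
    document, i.e. the cluster whose paragraph embedding model P_i is used."""
    dists = [math.dist(doc_vec, c) for c in centroids]  # Euclidean distances
    return dists.index(min(dists))

centroids = [(0.0, 0.0), (10.0, 10.0)]       # toy cluster centroids
assert closest_cluster((1.0, 0.5), centroids) == 0
assert closest_cluster((9.0, 11.0), centroids) == 1
```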
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Identification of paragraph regularities</head><p>The final phase, whose workflow is represented in Fig. <ref type="figure" target="#fig_2">3</ref>, starts by following the same steps mentioned in Sec. 2.2 to represent each paragraph of the target document 𝑑 in the paragraph embedding space. Specifically, the most appropriate paragraph embedding model, selected by identifying the document cluster closest to 𝑑, is adopted to embed its paragraphs. For each embedded paragraph, we finally identify the top-𝑛 most similar paragraphs from the set of embedded paragraphs 𝐸 𝑅 belonging to the reference set.</p><p>It is noteworthy that their identification could straightforwardly be based on the computation of vector-based similarity/distance measures (e.g., cosine similarity, Euclidean distance, etc.) between the embedded paragraphs of the target document 𝑑 and all the embedded paragraphs of the reference set 𝐸 𝑅 . Such a pairwise comparison would be computationally intensive and would lead to inefficiencies during the adoption of the proposed system in a real-world scenario. To overcome this issue, we adopt a more advanced method for the identification of the top-𝑛 most similar paragraphs, based on random projections. In particular, we propose an approach based on Annoy <ref type="bibr" target="#b12">[13]</ref>, whose idea is to perform an approximated nearest neighbor search (ANNS) consisting of two phases: index construction, performed on the paragraphs of the reference set, and search, which occurs when we actually need to identify the top-𝑛 most similar paragraphs with respect to a paragraph of the target document. During the index construction, we build 𝑇 binary trees, where each tree is built by recursively partitioning the input set of vectors: at each step, two vectors are randomly selected and a hyperplane equidistant from them is defined. 
It is noteworthy that, even if the partitioning is random, vectors that are close to each other in the feature space are more likely to appear close to each other in the tree. During the search process, a priority queue is exploited and each tree is recursively traversed, where the priority of each split node is defined according to its distance to the query vector (i.e., a paragraph of the target document, in our case). This process leads to the identification of the 𝑇 leaf nodes into which the query vector falls. The distance between the query vector and the set of vectors falling into the identified leaves is finally exploited to return the top-𝑛 most similar paragraphs <ref type="bibr" target="#b13">[14]</ref>.</p></div>
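A minimal, single-tree sketch of this Annoy-style index construction and search is given below. Annoy itself builds 𝑇 trees and shares a priority queue across them; everything here, including the leaf size, is a simplified illustration rather than PRILJ's implementation:

```python
import random

def build_tree(vectors, ids, leaf_size=4):
    """Index construction: recursively split points by a hyperplane
    equidistant from two randomly chosen vectors (single-tree sketch)."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    a, b = (vectors[i] for i in random.sample(ids, 2))
    # Hyperplane equidistant from a and b: normal a - b, through the midpoint
    normal = [x - y for x, y in zip(a, b)]
    mid = [(x + y) / 2 for x, y in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, mid))
    left = [i for i in ids if sum(n * x for n, x in zip(normal, vectors[i])) <= offset]
    right = [i for i in ids if i not in left]
    if not left or not right:        # degenerate split: stop recursing
        return ("leaf", ids)
    return ("node", normal, offset,
            build_tree(vectors, left, leaf_size),
            build_tree(vectors, right, leaf_size))

def search(tree, vectors, query, n):
    """Search: descend to the leaf the query falls into, then rank the
    vectors in that leaf by exact distance (no multi-tree priority queue)."""
    while tree[0] == "node":
        _, normal, offset, left, right = tree
        side = sum(a * b for a, b in zip(normal, query))
        tree = left if side <= offset else right
    leaf_ids = tree[1]
    return sorted(leaf_ids,
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(vectors[i], query)))[:n]

random.seed(0)
pts = [(float(i), float(i % 3)) for i in range(12)]
tree = build_tree(pts, list(range(len(pts))))
print(search(tree, pts, pts[5], n=3))  # pts[5] itself is always ranked first
```

Note that a query vector always reaches the leaf containing its own point, since build and search apply the same side test; approximation errors only concern the *other* near neighbours, which may fall into different leaves.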
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>All the experiments were performed using a real-world dataset consisting of 4,181 official public EU legal documents, provided by EUR-Lex (https://eur-lex.europa.eu/homepage.html), in a 10-fold cross-validation setting. All the documents of the testing set were considered as target documents, while the reference set was built by constructing 20 replicas of each paragraph of the documents in the testing set, perturbed by introducing a controlled amount of noise. In particular, noise was introduced by replacing a given percentage of the words of each paragraph with random words selected from the Oxford dictionary (raw.githubusercontent.com/cduica/Oxford-Dictionary-Json/master/dicts.json). In our experiments, we considered different levels of noise, namely 10%, 20%, 30%, 40%, 50% and 60%, in order to evaluate the robustness of the proposed approach.</p><p>In order to assess the specific contribution of the adopted embedding strategies, we compared the results obtained through Word2Vec and Doc2Vec with those achieved using a baseline strategy, i.e., the classical TF-IDF. In all the cases, we adopted a 50-dimensional feature vector. Note that we use 50 features since it is a commonly used dimensionality in other pre-trained embedding models. For TF-IDF, we selected the top-50 words showing the highest frequency across the set of legal judgments.</p><p>We specifically evaluated the contribution of the two-step model implemented in PRILJ with different numbers of clusters, i.e., 𝑘 ∈ { √︀ |𝐷 𝑇 |/2, √︀ |𝐷 𝑇 |, √︀ |𝐷 𝑇 | • 2}, and compared the observed performance with that obtained without grouping training documents into clusters (henceforth denoted as one-step model).</p><p>Finally, we evaluated the effectiveness and the efficiency of the approach implemented in PRILJ for the identification of the top-𝑛 most similar paragraphs based on ANNS (with 𝑇 = 100). Specifically, we performed an additional comparative analysis against a non-approximated solution based on the cosine similarity, on a subset of 100 documents randomly selected from the dataset. 
This analysis was performed considering the best number of clusters 𝑘, and also focused on evaluating the advantages in terms of computational efficiency.</p><p>As evaluation measures, we collected precision@n, recall@n and f1-score@n, averaged over the paragraphs of the target documents and over the 10 folds, with 𝑛 ∈ {5, 10, 15, 20, 50, 100}. Specifically, for each paragraph of a target document in the testing set, we considered as True Positives the number of correctly retrieved (perturbed) replicas from the reference set. Note that, in this discussion paper, due to space constraints, we only show the results in terms of f1-score@20. </p></div>
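A sketch of how precision@n, recall@n and f1-score@n can be computed for a single target paragraph, under the 20-replicas setup described above (the function and identifiers are hypothetical, not from the paper's code):

```python
def prf_at_n(retrieved, relevant, n):
    """precision@n, recall@n and f1-score@n for one target paragraph:
    'relevant' is the set of its (perturbed) replicas in the reference set."""
    top = retrieved[:n]
    tp = sum(1 for p in top if p in relevant)   # correctly retrieved replicas
    precision = tp / n
    recall = tp / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy check: 20 replicas per paragraph, 15 of them retrieved in the top-20
relevant = {f"rep{i}" for i in range(20)}
retrieved = [f"rep{i}" for i in range(15)] + [f"other{i}" for i in range(5)]
p, r, f = prf_at_n(retrieved, relevant, n=20)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.75 0.75 0.75
```

In the experiments these per-paragraph scores are then averaged over all target paragraphs and over the 10 folds.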
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1: ANNS vs. Cosine (table body not extracted)</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Results</head><p>In Fig. <ref type="figure" target="#fig_4">4</ref> we can observe that, although the baseline based on TF-IDF obtained acceptable results, the adoption of the embedding methods implemented in PRILJ is significantly beneficial. Moreover, although Doc2Vec is natively able to work with word sequences, Word2Vec always obtains better results. This is possibly due to the fact that several paragraphs of different legal documents may share a similar topic, and the adoption of a unique sequence ID to associate the context with the document, as done by Doc2Vec (see <ref type="bibr" target="#b7">[8]</ref> for details), may lead to overfitting issues. In Fig. <ref type="figure" target="#fig_5">5</ref>, it is possible to clearly observe the contribution of the two-step process we propose. Indeed, the results show that the proposed two-step model outperforms the one-step model in all situations. In particular, the two-step model is much more robust to the presence of noise: although we can still observe a decrease when the noise amount increases, its impact is much less evident. We can also observe that, in general, the number of extracted clusters 𝑘 does not seem to significantly affect the results, even if the best results are observed with 𝑘 = √︀ |𝐷 𝑇 | • 2. This means that the documents are distributed among several topics and that learning a different (more specialized) paragraph embedding model for each of them is helpful to retrieve significant paragraph regularities.</p><p>Finally, the comparison between the adopted ANNS and the exact computation of the cosine similarity revealed a difference of 0.6% in terms of f1-score@n, which can be considered negligible. On the other hand, the advantage in terms of efficiency is significant: the exact search required up to 1000x the time taken by the ANNS implemented in PRILJ (see Table <ref type="table">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>In this work, we discussed PRILJ, a novel approach to identify paragraph regularities in legal judgments. PRILJ represents documents and paragraphs thereof in a numerical feature space by exploiting embedding methods able to catch the context and the semantics. Moreover, PRILJ is based on a two-step model, which groups similar documents into clusters and, for each of them, learns a specific paragraph embedding model. This approach allows us to properly catch peculiarities exhibited by paragraphs and documents of similar topics and to handle the presence of noise in a robust manner. Finally, PRILJ is able to identify paragraph regularities very efficiently, thanks to an ANNS strategy.</p><p>Our extensive experimental evaluation has shown the accuracy and the efficiency of the developed approach on real data. This means that PRILJ can be considered a useful tool in real-world scenarios, even when large collections of legal documents have to be analyzed.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Graphical overview of the training phase. Green-and red-dotted rectangles represent inputs and outputs, respectively.</figDesc><graphic coords="3,89.29,84.19,416.70,119.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Graphical overview of the paragraph embedding of the reference set. Green-and red-dotted rectangles represent inputs and outputs, respectively.</figDesc><graphic coords="4,89.29,84.18,416.69,146.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Graphical overview of the identification of paragraph regularities. Green-and red-dotted rectangles represent inputs and outputs, respectively.</figDesc><graphic coords="5,98.46,84.19,395.86,178.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: F1-score@20 results obtained when using TF-IDF, Doc2Vec or Word2Vec as embedding strategies, with the two-step model (𝑘 = √︀ |𝐷 𝑇 | • 2).</figDesc><graphic coords="6,140.13,84.19,312.54,181.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: F1-score@20 results obtained with the two-step model (with different values of 𝑘) and with the one-step model. As embedding strategy, we considered Word2Vec.</figDesc><graphic coords="6,140.13,313.12,312.53,158.73" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>GP acknowledges the support of the Ministry of Universities and Research through the project "Big Data Analytics", AIM 1852414-1 (line 1).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Measuring similarity among legal court case documents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mandal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 10th Annual ACM India Compute Conference</title>
				<meeting>of the 10th Annual ACM India Compute Conference</meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Using machine learning to predict decisions of the european court of human rights</title>
		<author>
			<persName><forename type="first">M</forename><surname>Medvedeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vols</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wieling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence and Law</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">A Survey of Clustering Data Mining Techniques</title>
		<author>
			<persName><forename type="first">P</forename><surname>Berkhin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page">10</biblScope>
		</imprint>
	</monogr>
	<note>Grouping Multidimensional Data: Recent Advances in Clustering</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A density-based algorithm for discovering clusters in large spatial databases with noise</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-P</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, KDD&apos;96</title>
				<meeting>of the 2nd International Conference on Knowledge Discovery and Data Mining, KDD&apos;96</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="226" to="231" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Hierarchical and Overlapping Co-Clustering of mRNA: miRNA Interactions</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ceci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Loglisci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>D'Elia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Malerba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Frontiers in Artificial Intelligence and Applications</title>
				<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">242</biblScope>
			<biblScope unit="page" from="654" to="659" />
		</imprint>
	</monogr>
	<note>ECAI 2012</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">DENCAST: distributed density-based clustering for multi-target regression</title>
		<author>
			<persName><forename type="first">R</forename><surname>Corizzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ceci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Malerba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Big Data</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">43</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Distributed representations of sentences and documents</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2014">2014. 2014</date>
			<biblScope unit="page">4</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Similarity analysis of legal judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">B</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Bangalore Annual Compute Conference, Compute 2011</title>
				<meeting>the 4th Bangalore Annual Compute Conference, Compute 2011<address><addrLine>Bangalore, India</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">March 25-26, 2011</date>
			<biblScope unit="page">17</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Finding relevant indian judgments using dispersion of citation network</title>
		<author>
			<persName><forename type="first">A</forename><surname>Minocha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th International Conference on World Wide Web</title>
				<meeting>the 24th International Conference on World Wide Web</meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1085" to="1088" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Finding similar legal judgements under common law system</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">B</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Databases in Networked Information Systems</title>
				<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="103" to="116" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec</title>
		<author>
			<persName><forename type="first">K</forename><surname>Donghwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">477</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Bernhardsson</surname></persName>
		</author>
		<ptr target="https://github.com/spotify/annoy" />
		<title level="m">Annoy at github</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
		<title level="m">Approximate nearest neighbor search on high dimensional data -experiments, analyses, and improvement</title>
				<imprint>
			<publisher>CoRR</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
