<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Clustering Amendments with Semantic Embeddings</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Sajeva</surname></persName>
							<email>alessandro.sajeva@uniroma3.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università degli Studi Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stefano</forename><surname>Iannucci</surname></persName>
							<email>stefano.iannucci@uniroma3.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università degli Studi Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carlo</forename><surname>Marchetti</surname></persName>
							<email>carlo.marchetti@senato.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Senato della Repubblica Italiana</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
							<email>paolo.merialdo@uniroma3.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università degli Studi Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Riccardo</forename><surname>Torlone</surname></persName>
							<email>riccardo.torlone@uniroma3.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università degli Studi Roma Tre</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Clustering Amendments with Semantic Embeddings</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D20E8DD3FA9711D08137DD9FF3370994</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:07+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>language models</term>
					<term>embeddings</term>
					<term>clustering</term>
					<term>public affairs</term>
					<term>ORCID: 0009-0000-9419-3015 (A. Sajeva); 0000-0001-7485-9772 (S. Iannucci); 0000-0002-3852-8092 (P. Merialdo); 0000-0003-1484-3693 (R. Torlone)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Italian Senate faces the problem of clustering amendments to optimize the scheduling of parliamentary sessions. Currently, this task is carried out by Similis, an application that tackles this problem by using a traditional term-frequency technique, which leads to clustering based on wording rather than semantics. Recent advances in natural language processing have led Italian institutions to investigate the adoption of pre-trained language models (PTLMs) for text analysis. Along this line, in this paper, we propose CLAMSE, an alternative system to Similis that uses Sentence-BERT pre-trained models to generate embeddings and then groups similar amendments through hierarchical agglomerative clustering. Our preliminary evaluation shows that CLAMSE achieves comparable performance to Similis using embeddings generated by pre-trained models without fine-tuning, paving the way for applying a clustering method with advanced contextual understanding. This study contributes to enhancing the effectiveness of institutional decision-making processes through the adoption of PTLMs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Pre-Trained Language Models (PTLMs) are emerging as valuable allies in addressing various problems across diverse domains and represent a great opportunity for enhancing parliamentary efficiency and effectiveness. In this context, and within the framework of a collaboration between Roma Tre University and the Italian Senate, an interest in systems based on PTLMs to support parliamentary activities has taken shape. This paper investigates the problem of clustering similar amendments.</p><p>Amendments represent proposed changes to legislative texts and are a fundamental element of the legislative process. They may vary widely in terms of wording but often share similar intentions and objectives. Similar amendment proposals should be discussed simultaneously, if possible. Therefore, clustering amendments according to their similarity is an essential activity to facilitate the work of officials and effectively organize voting sessions, while ensuring the coherence and completeness of legislative proposals. Indeed, amendments that differ by only a few words are usually proposed in large numbers by parliamentary groups that want to filibuster, with the aim of slowing down the legislative process. Combining debates on similar amendment proposals therefore allows for the greatest possible efficiency in terms of time.</p><p>In this paper, we aim to explore the potential of PTLMs in such a crucial context. The Senate already has a tool to support this activity, called Similis, which adopts a traditional term-frequency technique. Although Similis is effective in grouping short amendments sharing many tokens, it loses effectiveness with longer amendments that adopt different lexicons yet preserve the same semantics.</p><p>To overcome this issue we have investigated an alternative solution that leverages a PTLM. 
Our approach has been implemented in CLAMSE (Clustering Amendments with Semantic Embeddings), an alternative system to Similis, which relies on Sentence-BERT <ref type="bibr" target="#b0">[1]</ref> PTLMs to convert amendments into embeddings, and then groups similar embeddings via a hierarchical agglomerative clustering (HAC) technique.</p><p>The preliminary experiments we conducted yielded promising results, showing the effectiveness of adopting PTLM-based solutions in demanding processes of public administration, with the advantage of simplifying development by eliminating the need for from-scratch implementations or task-specific models.</p><p>The rest of the paper is organized as follows. In Section 2 we discuss related work and the pre-existing approach to amendment clustering. In Section 3 we illustrate our PTLM-based solution, and in Section 4 we report on its evaluation. Finally, in Section 5 we draw some conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work and earlier solution</head><p>In recent years, there has been a growing emphasis on leveraging advanced text processing techniques to enhance institutional work globally. In the context of the collaboration between Roma Tre University and the Senate of the Italian Republic, and more generally in the application of artificial intelligence systems to support legislative activities, several important studies have been carried out. A machine learning system for the classification of documents has been developed <ref type="bibr" target="#b1">[2]</ref>, and another important contribution concerned a system that exploits Sentence-BERT <ref type="bibr" target="#b0">[1]</ref> to align stenographic reports with audio recordings of parliamentary sessions <ref type="bibr" target="#b2">[3]</ref>; both represent steps forward in the digital transformation of legislative practices.</p><p>A pivotal aspect of this evolution lies in the transition from traditional vectorization methods towards more semantic approaches. In the domain of text clustering, Term Frequency-Inverse Document Frequency (TF-IDF) stands out as one of the most commonly employed methods for representing textual data. However, TF-IDF fails to capture the positional and contextual aspects of words within sentences. Simulations have shown that BERT consistently outperforms the TF-IDF method <ref type="bibr" target="#b3">[4]</ref>. Furthermore, recent studies have employed Sentence-BERT as a vectorization technique before clustering analysis <ref type="bibr" target="#b4">[5]</ref>, and in comparisons with traditional methods, Sentence-BERT emerged as particularly adept at capturing topics within clustering algorithms <ref type="bibr" target="#b5">[6]</ref>. 
Other research has also showcased the effectiveness of semantic embeddings derived from foundation models for clustering, in the realm of dataset deduplication <ref type="bibr" target="#b6">[7]</ref>.</p><p>This transition towards semantic representations marks a paradigm shift from shallow representations capturing syntactic similarity to deep contextual understanding, providing a more sophisticated comprehension of text.</p><p>Similis. The IT department of the Italian Republic Senate has developed a solution for clustering similar amendments, called Similis, in close collaboration with the Institute of Legal Informatics and Judicial Systems (CNR-IGSG) <ref type="bibr" target="#b7">[8]</ref>. The similarity sought by Similis focuses on the wording of sentences, thus favoring texts with high syntactic rather than semantic coherence. The algorithm does not rely on a priori information about amendments, and groups amendments by means of HAC with complete linkage <ref type="bibr" target="#b8">[9]</ref>. The content of each amendment is represented by a vector built according to a bag-of-words model with term frequency (TF) and Euclidean normalization <ref type="bibr" target="#b9">[10]</ref>, after word stemming and stop-word removal. Amendment similarity is measured with the traditional cosine similarity metric. The thresholds for the cosine similarity and the dendrogram cut-off have been identified empirically and are set to 0.8 and 0.2, respectively.</p></div>
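As a concrete illustration, the Similis-style representation described above can be sketched as follows. This is a minimal sketch, not the Senate's implementation: the example texts and variable names are ours, while the TF vectors, Euclidean normalization, cosine similarity, and the 0.8 threshold follow the description in the text.

```python
# Sketch of a Similis-style comparison: bag-of-words term-frequency vectors
# with Euclidean (L2) normalization, compared via cosine similarity.
# (Stemming and stop-word removal, also used by Similis, are omitted here.)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

amendments = [
    "sopprimere il comma 2",
    "al comma 2, sopprimere le parole",
]
tf = CountVectorizer().fit_transform(amendments).toarray().astype(float)
tf = normalize(tf, norm="l2")      # Euclidean normalization of TF vectors
cosine = float(tf[0] @ tf[1])      # cosine similarity of unit vectors
similar = cosine >= 0.8            # empirically chosen Similis threshold
```

With these toy texts the two amendments share only part of their wording, so the cosine similarity falls below the 0.8 threshold and they would not be grouped.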
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">CLAMSE</head><p>Our semantic embedding-based solution consists of a pipeline with three phases: (𝑖) preprocessing, (𝑖𝑖) encoding, and (𝑖𝑖𝑖) clustering.</p><p>The system takes as input a set of amendments that refer to a single Senate act. The preprocessing phase cleans the text of each amendment from special characters, thereby increasing the quality of the embeddings and, consequently, improving clustering performance. The preprocessed text is then sent to the encoding block, which computes an embedding for each amendment. Sentence-BERT models are implemented in a Python framework known as SentenceTransformers, which provides several PTLMs, each with specific characteristics. The embeddings produced by each model serve as input for the clustering phase, which implements a traditional HAC. The final solution is determined by selecting the best clustering among those generated by applying HAC to each corpus of embeddings obtained in the encoding phase.</p><p>In the following, we describe the three phases in more detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Preprocessing</head><p>The input dataset, which is structured as a JSON Lines (JSONL) file, is transformed into the standard JSON format for readability. The JSON file contains, for each amendment, a record with the text and the amendment number. Each record is also associated with a cluster identifier, which represents the ground truth for the clustering assignment. <ref type="foot" target="#foot_0">1</ref>The text of the amendments includes numerous tags and annotations designed for display in web browsers but lacking any semantic significance. We therefore remove them to enhance clustering performance. In particular, we remove all occurrences of the HTML macro for the blank space ("&amp;nbsp"), and we replace all occurrences of HTML tags with whitespace characters.</p></div>
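A minimal sketch of this cleaning step follows. Only the "&amp;nbsp" removal and the tag-to-whitespace replacement come from the description above; the function name and the final whitespace collapsing are our assumptions.

```python
import re

def clean_amendment(text: str) -> str:
    """Strip display-oriented HTML artifacts from an amendment's text."""
    text = text.replace("&nbsp;", " ").replace("&nbsp", " ")  # blank-space macro
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags -> whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace (our addition)

print(clean_amendment("<p>Al comma&nbsp;1, <b>sopprimere</b> le parole</p>"))
# -> Al comma 1, sopprimere le parole
```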
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Amendment encoding with Sentence-BERT</head><p>The cleaned corpus of amendments is passed to the encoding module. Here, a range of Sentence-BERT PTLMs is loaded, and each model generates an embedding for the text of each amendment. Not all of the pre-trained models produce normalized embeddings. Since normalized embeddings lead to improved performance, a normalization step is added: each vector is divided by its L2 norm, so that the resulting vector has unit length (L2 normalization). Table <ref type="table" target="#tab_0">1</ref> reports the PTLMs that we considered in our experimental activities. Since the amendments in the experimental dataset average 137 tokens but can reach several thousand, models with a maximum sequence length of less than 512 tokens were excluded from the study.<ref type="foot" target="#foot_1">2</ref> </p></div>
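The normalization step can be sketched in NumPy as follows (the vectors here are toy numbers, not real model output; with SentenceTransformers a comparable effect can be obtained by passing normalize_embeddings=True to the encode method):

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Divide each embedding by its L2 norm so every vector has unit length."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

emb = np.array([[3.0, 4.0], [0.0, 2.0]])   # toy embeddings
unit = l2_normalize(emb)                   # [[0.6, 0.8], [0.0, 1.0]]
```

Unit-length vectors make cosine similarity reduce to a plain dot product, which simplifies the clustering phase.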
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Clustering algorithm</head><p>In the last phase, a clustering activity is carried out on each corpus of embeddings generated in the previous phase. In particular, we adopt a HAC approach. Roughly speaking, a HAC algorithm operates as follows. Initially, each sample in the input dataset is treated as an individual cluster. The algorithm then iterates, merging at each step the most similar pair of clusters until only one cluster remains. The decision on which pair of clusters to merge is based on the cosine similarity between the embeddings representing the amendments, with complete linkage, i.e., the similarity between two clusters equals the similarity between their two most dissimilar samples, one in each cluster. The cut-off threshold for the dendrogram generated by the HAC algorithm is set to the value that maximizes the quality of the clusters. In CLAMSE, we use the silhouette score to measure clustering quality: it measures how similar an element is to the members of its own cluster compared to the members of other clusters. It ranges from −1 to +1, where a high value indicates cohesion between elements of the same cluster and separation between elements of different clusters. Since we obtain a clustering solution for each PTLM used to generate the embeddings, we choose the clustering with the highest silhouette score as the final solution.</p></div>
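A hedged sketch of this phase with SciPy and scikit-learn follows. The synthetic unit embeddings, the grid of candidate cut-offs, and the cluster-count guard are our assumptions; the complete-linkage HAC over cosine distances and the silhouette-based selection are from the description above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic "embeddings": three groups around orthogonal directions in R^8.
X = np.vstack([rng.normal(c, 0.02, (5, 8)) for c in np.eye(3, 8)])
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit vectors

Z = linkage(X, method="complete", metric="cosine")   # HAC dendrogram
best_labels, best_sil = None, -1.0
for t in np.linspace(0.01, 1.0, 50):                 # candidate cut-off heights
    labels = fcluster(Z, t, criterion="distance")
    if 1 < len(set(labels)) < len(X):                # silhouette needs 2..n-1 clusters
        sil = silhouette_score(X, labels, metric="cosine")
        if sil > best_sil:
            best_labels, best_sil = labels, sil
```

On this well-separated toy data the silhouette-optimal cut recovers the three planted groups.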
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental evaluation</head><p>The evaluation of CLAMSE is carried out in terms of both its absolute performance and in comparison to the pre-existing Similis solution. In particular, we first analyze the performance of the different PTLMs to identify the best solution that CLAMSE could achieve. Then, we compare CLAMSE to Similis. Finally, we present the results of an experiment that shows how the preprocessing phase can influence the final clustering results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Ground truth</head><p>We evaluated CLAMSE on a set of amendments, named "Act 1248". It contains 1,261 amendments that have been clustered manually (each amendment, in the original JSONL file, is annotated with a label representing the target cluster). Table <ref type="table" target="#tab_1">2</ref> reports the main characteristics of this dataset. Note the large variation in both the size of the amendments (ranging from a few to thousands of tokens) and the number of elements per cluster (ranging from singletons to tens of amendments). This makes the task of automatic amendment clustering challenging.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Evaluation metrics</head><p>The performance of Similis is reported by means of the Adjusted Rand Index (ARI) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref> and the Adjusted Mutual Information (AMI) <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>, which are standard metrics for evaluating the results of a clustering process.</p><p>ARI assesses the agreement between the clusters produced by a clustering algorithm and the ground truth, correcting for chance agreement. AMI is another measure of the similarity between two clusterings, but it is based on mutual information. Both ARI and AMI measure the similarity between the true labels in the ground truth and the clustering labels produced by CLAMSE, correcting for random agreement. Both metrics are bounded above by 1, where a score of 1 indicates that true labels and clustering labels are identical, and a score close to 0 indicates an agreement no better than chance (negative values are possible for worse-than-chance assignments).</p></div>
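A toy illustration of the two metrics with scikit-learn (the label vectors are illustrative; note that both scores are invariant to renaming the clusters):

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

truth      = [0, 0, 1, 1, 2, 2]    # ground-truth cluster labels
renamed    = [1, 1, 0, 0, 2, 2]    # same grouping, clusters renamed
one_merged = [0, 0, 1, 1, 1, 1]    # two true clusters merged by mistake

print(adjusted_rand_score(truth, renamed))         # 1.0: identical grouping
print(adjusted_mutual_info_score(truth, renamed))  # 1.0
print(adjusted_rand_score(truth, one_merged))      # < 1: the merge is penalized
```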
<div xmlns="http://www.tei-c.org/ns/1.0"><p>ARI is more suitable when the ground truth clustering has large equal-sized clusters. AMI is preferable when the ground truth clustering is unbalanced and there exist small clusters <ref type="bibr" target="#b15">[16]</ref>.</p><p>As mentioned above, the silhouette score is another metric for evaluating the quality of a clustering. In our solution, however, it is used internally by the HAC algorithm to find the optimal number of clusters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Comparison with Similis</head><p>The performance of CLAMSE compared to Similis is reported in Table <ref type="table" target="#tab_2">3</ref>, which shows that CLAMSE achieves better results for both ARI and AMI.</p><p>Table <ref type="table" target="#tab_2">3</ref> also reports the performance of CLAMSE without the preprocessing phase. Cleaning the text clearly contributes to improved performance, as the model can better capture the semantics of the amendments. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Robustness of CLAMSE</head><p>As discussed in Section 3.2, CLAMSE utilizes several PTLMs from the Sentence-BERT library, known as SentenceTransformers, and chooses the solution that produces the best clustering based on the silhouette score. In order to evaluate the robustness of the approach, for each Sentence-BERT PTLM we computed the best clustering that could be obtained from the HAC in terms of ARI and AMI. Essentially, at every iteration of the HAC algorithm, we compute ARI and AMI, which require the ground truth. The highest ARI and AMI scores represent the best performance that can be achieved by the CLAMSE approach, which chooses the best clustering of each PTLM based on the silhouette score.</p><p>Table <ref type="table" target="#tab_3">4</ref> shows, sorted by descending best AMI, the best performance achievable by CLAMSE with each of the models in Table 1. Two important observations emerge from the results of this experiment. First, in many cases the CLAMSE algorithm achieves the best results it could obtain, demonstrating the effectiveness of the silhouette score for choosing the best clustering of each model. Second, while several models outperform Similis, many others exhibit inferior performance; this underscores the importance of model selection and the effectiveness of our approach in carrying it out.</p></div>
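The per-model upper bound described above can be sketched as follows (synthetic embeddings and a synthetic ground truth stand in for a real PTLM corpus and the Act 1248 annotations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

rng = np.random.default_rng(1)
ground_truth = np.repeat([0, 1, 2], 4)               # stand-in annotations
X = np.vstack([rng.normal(c, 0.02, (4, 6)) for c in np.eye(3, 6)])

Z = linkage(X, method="complete", metric="cosine")
cuts = np.unique(Z[:, 2])                            # one cut per merge height
partitions = [fcluster(Z, t, criterion="distance") for t in cuts]
best_ari = max(adjusted_rand_score(ground_truth, p) for p in partitions)
best_ami = max(adjusted_mutual_info_score(ground_truth, p) for p in partitions)
```

Scanning every merge height visits each partition the dendrogram can produce, so the maxima bound what any cut-off selection rule, silhouette-based or otherwise, could achieve for that model.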
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>We presented CLAMSE, a system that addresses the problem of amendment clustering. CLAMSE applies a HAC algorithm to the embeddings of the amendments built by several Sentence-BERT PTLMs. This is a preliminary study whose main objective was to explore and evaluate the application of embeddings generated by PTLMs, in particular Sentence-BERT, in comparison to the traditional TF-based approach implemented by the existing system Similis.</p><p>The preliminary results are encouraging and show that CLAMSE, without fine-tuning, achieved interesting performance. While these results are promising, several limitations should be noted. First, due to the constraints of using models with a maximum input length of 512 tokens, amendments longer than this threshold are truncated, affecting approximately 3.33% of the dataset. Second, the encoding is based solely on the textual content of the amendments, ignoring the specific articles of the law to which they refer. As a result, amendments with identical text but referencing different articles may be grouped together when they should not be, potentially compromising the accuracy of the clustering results. In addition, the computational efficiency of the method is a concern, as it requires encoding the entire corpus with multiple models and running HAC on each encoding to determine the optimal clustering result.</p><p>This evaluation suggests a promising potential for the future development of CLAMSE. In particular, the possibility of training specific models for amendment clustering could be explored, taking advantage of the pre-trained models that already show good efficiency in terms of semantic embedding. 
This approach may represent an interesting research direction to further optimize the performance of the system and refine its adaptability to new datasets of amendments.</p><p>It is worth observing that Similis can leverage Linkoln <ref type="bibr" target="#b16">[17]</ref>, a system for the automatic extraction of legislative and jurisprudential references from texts in the Italian language. In the future, we plan to study the opportunity to enhance CLAMSE with the text annotations provided by Linkoln. There is pronounced enthusiasm for large language models (LLMs) over traditional PTLMs, and numerous governments are actively experimenting with LLMs for classification tasks and question answering <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref>. Our future endeavors will focus on harnessing the power of LLMs to enhance the performance of CLAMSE.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Sentence-BERT PTLMs used in our experimental activity: acronym, extended name, size (in MB), dimension of the embedding, and max sequence length</figDesc><table><row><cell>model</cell><cell>extended name</cell><cell>size</cell><cell>dimension</cell><cell>max sequence length</cell></row><row><cell>AMB1</cell><cell>all-mpnet-base-v1</cell><cell>420</cell><cell>768</cell><cell>512</cell></row><row><cell>GTXL</cell><cell>gtr-t5-xl</cell><cell>2370</cell><cell>768</cell><cell>512</cell></row><row><cell>MQMBC1</cell><cell>multi-qa-mpnet-base-cos-v1</cell><cell>420</cell><cell>768</cell><cell>512</cell></row><row><cell>MQMBD1</cell><cell>multi-qa-mpnet-base-dot-v1</cell><cell>420</cell><cell>768</cell><cell>512</cell></row><row><cell>PMB2</cell><cell>paraphrase-mpnet-base-v2</cell><cell>420</cell><cell>768</cell><cell>512</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Characteristics of the ground truth dataset "Act 1248"</figDesc><table><row><cell>Number of amendments</cell><cell>1,261</cell></row><row><cell>Average number of tokens per amendment</cell><cell>137.1</cell></row><row><cell>Max number of tokens</cell><cell>6,756</cell></row><row><cell>Min number of tokens</cell><cell>3</cell></row><row><cell>Number of clusters</cell><cell>664</cell></row><row><cell>Average number of amendments per cluster</cell><cell>1.899</cell></row><row><cell>Size of the largest cluster</cell><cell>27</cell></row><row><cell>Size of the smallest cluster</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Comparison of CLAMSE with Similis on Act 1248</figDesc><table><row><cell>model</cell><cell>AMI</cell><cell>ARI</cell></row><row><cell>CLAMSE</cell><cell>0.73485</cell><cell>0.58938</cell></row><row><cell>Similis</cell><cell>0.71476</cell><cell>0.34511</cell></row><row><cell>CLAMSE (no preprocessing)</cell><cell>0.54369</cell><cell>0.32997</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Silhouette, ARI and AMI reached through silhouette optimization on Act 1248, sorted by best AMI</figDesc><table><row><cell>model</cell><cell>silhouette</cell><cell>AMI</cell><cell>best AMI</cell><cell>ARI</cell><cell>best ARI</cell></row><row><cell></cell><cell>0.41720</cell><cell>0.75333</cell><cell>0.75333</cell><cell>0.63057</cell><cell>0.63057</cell></row><row><cell>PMB2</cell><cell>0.42515</cell><cell>0.73485</cell><cell>0.74161</cell><cell>0.58938</cell><cell>0.59005</cell></row><row><cell>MQMBD1</cell><cell>0.40534</cell><cell>0.70468</cell><cell>0.73968</cell><cell>0.57445</cell><cell>0.58206</cell></row><row><cell>AMB1</cell><cell>0.39858</cell><cell>0.73818</cell><cell>0.73865</cell><cell>0.59765</cell><cell>0.61516</cell></row><row><cell>GTXL</cell><cell>0.41664</cell><cell>0.67903</cell><cell>0.70897</cell><cell>0.46131</cell><cell>0.50561</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">In the original JSON file, these elements correspond to the 'num_em', 'text_emend', and 'id_cluster' attributes</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">For a complete list of the models see https://www.sbert.net/docs/pretrained_models.html.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.10084</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Multi-label classification of bills from the Italian Senate</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Angelis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">D</forename><surname>Cicco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lalle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Marchetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merialdo</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:254234138" />
	</analytic>
	<monogr>
		<title level="m">AIxPA@AI*IA</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Enhancing accessibility of parliamentary video streams: AI-based automatic indexing using verbatim reports</title>
		<author>
			<persName><forename type="first">D</forename><surname>Bertillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Donato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Marchetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merialdo</surname></persName>
		</author>
		<idno>no. 10892</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>EasyChair</publisher>
		</imprint>
	</monogr>
	<note type="report_type">EasyChair Preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">The performance of BERT as data representation of text clustering</title>
		<author>
			<persName><forename type="first">A</forename><surname>Subakti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Murfi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hariadi</surname></persName>
		</author>
		<idno type="DOI">10.21203/rs.3.rs-940164/v1</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Density-based spatial clustering of applications with noise and Sentence-BERT embedding for Indonesian utterance clustering</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Hasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Heryadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Arifin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lukas</surname></persName>
		</author>
		<author>
			<persName><surname>Suparta</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCoSITE57641.2023.10127683</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Computer Science, Information Technology and Engineering (ICCoSITE)</title>
				<imprint>
			<date type="published" when="2023">2023. 2023</date>
			<biblScope unit="page" from="386" to="391" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Text vectorization techniques for trending topic clustering on Twitter: A comparative evaluation of TF-IDF, Doc2Vec, and Sentence-BERT</title>
		<author>
			<persName><forename type="first">A</forename><surname>Susanto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pradita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stryadhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Setiawan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasani</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICORIS60118.2023.10352228</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Abbas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Tirumala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganguli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Morcos</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.09540</idno>
		<title level="m">SemDeDup: Data-efficient learning at web-scale through semantic deduplication</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Clustering similar amendments at the Italian senate</title>
		<author>
			<persName><forename type="first">T</forename><surname>Agnoloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Marchetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Battistoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Briotti</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.parlaclarin-1.7" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Fišer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Eskevich</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Lenardič</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>De Jong</surname></persName>
		</editor>
		<meeting>the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<publisher>European Language Resources Association</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="39" to="46" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Algorithms for hierarchical clustering: an overview</title>
		<author>
			<persName><forename type="first">F</forename><surname>Murtagh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Contreras</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="86" to="97" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Introduction to information retrieval</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Comparing partitions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Hubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Arabie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Classification</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="193" to="218" />
			<date type="published" when="1985">1985</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">On similarity indices and correction for chance agreement</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Albatineh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Niewiadomska-Bugaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mihalko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Classification</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="301" to="313" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Objective criteria for the evaluation of clustering methods</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">M</forename><surname>Rand</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Statistical Association</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="page" from="846" to="850" />
			<date type="published" when="1971">1971</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Information theoretic measures for clusterings comparison: is a correction for chance necessary?</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">X</forename><surname>Vinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Epps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bailey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th Annual International Conference on Machine Learning</title>
		<meeting>the 26th Annual International Conference on Machine Learning</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1073" to="1080" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Standardized mutual information for clustering comparisons: one step further in adjustment for chance</title>
		<author>
			<persName><forename type="first">S</forename><surname>Romano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Verspoor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
		<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1143" to="1151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Adjusting for chance clustering comparison measures</title>
		<author>
			<persName><forename type="first">S</forename><surname>Romano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">X</forename><surname>Vinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Verspoor</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v17/15-627.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="1" to="32" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m">Linkoln</title>
		<ptr target="https://linkoln.gitlab.io/" />
		<imprint>
			<publisher>Istituto di Informatica Giuridica e Sistemi Giudiziari (IGSG) del Consiglio Nazionale delle Ricerche (CNR)</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Peña</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Morales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fierrez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Serna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ortega-Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Puente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Córdova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Córdova</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-41498-5_2</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-41498-5_2" />
		<title level="m">Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs</title>
		<imprint>
			<publisher>Springer Nature Switzerland</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="20" to="33" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Application of large language model in intelligent Q&amp;A of digital government</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<idno type="DOI">10.1145/3605801.3605806</idno>
		<ptr target="https://doi.org/10.1145/3605801.3605806" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 2nd International Conference on Networks, Communications and Information Technology, CNCIT &apos;23</title>
		<meeting>the 2023 2nd International Conference on Networks, Communications and Information Technology, CNCIT &apos;23<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="24" to="27" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
