<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Tagliolato Acquaviva d'Aragona</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">CNR-IREA</orgName>
								<address>
									<addrLine>via Corti 12</addrLine>
									<postCode>20133</postCode>
									<settlement>Milano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lorenza</forename><surname>Babbini</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">INFO/RAC UNEP-MAP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gloria</forename><surname>Bordogna</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">CNR-IREA</orgName>
								<address>
									<addrLine>via Corti 12</addrLine>
									<postCode>20133</postCode>
									<settlement>Milano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Lotti</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">INFO/RAC UNEP-MAP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Annalisa</forename><surname>Minelli</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">INFO/RAC UNEP-MAP</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Oggioni</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">CNR-IREA</orgName>
								<address>
									<addrLine>via Corti 12</addrLine>
									<postCode>20133</postCode>
									<settlement>Milano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department" key="dep1">c/o ISPRA</orgName>
								<orgName type="department" key="dep2">DG-SINA</orgName>
								<address>
									<addrLine>via Vitaliano Brancati 48</addrLine>
									<postCode>00144</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E3232D4C91F922124150391234219503</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Knowledge Hub</term>
					<term>Large Language Models</term>
					<term>Natural Language Queries</term>
					<term>Knowledge graph</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This contribution outlines the design of a Knowledge Hub of heterogeneous documents related to the Mediterranean Action Plan (UNEP-MAP) of the United Nations Environment Programme <ref type="bibr" target="#b0">[1]</ref>. The Knowledge Hub is intended to help public authorities and users with different backgrounds and needs access information efficiently. Users can either formulate natural language queries or navigate an automatically generated knowledge graph to find relevant documents. The Knowledge Hub is designed on the basis of state-of-the-art Large Language Models (LLMs). A user-evaluation experiment was conducted, testing publicly available models on a subset of documents under distinct LLM settings. This step aimed to identify the best-performing model, which will then be used to classify the documents with respect to the topics of interest.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>This contribution reports the feasibility study carried out to design a Knowledge Hub (KH) for accessing documents. The KH is part of the Knowledge Management Platform (KMaP), the single access point to the whole knowledge heritage of the United Nations Environment Programme Mediterranean Action Plan (UNEP-MAP) <ref type="bibr" target="#b0">[1]</ref>.</p><p>The KH is conceived as an access point to highly heterogeneous multimedia documents distributed on the Web across the UNEP-MAP network. The documents cover marine studies, political and economic directives and environmental studies, and more generally the UNEP-MAP protocols and activities. Given the nature of these contents, the hub constitutes a knowledge base for the stakeholders of the Mediterranean Action Plan. The interested public authorities have users with different background knowledge and needs, including politicians, administrators, environmental scientists, project leaders and citizens, who need both to search and to navigate the distributed archive.</p><p>During the use case analysis, carried out by interviewing some potential stakeholders, two requirements emerged as important: the KH should let users search by formulating queries in natural language, and it should guide them in navigating the collection by organizing the documents into topics of interest <ref type="bibr" target="#b1">[2]</ref>.</p><p>To this aim, several critical aspects had to be considered in order to provide feasible solutions. The document collection is highly heterogeneous in genre, some documents being minutes of meetings while others are scientific reports; in length, ranging from a single page to reports of hundreds of pages; in language; and in format (mostly PDF, with others in HTML and JPG). Finally, the identification of the topics made during the use case analysis revealed that it is not easy to tell which documents belong to a topic, since some of them lie at the crossroads of several topics.</p><p>The approach that we deemed flexible enough to enable natural language searches is an Information Retrieval system <ref type="bibr" target="#b2">[3]</ref> based on Large Language Models (LLMs), and specifically on open source pre-trained LLMs <ref type="bibr" target="#b3">[4]</ref>.</p><p>To aid the organization of the documents into topics, we then retrieved natural language descriptions of the topics, each identified by simple keywords, and treated these descriptions as natural language queries to be submitted to the collection, represented in the continuous bag-of-words space of a pre-trained LLM.</p><p>This way, every document belongs to each topic with a distinct relevance rank. This allowed us to build a knowledge graph in which each node represents the ranked list of a topic and each edge between a pair of nodes represents the fuzzy intersection of the two ranked lists <ref type="bibr" target="#b4">[5]</ref>.</p><p>A user-evaluation experiment was conducted, testing publicly available LLMs on a subset of documents under distinct settings. This step aimed to identify the best-performing model, to be used both for implementing the information retrieval module that answers natural language queries and for classifying documents with respect to the topics. The paper reports the design steps of the KH and the evaluation experiment for selecting the best model to be applied in the future for classifying documents into topics.</p><note type="orcid">ORCID: 0000-0002-0261-313X (P. Tagliolato); 0000-0003-3302-6891 (L. Babbini); 0000-0002-6775-753X (G. Bordogna); 0000-0002-4837-4357 (A. Lotti); 0000-0003-1772-0154 (A. Minelli); 0000-0002-7997-219X (A. Oggioni)</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Knowledge Hub design</head><p>The first activity performed was the harvesting of the documents from several potential sources of interest.</p><p>To this end we relied on the knowledge of a group of experts of the leading institution ISPRA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Harvesting Documents' Collection</head><p>This step aimed at identifying the document sources, i.e., the web sites and archives with potentially interesting documents, and at characterizing them with respect to some meaningful dimensions <ref type="bibr" target="#b4">[5]</ref>. These information sources hold more than 10000 documents, mainly files, most of them in PDF format. Most information sources (20 out of 24) contain documents, and 3 of these also share images and tables, while only 3 out of 24 provide geographical layers. The resources are dedicated to 3 themes: law, regulation and management of the sea (13 out of 24), pollution (7) and biodiversity (2). Finally, 21 of the classified repositories are open to the public, while the remaining 3 are private or have restricted access. We harvested all the documents, through website scraping, from the Regional Marine Pollution Emergency Response Centre for the Mediterranean Sea [6], the Regional Activity Centre for Specially Protected Areas [7], the Regional Activity Centre for Sustainable Consumption and Production [8], the Priority Actions Programme/Regional Activity Centre <ref type="bibr" target="#b5">[9]</ref>, the UNEP-MAP library [10] and the UNEP library, restricted to items whose author was marked as UNEP-MAP [11]. For document harvesting, code was developed both by CNR-IREA in the R language and by INFO-RAC in the Python language [26]; it is freely available under the GNU GPL license. To share the files produced by the harvesting process, a GitHub repository was created <ref type="bibr">[27]</ref>. The "scraping" folder contains the R and Python scripts developed for scraping; their output is in the "results" folder.</p></div>
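<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the scraping step, the following minimal sketch collects the PDF links found in one HTML page. It is not the code of the project repository: the class name PdfLinkCollector and the function collect_pdf_links are hypothetical, and only the Python standard library is assumed. Downloading the page and the linked files is left out.</p><p>
```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of PDF files linked from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are inspected in this simple sketch.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            # Resolve relative links against the page address.
            self.pdf_urls.append(urljoin(self.base_url, href))


def collect_pdf_links(html_text, base_url):
    """Return the PDF links of a page, in document order (hypothetical helper)."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html_text)
    return parser.pdf_urls
```
</p><p>A harvesting script along these lines would call collect_pdf_links on each listing page of a source and then download the returned addresses into a results folder.</p></div>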
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Strategies for enabling documents search</head><p>Once the collection was available, the methods for representing and indexing its content were selected. We decided to experiment with an up-to-date solution based on state-of-the-art "semantic" indexing methods using continuous bag of words <ref type="bibr" target="#b3">[4]</ref>. With this approach users have complete freedom to formulate natural language queries or keyword queries, and documents are retrieved if their contents are "semantically" close to those of the query. To this end we experimented with several LLMs publicly available in the Hugging Face library <ref type="bibr" target="#b6">[12]</ref>. All these models represent and manage the "semantics" of the information in a document corpus that has been provided as a training set. It must be pointed out that, in this context, the term "semantics" is improperly used, since LLMs identify regular patterns in texts based on heuristic statistical inference; thus, instead of "semantics", the term "relatedness" would be more appropriate. This way they learn how to predict missing words in a sentence, how to continue a sentence, how to answer a query and, finally, how to retrieve relevant documents in an ad hoc retrieval task activated by a user query. Such "semantic" models are the most effective when a natural language querying interaction is wanted, since they can retrieve documents that do not contain the specific query words but contain synonyms or concepts related to the query concepts. In our context this approach was the most feasible, since we had no thesauri available for expanding the meaning of terms in the documents, the documents being heterogeneous in both theme and genre. 
To this end, we chose pretrained LLMs that have been set up for the ad hoc retrieval task and are based on evolutions of BERT (Bidirectional Encoder Representations from Transformers) <ref type="bibr" target="#b7">[13]</ref> <ref type="bibr" target="#b8">[14]</ref>, Google's state-of-the-art model. BERT uses a transformer architecture <ref type="bibr" target="#b7">[13]</ref>, a deep neural network with self-attention mechanisms that takes the context of words into account when creating their representation as embeddings, i.e., as vectors of continuous numeric values in a latent semantic space. Once the LLMs had been selected, we defined the architecture of the KH by specifying the preprocessing that our corpus of documents should undergo to become a readable input to the selected models. The input documents should be plain text with punctuation marks that allow the identification of single words, i.e., tokens; of sentences, ending with punctuation marks such as a full stop or a semicolon; and of paragraphs, starting with a new line. So, the non-conforming documents, consisting of PDF files, had to be "translated" into text. Furthermore, the processing steps were identified, which implied selecting the implementation libraries and environment needed to code the whole process. We experimented with hybridized techniques: for example, the contents of queries and documents were represented by applying different embedding methods, and documents were ranked using different similarity measures. Finally, we identified the most suitable open software for implementing the indexing, retrieval and classification components of the KH. 
Several open source IR libraries exist; after a review we selected the SentenceTransformer Python framework [15], which makes several Hugging Face pretrained models available for sentence embeddings. We also exploited the Python library NLTK (Natural Language Toolkit [16]) for managing corpus documents and applying different tokenization strategies, i.e., the aforementioned subdivisions of documents into chunks such as words, sentences, paragraphs or n-grams. For our purposes we deemed it meaningful to evaluate different combinations of pretrained LLMs, document representations based on different chunk definitions, and matching functions (either dot product or cosine similarity). Since a document may contain several chunks depending on its length, we experimented with several functions that aggregate the chunks' relevance scores into the overall document relevance score, i.e., the document ranking score. Specifically, we applied a K-NN aggregation function, increasing the number of the most relevant chunks considered and using the fuzzy document cardinality measure <ref type="bibr" target="#b9">[17]</ref> as metric. We selected five pre-trained LLMs based on sentence-transformer architectures, identified in the rest of the paper by the letters (a) to (e).</p></div>
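<div xmlns="http://www.tei-c.org/ns/1.0"><p>The chunk definitions and the two matching functions mentioned in this section can be sketched as follows. This is only an illustration under stated assumptions: plain standard-library Python stands in for NLTK tokenization, short lists of numbers stand in for the embeddings produced by a pretrained model, and all function names are hypothetical.</p><p>
```python
import math
import re


def split_sentences(text):
    """Chunks ending at a full stop, semicolon, question or exclamation mark."""
    pieces = re.findall(r"[^.;!?]+[.;!?]?", text)
    return [p.strip() for p in pieces if p.strip()]


def split_paragraphs(text):
    """Chunks separated by new lines; empty lines are skipped."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def split_windows(text, size):
    """Chunks made of consecutive, non-overlapping windows of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def dot_product(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalised by the vector lengths; 0.0 for a null vector."""
    norm = math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    return dot_product(u, v) / norm if norm > 0 else 0.0
```
</p><p>The two matching functions differ only in the normalisation: the dot product grows with the length of the vectors, while cosine similarity depends on their direction alone, which is one reason why a model can rank documents differently under the two settings.</p></div>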
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Documents classification into topics</head><p>As for the classification of the document corpus into topics, during the use case analysis the topics were first identified by seven keywords found in the UNESCO thesaurus <ref type="bibr">[23]</ref>, an RDF SKOS concept scheme without definitions, as reported in Table <ref type="table">1</ref>.</p><p>We then identified a "definition" of each topic keyword, in the form of a textual abstract, in renowned and authoritative sources, i.e., open domain websites, as reported in Table 1. We enriched the pre-existing thesaurus by adding those definitions to the web of data. The result is available both as linked data and through a SPARQL endpoint <ref type="bibr">[28]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1:</head><p>Topics keywords and sources for their definitions as short abstracts</p><p>After choosing the best performing model, evaluated as explained in the next section, we applied it to classify the whole collection into the topics, using the topics' definitions as queries. This way a document can be assigned to multiple topics to different extents, where the extent is its relevance score with respect to a topic. The fuzzy intersection of the ranked lists yielded by two topics (computed as the minimum of the two scores of each document) is the ranked list of the documents at the crossroads of both topics. A knowledge graph can thus be built in which the nodes are the ranked lists of the single topics and the edges are the ranked lists of documents at the crossroads of pairs of topics.</p></div>
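<div xmlns="http://www.tei-c.org/ns/1.0"><p>The construction of the knowledge graph from the topics' ranked lists can be sketched as below. The sketch assumes that each topic comes with a dictionary mapping document identifiers to relevance scores in the range 0 to 1; the function names and the choice of dropping documents with a zero intersection score are illustrative assumptions, not details taken from the implementation.</p><p>
```python
from itertools import combinations


def ranked(scores):
    """Ranked list of (doc_id, score), best score first, ties by doc_id."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def fuzzy_intersection(scores_a, scores_b):
    """Min-based intersection of two {doc_id: score} maps, as a ranked list."""
    common = {d: min(scores_a[d], scores_b[d]) for d in scores_a if d in scores_b}
    # Assumption: a document with zero membership is left out of the edge.
    positive = {d: s for d, s in common.items() if s > 0}
    return ranked(positive)


def build_topic_graph(topic_scores):
    """Nodes: one ranked list per topic. Edges: ranked lists shared by topic pairs."""
    nodes = {topic: ranked(scores) for topic, scores in topic_scores.items()}
    edges = {}
    for t1, t2 in combinations(sorted(topic_scores), 2):
        shared = fuzzy_intersection(topic_scores[t1], topic_scores[t2])
        if shared:
            edges[(t1, t2)] = shared
    return nodes, edges
```
</p><p>With seven topics this yields at most 21 edges, each carrying the documents that are relevant, to some extent, to both of its topics.</p></div>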
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">User Evaluation Experiment</head><p>We set up an evaluation experiment of the different LLMs by randomly selecting a subset of 50 documents of the collection and engaging 3 users with distinct backgrounds (a physicist, an environmental scientist and a biologist). Each user read these documents, formulated 10-30 queries and, for each query, identified the relevant documents among the 50.</p><p>We evaluated some metrics of retrieval effectiveness.</p><p>For our purposes we deemed it meaningful to compute the mean Average Precision (mAP) <ref type="bibr" target="#b10">[25]</ref> of different combinations of the 5 pretrained LLMs, document representations based on different chunk definitions (sentence, fixed-size window and paragraph), and matching functions (cosine similarity and dot product). The mAP results of the tests are reported in the following tables, which differ in the similarity computation: Table <ref type="table" target="#tab_0">2</ref> corresponds to cosine similarity, while Table <ref type="table" target="#tab_1">3</ref> corresponds to dot product similarity. Besides "#ch", which sets the number of best chunks considered, a second parameter, "avg", is a Boolean that controls whether the document relevance score is defined as the average of the chunks' scores (in which case the parameter appears) or as their sum (no indication of the parameter appears). In more detail: "#ch: N (sum)" indicates that the sum of the N best chunks' scores of each document was computed; "#ch: N (avg)" indicates that their average was computed; N=All means that all the chunks of the document are considered. Since documents generally consist of long texts with many chunks, we also applied an approach in which the document is represented by a single virtual embedding vector computed as the average of the chunks' vectors. 
In this case the mAP results are reported in the column named "Virtual Doc" of the results tables.</p><p>The last column, named "max", reports the best mAP obtained by any of the documents' chunks for the setting given in the row. It can easily be noticed that three distinct models reach the maximum mAP = 0.64 under different settings when cosine similarity between pairs of embedding vectors is used. Nevertheless, the most stable model across input settings (both window and paragraph) and matching definitions is (b) all-MiniLM-L6-v2. Table <ref type="table" target="#tab_1">3</ref> reports the mAP values obtained when the similarity metric is changed to the dot product. In this case the best performing model is (e) msmarco-distilbert-base-tas-b which, when fed with chunks defined by sentences, reaches mAP = 0.65 when the 4 to 6 best chunks' relevance scores are taken into account, using either their sum or their average. We thus select this latter model with the following setting: chunks defined by sentences, 4 to 6 chunks per document considered in the matching, and either the sum or the average of their scores.</p></div>
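<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the evaluated settings concrete, the sketch below shows one possible reading of the aggregation options ("#ch: N (sum)", "#ch: N (avg)", "Virtual Doc") and of the mAP measure. Function and parameter names are hypothetical, and the average precision follows the usual textbook definition, which is assumed rather than taken from the evaluation scripts.</p><p>
```python
def aggregate(chunk_scores, n_best=None, avg=False):
    """Sum (or average, if avg) of the n_best highest chunk scores; None = all."""
    best = sorted(chunk_scores, reverse=True)
    if n_best is not None:
        best = best[:n_best]
    total = sum(best)
    return total / len(best) if avg and best else total


def virtual_doc_vector(chunk_vectors):
    """Component-wise average of the chunk vectors of one document."""
    count = len(chunk_vectors)
    return [sum(component) / count for component in zip(*chunk_vectors)]


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, precision_sum = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevant_sets):
    """Average of the per-query average precision over all queries."""
    values = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(values) / len(values) if values else 0.0
```
</p><p>Under these assumptions, the selected setting corresponds to calling aggregate with n_best between 4 and 6 on the sentence-level scores of each document, ranking the documents by the result, and averaging the average precision of the rankings over all users and queries.</p></div>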
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>The originality of the described experience is manifold. First, LLMs were experimented with to index and retrieve a highly heterogeneous collection of documents, and were comparatively evaluated under different chunk definitions, similarity metrics and, last but not least, strategies for aggregating the chunks' relevance scores into the overall rank of documents. This last aspect is important when the documents are long and consist of many chunks, as in our case.</p><p>A second original contribution is the classification of the documents into "fuzzy" overlapping topics according to a textual description of each topic, which is used as a natural language query to retrieve the ranked list of documents belonging to the topic to a given extent. This approach has been deemed feasible for the implementation of the KH, in order to provide public authorities with a tool that can help them search all the documentation they need for the UNEP-MAP programme.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy. * Corresponding author. † These authors contributed equally. paolo.tagliolatoacquavivadar@cnr.it</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) msmarco-distilbert-cos-v5 [18]: it maps sentences &amp; paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500K (query, answer) pairs from the MS MARCO Passages dataset (Microsoft Machine Reading Comprehension), a large-scale dataset focused on machine reading comprehension, question answering and passage ranking. (b) all-MiniLM-L6-v2 [19]: it maps sentences &amp; paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. (c) msmarco-roberta-base-ance-firstp [20]: this is a port to a sentence-transformers model of the ANCE FirstP model, which uses a training mechanism that selects more realistic negative training instances; it maps sentences &amp; paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. (d) msmarco-bert-base-dot-v5 [21]: it maps sentences &amp; paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500K (query, answer) pairs from the MS MARCO dataset. (e) msmarco-distilbert-base-tas-b [22]: this is a port of the DistilBERT TAS-B model to a sentence-transformers model; it maps sentences &amp; paragraphs to a 768-dimensional dense vector space and is optimized for the task of semantic search.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="4,85.05,494.95,422.96,148.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="5,85.05,108.26,423.55,143.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 :</head><label>1</label><figDesc>Topics keywords and sources for their definitions as short abstracts</figDesc><table><row><cell>Topics keyword</cell><cell>Definitions source</cell></row><row><cell>Climate change</cell><cell>United Nations</cell></row><row><cell>Marine biodiversity</cell><cell>UN</cell></row><row><cell>Sustainability and blue economy</cell><cell>UN</cell></row><row><cell>Pollution</cell><cell>National Geographic</cell></row><row><cell>Marine spatial planning</cell><cell>EU commission</cell></row><row><cell>Fishery and aquaculture</cell><cell>FAO</cell></row><row><cell>Governance</cell><cell>UN Dev. Progr.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2 :</head><label>2</label><figDesc>mAP for different LLMs/chunks and cosine similarity</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3 :</head><label>3</label><figDesc>mAP for different LLMs/chunks and dot-product similarity. The first column is the pretrained model used (indicated by the letter used in section 2.2). The second column indicates the chunk type used (sentence, window/n-gram or paragraph); then the size of the input to the model is reported. The other columns report the mAP averaged over all users and all queries for different aggregation functions of the chunks' relevance scores. Several column names represent the parameters passed to the aggregation function.</figDesc><table /><note>"#ch: &lt;number&gt;" is the parameter controlling the number of the best chunks considered for computing the document ranking score. When &lt;number&gt;=All, all chunks are taken into account.</note></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>The work has been carried out within the UNEP-MAP Program of Work 2022-2023 in the framework of the activity of the Information and Communication Regional Activity Centre (INFO/RAC).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Report 2 -Semantic Information Retrieval -Knowledge Hub</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bordogna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tagliolato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Minelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Oggioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Babbini</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.10260195</idno>
		<ptr target="https://doi.org/10.5281/zenodo.10260195" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Survey on supervised machine learning techniques for automatic text classification</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">I</forename><surname>Kadhim</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10462-018-09677-1</idno>
		<ptr target="https://doi.org/10.1007/s10462-018-09677-1" />
	</analytic>
	<monogr>
		<title level="j">Artif Intell Rev</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="273" to="292" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">An Introduction to Information Retrieval</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<ptr target="https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf" />
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Cambridge UP</pubPlace>
		</imprint>
	</monogr>
	<note>Online edition</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.09419</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Fuzzy Set Techniques in Information Retrieval</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Kraft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bordogna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pasi</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.8082923</idno>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="https://www.unep.org/unepmap/resources/publications?/resources" />
		<title level="m">PAP</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<ptr target="https://arxiv.org/pdf/1910.03771.pdf" />
		<title level="m">HuggingFace&apos;s Transformers: State-of-the-art Natural Language Processing</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noam</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Niki</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakob</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Llion</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aidan</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Łukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Illia</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS&apos;17)</title>
				<meeting>the 31st International Conference on Neural Information Processing Systems (NIPS&apos;17)<address><addrLine>Red Hook, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="6000" to="6010" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of NAACL-HLT 2019</title>
				<meeting>NAACL-HLT 2019</meeting>
		<imprint>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">On the fuzzy cardinality of a fuzzy set</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Yager</surname></persName>
		</author>
		<idno type="DOI">10.1080/03081070500422729</idno>
		<ptr target="https://doi.org/10.1080/03081070500422729" />
	</analytic>
	<monogr>
		<title level="j">International Journal of General Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="191" to="206" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">MAP</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Beitzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">C</forename><surname>Jensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Frieder</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-0-387-39940-9_492</idno>
		<ptr target="https://doi.org/10.1007/978-0-387-39940-9_492" />
	</analytic>
	<monogr>
		<title level="m">Encyclopedia of Database Systems</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Özsu</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<pubPlace>Boston, MA</pubPlace>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">A</forename></persName>
		</author>
		<ptr target="https://github.com/INFO-RAC/KMP-library-scraping" />
		<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
