<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Tagliolato Acquaviva D'Aragona</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenza Babbini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gloria Bordogna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lotti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annalisa Minelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Oggioni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR - IREA</institution>
          ,
          <addr-line>via Corti 12, Milano, 20133</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INFO/RAC UNEP-MAP c/o ISPRA, DG-SINA</institution>
          ,
          <addr-line>via Vitaliano Brancati 48, Roma, 00144</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ital-IA 2024: 4th National Conference on Artificial Intelligence</institution>
          ,
          <addr-line>organized by CINI</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This contribution outlines the design of a Knowledge Hub of heterogeneous documents related to the Mediterranean Action Plan (UNEP-MAP) of the United Nations Environment Programme [1]. The Knowledge Hub is intended to serve as a resource that assists public authorities and users with different backgrounds and needs in accessing information efficiently. Users can either formulate natural language queries or navigate an automatically generated knowledge graph to find relevant documents. The Knowledge Hub is designed on top of state-of-the-art Large Language Models (LLMs). A user-evaluation experiment was conducted, testing publicly available models on a subset of documents under distinct LLM settings. This step aimed to identify the best-performing model, to be used subsequently to classify the documents with respect to the topics of interest.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Hub</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Natural Language Queries</kwd>
        <kwd>Knowledge graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This contribution reports the feasibility study
carried out for the design of a Knowledge Hub (KH)
for accessing documents. The KH is part of the
Knowledge Management Platform (KMaP), the single
access point to the whole knowledge heritage of the
United Nations Environment Programme for the
Mediterranean Action Plan (UNEP-MAP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The KH is conceived as an access point to highly
heterogeneous multimedia documents distributed on
the Web across the network of the United Nations
Environment Programme for the Mediterranean Action
Plan, covering marine studies, political and economic
directives, environmental studies and, in general, the
UNEP-MAP protocols and activities. Given the nature
of the contents dealt with in the documents, the hub
constitutes a knowledge base for the stakeholders of
the Mediterranean Action Plan: the interested public
authorities comprise users with different background
knowledge and needs, including politicians,
administrators, environmental scientists, project
leaders and citizens, who need both to search and to
navigate the distributed archive.</p>
      <p>
        During the use case analysis, carried out through
interviews with some potential stakeholders, it was
deemed important that the KH support users in
performing searches by formulating queries in natural
language, and guide them in navigating the
collection by providing a view of the documents
organized into topics of interest [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>To this aim, some critical aspects had to be
considered to provide feasible solutions. The
document collection is highly heterogeneous in genre,
some documents being minutes of meetings while
others are scientific reports; in length, some
documents being one page long while others are
reports of hundreds of pages; and in language and
format (most being PDF, others HTML and JPG).
Finally, the identification of the topics made during
the use case analysis revealed that it is not easy to
tell apart which documents belong to a topic, some of
them lying at the cross-road of several topics.</p>
      <p>0000-0002-0261-313X (P. Tagliolato); 0000-0003-3302-6891
(L. Babbini); 0000-0002-6775-753X (G. Bordogna);
0000-0002-4837-4357 (A. Lotti); 0000-0003-1772-0154 (A. Minelli);
0000-0002-7997-219X (A. Oggioni).
© 2024 Copyright for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        The approach we deemed flexible enough for
enabling natural language searches is an
Information Retrieval system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] built on
Large Language Models (LLMs), and specifically on
open source pre-trained LLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>To aid the organization of the documents into
topics, we then retrieved natural language
descriptions of the topics by means of simple
keywords and treated these descriptions as natural
language queries to be submitted to the collection
represented in the continuous bag-of-words space of
a pretrained LLM.</p>
      <p>
        This way, each document belongs to each topic with
a distinct relevance rank. This allowed us to build a
knowledge graph in which each node represents the
ranked list of documents of a topic and each edge
between a pair of nodes represents the fuzzy
intersection of the two ranked lists [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
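      <p>A minimal Python sketch of this fuzzy-intersection step (with made-up relevance scores, not the project’s data): each topic node holds a ranked list, modeled here as a dict from document ids to relevance scores, and an edge keeps, for each shared document, the minimum of its two scores:</p>

```python
def fuzzy_intersection(topic_a, topic_b):
    # Edge between two topic nodes: for each document appearing in both
    # ranked lists, its membership is the minimum of the two relevance
    # scores; the result is re-ranked by decreasing membership.
    common = set(topic_a).intersection(topic_b)
    edge = {doc: min(topic_a[doc], topic_b[doc]) for doc in common}
    return sorted(edge.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical relevance scores of documents w.r.t. two topics.
pollution = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7}
biodiversity = {"doc2": 0.8, "doc3": 0.3, "doc4": 0.6}
print(fuzzy_intersection(pollution, biodiversity))
# [('doc2', 0.4), ('doc3', 0.3)]
```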
      <p>A user-evaluation experiment was conducted,
testing publicly available LLMs on a subset of
documents under distinct settings. This step aimed to
identify the best-performing model, to be used both
for implementing the information retrieval module
answering natural language queries and for
classifying documents with respect to the topics. The
paper reports the design steps of the KH and the
evaluation experiment for selecting the best model to
be applied in the future for the documents’
classification into topics.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Knowledge Hub design</title>
      <p>
        The first activity performed was the harvesting of
the documents from several potential sources of
interest. To this end we relied on the knowledge of a
group of experts of the leading institution, ISPRA.
      </p>
      <p>2.1. Harvesting Documents’ Collection</p>
      <p>
        This step aimed at identifying the document
sources, i.e., the web sites and archives with
potentially interesting documents, and at carrying out
their characterization with respect to some
meaningful dimensions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        These information sources contain more than
10,000 documents, mainly files, most of them in PDF
format. Most information sources (20 out of 24)
contain documents; 3 of these resources also
share images and tables, while only 3 out of 24
provide geographical layers. As far as the resources
are concerned, they are dedicated to 3 themes: law,
regulation and management of the sea (13 out of 24),
pollution (7) and biodiversity (2). Finally, 21 of the
classified repositories are open to the public, while
the remaining 3 are private or have restricted access.
From the Regional Marine Pollution Emergency Response
Centre for the Mediterranean Sea [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the Regional
Activity Centre for Specially Protected Areas [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
the Regional Activity Centre for Sustainable Consumption
and Production [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the Priority Actions
Programme/Regional Activity Centre [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the UNEP-MAP
library [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the UNEP library, where the author was
marked as UNEP-MAP [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we harvested all the documents through
website scraping.
      </p>
      <p>For document harvesting, code was developed
both by CNR-IREA in the R language and by
INFO-RAC in the Python language [26], freely
available under the GNU GPL license.</p>
      <p>To share the files produced by the harvesting
process, a GitHub repository was created [27]. The
"scraping" folder contains the R and Python scripts
developed for scraping; the output of these scripts is
in the "results" folder.</p>
      <p>2.2. Strategies for enabling documents search</p>
      <p>Once the collection was available, the methods for
representing and indexing its content were
selected.</p>
      <p>
        It was decided to experiment with an up-to-date
solution based on state-of-the-art “semantic” indexing
methods using continuous bag of words [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. With this
approach users have complete freedom to formulate
natural language queries or keyword queries, and
documents are retrieved if their contents are
“semantically” close to those of the query.
      </p>
      <p>
        To this end we experimented with several LLMs
publicly available in the Hugging Face library [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. All these models
imply the representation and management of the
“semantics” of the information in the document corpus
provided as training set. It must be
pointed out that, in this context, the term “semantics”
is improperly used, since LLMs identify regular
patterns in texts based on heuristic statistical
inference; thus, instead of “semantics”, the term
“relatedness” would be more appropriate. This way
they learn how to predict missing words in a sentence,
how to continue a sentence, how to answer a query,
and, finally, how to retrieve relevant documents in an
ad hoc retrieval task activated by a user query. Such
“semantic” models are the most effective when one
wants a natural language querying interaction,
since they can retrieve documents which do not
contain the specific query words, but synonymous
terms or concepts related with the query concepts.
      </p>
      <p>
        In our context this approach was the most feasible,
since we had no thesauri available for expanding
the meaning of terms in the documents, the
documents being heterogeneous in both theme and
genre. To this end, we have chosen pretrained
LLMs that have been set up for the ad hoc retrieval
task and that are based on evolutions of BERT,
Bidirectional Encoder Representations from Transformers [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ][
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
Google’s state-of-the-art model using a
transformer architecture [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a deep neural network
with self-attention mechanisms that allows the
context of words to be taken into account when
creating their representation as embeddings, i.e., as
vectors of continuous numeric values in a latent
semantic space.
      </p>
      <p>
        Once the LLMs had been selected, we defined the
architecture of the KH by specifying the preprocessing
phase that our corpus of documents should undergo
to become a readable input to the selected models.
The input documents should be simple text with
punctuation marks allowing the identification of
single words, i.e., tokens; of sentences, ending with
punctuation marks like full stop or semicolon; and of
paragraphs, starting on a new line. So, the
non-conforming documents consisting of PDF files had
to be “translated” into text. Furthermore, the
processing steps were identified, which implied the
selection of the implementation libraries and
environment in order to code the whole process.
      </p>
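      <p>The chunking step can be illustrated as follows; this is a simplified regex-based stand-in for the NLTK tokenizers actually referenced, shown only to make the sentence and paragraph subdivision concrete:</p>

```python
import re

def sentence_chunks(text):
    # Simplified stand-in for NLTK's sent_tokenize: split where a full
    # stop, semicolon, question or exclamation mark is followed by space.
    parts = re.split(r"[.;!?]\s+", text.strip())
    return [p for p in parts if p]

def paragraph_chunks(text):
    # Paragraphs are assumed to start on a new line.
    return [p.strip() for p in text.split("\n") if p.strip()]

doc = "Marine litter is rising. Plastic dominates; metals are rarer.\nA new paragraph."
print(sentence_chunks(doc))
print(paragraph_chunks(doc))
```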
      <p>We experimented with hybridized techniques: for
example, the contents of queries and documents were
represented by applying different embedding
methods, and the documents were ranked
using different similarity measures.</p>
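      <p>The interplay between embeddings and similarity measures can be sketched as follows; the encoder here is a toy hashed bag-of-words stand-in (a real run would encode with one of the pretrained SentenceTransformers models), while cosine and dot product are the two matching functions actually compared:</p>

```python
import math
import zlib

def toy_embed(text, dim=16):
    # Toy stand-in for a pretrained sentence encoder: a deterministic
    # hashed bag of words, one bucket per token.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0

query = toy_embed("marine pollution")
document = toy_embed("pollution of the marine environment")
print(cosine(query, document), dot(query, document))
```

The same ranking pipeline can then be run with either matching function, which is the comparison reported in the evaluation tables.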
      <p>Finally, we identified the most suitable open
software for implementing the components of the KH:
the indexing, retrieval and classification
components.</p>
      <p>
        Considering that a number of open source IR
libraries exist, after a review we selected the
SentenceTransformers Python framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which
makes several Hugging Face pretrained models
available for sentence embeddings, and we also
exploited the Python library NLTK (Natural Language
Toolkit [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) for managing the corpus documents and
different tokenization strategies (i.e., the
aforementioned subdivisions of documents into
chunks: words, sentences, paragraphs or even
n-grams). For our purposes we deemed it meaningful
to compute different combinations of pretrained
LLMs, document representations based on different
chunk definitions, and matching functions, either dot
product or cosine similarity. Since documents may
contain several chunks depending on their length, we
experimented with several aggregation functions of
the chunk relevance scores to compute the overall
document relevance score, i.e., the document ranking
score. Specifically, we applied a K-NN-like
aggregation function by increasing the number of the
most relevant chunks considered and by using as
metric the fuzzy document cardinality measure [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
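      <p>The aggregation of chunk scores into a single document ranking score can be sketched in a few lines (illustrative scores only); the function mirrors the “#ch: N (sum)” and “#ch: N (avg)” settings used later in the evaluation tables:</p>

```python
def document_score(chunk_scores, n_best=None, average=False):
    # Keep the N best chunk relevance scores (all of them when
    # n_best is None), then either sum or average them.
    ranked = sorted(chunk_scores, reverse=True)
    top = ranked if n_best is None else ranked[:n_best]
    if not top:
        return 0.0
    return sum(top) / len(top) if average else sum(top)

scores = [0.5, 0.125, 0.75, 0.25]  # scores of one document's chunks
print(document_score(scores, n_best=2))                # 0.75 + 0.5 = 1.25
print(document_score(scores, n_best=2, average=True))  # 0.625
print(document_score(scores))                          # sum of all: 1.625
```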
      <p>
        We have selected the following pre-trained LLMs
based on sentence-transformer architectures:
(a) msmarco-distilbert-cos-v5 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]: it maps sentences
and paragraphs to a 768-dimensional dense vector
space and was designed for semantic search. It has
been trained on 500k (query, answer) pairs from
the MS MARCO (Microsoft Machine Reading
Comprehension) Passages dataset, a large-scale
dataset focused on machine reading comprehension,
question answering, and passage ranking.
(b) all-MiniLM-L6-v2 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: it maps sentences and
paragraphs to a 384-dimensional dense vector
space and can be used for tasks like clustering or
semantic search.
(c) msmarco-roberta-base-ance-firstp [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]: this is a
port of the ANCE FirstP model, which uses a
training mechanism that selects more realistic
negative training instances, to the
sentence-transformers model: it maps sentences and
paragraphs to a 768-dimensional dense vector
space and can be used for tasks like clustering or
semantic search.
(d) msmarco-bert-base-dot-v5 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]: it maps sentences
and paragraphs to a 768-dimensional dense vector
space and was designed for semantic search. It has
been trained on 500k (query, answer) pairs from
the MS MARCO dataset.
(e) msmarco-distilbert-base-tas-b [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]: it is a port of
the DistilBERT TAS-B model to the
sentence-transformers model: it maps sentences and
paragraphs to a 768-dimensional dense vector
space and is optimized for the task of semantic
search.
      </p>
      <p>2.3. Documents classification into topics</p>
      <p>
        As for the classification of the document corpus
into topics, during the use case analysis the topics
were first identified by the seven keywords accounted
for in the UNESCO thesaurus [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], an RDF SKOS concept
scheme without definitions, as reported in Table 1.
Then we identified “definitions” of each topic keyword
in renowned and authoritative sources, i.e., open
domain websites, in the form of textual abstracts, as
reported in Table 1. We then enriched the pre-existing
thesaurus by adding those definitions in the web of
data. The result is available both as linked data and
through a SPARQL endpoint [28].
      </p>
      <p>After choosing the best performing model, evaluated
as explained in the next section, we applied it to
classify the whole collection into the topics, by
considering the topics’ definitions as queries. This
way a document can be assigned to multiple topics to
different extents, where the extent is the relevance
score with respect to a topic. The fuzzy intersection
of the pair of ranked lists yielded by two topics
(computed by their minimum) is the ranked list of
documents at the cross-road of both topics.</p>
      <p>This way a knowledge graph can be built in which
the nodes are the ranked lists of the single topics
while the edges are the ranked lists of documents at
the cross-road of pairs of topics.</p>
    </sec>
    <sec id="sec-3">
      <title>3. User Evaluation Experiment</title>
      <p>
        We set up an evaluation experiment of the
different LLMs by randomly selecting a subset of 50
documents of the collection and engaging 3 users with
distinct backgrounds (a physicist, an environmental
scientist and a biologist), who read these documents,
formulated 10-30 queries each and, for each query,
identified the list of relevant documents among the
50. We evaluated some metrics of retrieval
effectiveness. For our purposes we deemed it
meaningful to compute the mean Average Precision
(mAP) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] of different
combinations of the 5 pretrained LLMs, document
representations based on different chunk definitions,
i.e., sentence, fixed window size and paragraph, and
matching functions (cosine similarity and dot
product). The mAP results of the tests are reported in
the following tables, which differ in the similarity
computation: Table 2 corresponds to cosine
similarity, while Table 3 to dot product similarity.
The first column is the pretrained model used
(indicated by the letter used in Section 2.2). The
second column indicates the chunk type used, either
sentence, window/n-gram or paragraph; then the size
of the input to the model is reported. The other
columns report the mAP averaged over all users and
all queries, considering different aggregation
functions of the chunk relevance scores.
      </p>
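      <p>For reference, the mean Average Precision can be computed as in the following sketch (with made-up rankings and relevance judgments, not the experiment’s data):</p>

```python
def average_precision(ranked_docs, relevant):
    # AP of one query: sum of precision-at-k over the ranks k that
    # hold a relevant document, divided by the number of relevant docs.
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

def mean_average_precision(runs):
    # runs: one (system ranking, user-judged relevant docs) pair per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["d1", "d3", "d2"], ["d1", "d2"]),  # AP = (1/1 + 2/3) / 2
        (["d2", "d1"], ["d2"])]              # AP = 1
print(mean_average_precision(runs))
```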
      <p>Several column names represent the parameters
passed to the aggregation function.
“#ch: &lt;number&gt;” is the parameter controlling the
number of the best chunks considered for computing
the document ranking score; when &lt;number&gt;=All,
all chunks are taken into account.
The second parameter, “avg”, is a Boolean controlling
whether the relevance score is defined as the average
of the chunks’ scores (in which case the parameter is
indicated) or as their sum. More in detail:
“#ch: N (sum)” indicates that the sum of the first N
best chunks’ scores of each document was computed;
“#ch: N (avg)” indicates that the average of the first N
best chunks’ scores of each document was computed.
When N=All, all the chunks in the documents are
considered.</p>
      <p>Since documents generally consist of long texts
with many chunks, we also applied an approach in
which the document is represented by a single virtual
embedding vector computed as the average of the
chunks’ vectors. In this case the mAP results are
reported in the column named “Virtual Doc”.</p>
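      <p>The “Virtual Doc” representation can be sketched in a few lines (toy vectors, not real embeddings): the document embedding is simply the component-wise mean of its chunk embeddings:</p>

```python
def virtual_doc_embedding(chunk_vectors):
    # Represent a long document by a single embedding vector:
    # the component-wise average of its chunk embedding vectors.
    dim = len(chunk_vectors[0])
    n = float(len(chunk_vectors))
    return [sum(vec[i] for vec in chunk_vectors) / n for i in range(dim)]

chunks = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]  # two chunk embeddings
print(virtual_doc_embedding(chunks))  # [2.0, 1.0, 1.0]
```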
      <p>The last column named “max” reports the best mAP
obtained by any of the documents’ chunks for the
given setting in the row.</p>
      <p>It can easily be noticed that three distinct models
produce the maximum mAP = 0.64 under different
settings when using cosine similarity between pairs of
embedding vectors. Nevertheless, the most stable
model under different input settings (both window
and paragraph) and different matching definitions is
(b) all-MiniLM-L6-v2.</p>
      <p>Table 3 reports the mAP values when changing the
similarity metric to the dot product. In this case the
best performing model is (e)
msmarco-distilbert-base-tas-b which, when fed with
chunks defined by sentences, reaches mAP = 0.65
when taking into account from 4 to 6 best chunks’
relevance scores, using either their sum or their
average.</p>
      <p>We thus selected this latter model with the setting
chunks=sentences, number of chunks per document
considered in the matching from 4 to 6, and either the
sum of the scores or their average.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>The originality of the described experience is
manifold: first of all, the experimentation of LLMs to
index and retrieve a highly heterogeneous collection
of documents and their comparative evaluation
considering different chunk definitions, similarity
metrics and, last but not least, different aggregation
strategies of the chunk relevance scores to compute
the overall rank of documents. This last aspect is
important when documents are long, consisting of
many chunks, as in our case.</p>
      <p>A second original contribution is the classification
of the documents into “fuzzy” overlapping topics,
according to a textual description of each topic, which
is used as a natural language query to retrieve the
ranked list of documents belonging to the topic to a
given extent. This approach has been deemed feasible
for the implementation of the KH, in order to provide
public authorities with a tool that can aid them in
searching all the documentation they need within the
UNEP-MAP program.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The work has been carried out within the UNEP-MAP
Programme of Work 2022-2023 in the framework of
the activity of the Information and Communication
Regional Activity Centre (INFO/RAC).</p>
      <p>[26] https://github.com/INFO-RAC/KMP-library-scraping</p>
      <p>[27] https://github.com/IREA-CNR-MI/inforac_ground_truth</p>
      <p>[28] http://rdfdata.get-it.it/inforac/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bordogna</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tagliolato</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oggioni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Babbini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2023</year>
          (
          <year>2023</year>
          ). Report 2 - Semantic Information Retrieval - Knowledge Hub. Zenodo. https://doi.org/10.5281/zenodo.10260195
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kadhim</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          <article-title>Survey on supervised machine learning techniques for automatic text classification</article-title>
          .
          <source>Artif Intell Rev</source>
          <volume>52</volume>
          ,
          <fpage>273</fpage>
          -
          <lpage>292</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1007/s10462-018-09677-1
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Manning</surname>
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <article-title>An Introduction to Information Retrieval, Online edition (c) 2009 Cambridge UP</article-title>
          , URL https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          et al.,
          <article-title>A comprehensive survey on pretrained foundation models: A history from bert to chatgpt</article-title>
          ,”
          <source>arXiv preprint arXiv:2302.09419</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kraft</surname>
            ,
            <given-names>D. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordogna</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasi</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Fuzzy Set Techniques in Information Retrieval</article-title>
          . (
          <year>1999</year>
          ). DOI: 10.5281/zenodo.8082923
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] REMPEC - https://www.rempec.org</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] SPA/RAC - https://www.rac-spa.org</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] SCP/RAC - http://www.cprac.org</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] PAP/RAC - https://paprac.org</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] https://www.unep.org/unepmap/resources/publications?/resources</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] https://wedocs.unep.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=UNEP%2FMAP
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Wolf</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanh</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            <given-names>J.</given-names>
          </string-name>
          , et al.,
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          , https://arxiv.org/pdf/1910.03771.pdf
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17)</source>
          . Curran Associates Inc., Red Hook, NY, USA,
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Devlin</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>Proc. of NAACL-HLT</source>
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] https://github.com/UKPLab/sentence-transformers</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] https://www.nltk.org/</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Yager</surname>
            <given-names>R. R.</given-names>
          </string-name>
          ,
          <article-title>On the fuzzy cardinality of a fuzzy set</article-title>
          .
          <source>International Journal of General Systems</source>
          ,
          <volume>35</volume>
          (
          <issue>2</issue>
          ),
          <fpage>191</fpage>
          -
          <lpage>206</lpage>
          , https://doi.org/10.1080/03081070500422729,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] https://huggingface.co/sentence-transformers/msmarco-roberta-base-ance-firstp
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] http://vocabularies.unesco.org/thesaurus</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] http://fuseki1.get-it.it/inforac/sparql</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Beitzel</surname>
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            <given-names>E.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frieder</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <article-title>MAP</article-title>
          . In: Liu, L., Özsu, M.T. (eds)
          <source>Encyclopedia of Database Systems</source>
          . Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_492,
          <year>2009</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>