=Paper=
{{Paper
|id=Vol-3762/533
|storemode=property
|title=Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities
|pdfUrl=https://ceur-ws.org/Vol-3762/533.pdf
|volume=Vol-3762
|authors=Paolo Tagliolato Acquaviva d'Aragona,Lorenza Babbini,Gloria Bordogna,Alessandro Lotti,Annalisa Minelli,Alessandro Oggioni
|dblpUrl=https://dblp.org/rec/conf/ital-ia/dAragonaBBLMO24
}}
==Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities==
Paolo Tagliolato Acquaviva D’Aragona 1,∗,†, Lorenza Babbini 2,†, Gloria Bordogna 1,†, Alessandro Lotti 2,†, Annalisa Minelli 2,† and Alessandro Oggioni 1,†
1 CNR – IREA, via Corti 12, Milano, 20133, Italy
2 INFO/RAC UNEP-MAP c/o ISPRA, DG-SINA, via Vitaliano Brancati 48, Roma, 00144, Italy
Abstract
This contribution outlines the design of a Knowledge Hub of heterogeneous documents related to the Mediterranean Action Plan (UNEP-MAP) of the United Nations Environment Programme [1]. The Knowledge Hub is intended to serve as a resource that assists public authorities and users with different backgrounds and needs in accessing information efficiently. Users can either formulate natural language queries or navigate an automatically generated knowledge graph to find relevant documents. The Knowledge Hub is designed on top of state-of-the-art Large Language Models (LLMs). A user-evaluation experiment was conducted, testing publicly available models on a subset of documents under distinct LLM settings. This step was aimed at identifying the best-performing model, to be used subsequently to classify the documents with respect to the topics of interest.
Keywords
Knowledge Hub, Large Language Models, Natural Language Queries, Knowledge Graph
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
paolo.tagliolatoacquavivadar@cnr.it (P. Tagliolato)
ORCID: 0000-0002-0261-313X (P. Tagliolato); 0000-0003-3302-6891 (L. Babbini); 0000-0002-6775-753X (G. Bordogna); 0000-0002-4837-4357 (A. Lotti); 0000-0003-1772-0154 (A. Minelli); 0000-0002-7997-219X (A. Oggioni)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

This contribution reports the feasibility study carried out for the design of a Knowledge Hub (KH) for accessing documents. The KH is part of the Knowledge Management Platform (KMaP), the platform constituting the unique access point to all the knowledge heritage of the United Nations Environment Programme for the Mediterranean Action Plan (UNEP-MAP) [1].

The KH is conceived as an access point to highly heterogeneous multimedia documents distributed on the Web across the UNEP-MAP network, concerning marine studies, political and economic directives, environmental studies and, in general, UNEP-MAP protocols and activities. Given the nature of the contents dealt with in the documents, the hub constitutes a knowledge base for the stakeholders of the Mediterranean Action Plan: the interested public authorities comprise users with different background knowledge and needs, including politicians, administrators, environmental scientists, project leaders and citizens, who need to search as well as navigate the distributed archive.

During the use case analysis, carried out by interviewing some potential stakeholders, it was deemed important that the KH support users in performing searches by formulating queries in natural language, and guide them in navigating the collection by providing a view of the documents organized into topics of interest [2].

To this aim, some critical aspects had to be considered in order to provide feasible solutions: the document collection is highly heterogeneous in genre, some documents being minutes of meetings while
others being scientific reports; with highly variable lengths, some documents being one page long while others being reports of hundreds of pages; in different languages; and with varying formats (mostly PDF, others HTML and JPG). Finally, the identification of the topics made during the use case analysis revealed that it is not easy to tell apart which documents belong to a topic, some of them lying at the crossroads of several topics.

The approach that we deemed flexible enough for enabling natural language searches was an Information Retrieval system [3] defined on top of Large Language Models (LLMs), and specifically on open source pre-trained LLMs [4].

To aid the organization of the documents into the topics, we then retrieved natural language descriptions of the topics from simple keywords and conceived these as natural language queries to be submitted to the collection, represented in the continuous bag of words space of a pretrained LLM. This way, each document belongs to the topics with a distinct relevance rank. This allowed building a knowledge graph in which each node represents the ranked list of a topic and each edge between a pair of nodes represents the fuzzy intersection of the two ranked lists [5].

A user-evaluation experiment was conducted, testing publicly available LLMs on a subset of documents using distinct settings. This step aimed to identify the best-performing model, to be further used both for implementing the information retrieval module answering natural language queries and for classifying the documents with respect to the topics. The paper reports the steps of the design of the KH and its evaluation experiment for selecting the best model to be applied in the future for the documents’ classification into topics.

2. Knowledge Hub design

The first activity performed was the harvesting of the documents from several potential sources of interest. To this end we relied on the knowledge of a group of experts of the leading institution, ISPRA.

2.1. Harvesting Documents’ Collection

This step was aimed at identifying the document sources, i.e., the web sites and archives with potentially interesting documents, and at carrying out their characterization with respect to some meaningful dimensions [5]. The documents in these information sources number more than 10,000, mainly files, and most of them are in PDF format. Most information sources (20 out of 24) contain documents, and 3 of these resources also share images and tables, while only 3 out of 24 provide geographical layers. As far as the resources are concerned, they are dedicated to 3 themes: law, regulation and management of the sea (13 out of 24); pollution (7); and biodiversity (2). Finally, 21 of the classified repositories are open to the public, while the remaining 3 are private or have restricted access.

From the Regional Marine Pollution Emergency Response Centre for the Mediterranean Sea [6], the Regional Activity Centre for Specially Protected Areas [7], the Regional Activity Centre for Sustainable Consumption and Production [8], the Priority Actions Programme/Regional Activity Centre [9], the UNEP-MAP library [10] and the UNEP library where the author is marked as UNEP-MAP [11], we harvested all the documents through website scraping.

For document harvesting, code has been developed both by CNR-IREA in the R language and by INFO-RAC in Python [26], freely available under the GNU GPL license. To share the files produced by the harvesting process, a GitHub repository was created [27]. The "scraping" folder contains the R and Python scripts developed for scraping; the output of these scripts is in the "results" folder.

2.2. Strategies for enabling document search

Once the collection was available, the methods for the representation and indexing of its content were selected. It was decided to experiment with an up-to-date solution based on state-of-the-art “semantic” indexing methods using continuous bag of words [4]. By this approach users have complete freedom to formulate natural language queries or keyword queries; the documents are retrieved if their contents are “semantically” close to those of the query.

To this end we experimented with several LLMs publicly available in the Hugging Face library [12]. All these models imply the representation and management of the “semantics” of the information in a document corpus provided as training set. It must be pointed out that, in this context, the term “semantics” is improperly used, since LLMs identify regular patterns in texts based on heuristic statistical inference; thus, instead of “semantics”, the term “relatedness” would be more appropriate. This way they learn how to predict missing words in a sentence, how to continue a sentence, how to answer a query and, finally, how to retrieve relevant documents in an ad
hoc retrieval task activated by a user query. Such “semantics” models are the most effective when one wants a natural language querying interaction, since they can retrieve documents which do not contain the specific query words, but synonymous terms or concepts related to the query concepts. In our context this approach was the most feasible, since we did not have thesauri available for expanding the meaning of terms in the documents, the documents being heterogeneous in both theme and genre. To this end, we have chosen pretrained LLMs that have been set up for the ad hoc retrieval task and that are based on evolutions of BERT, Bidirectional Encoder Representations from Transformers [13][14], the Google state-of-the-art model using a transformer architecture [13]: a deep neural network with self-attention mechanisms that allows the context of words to be taken into account when creating their representation as embeddings, i.e., as vectors of continuous numeric values in a latent semantic space.

Once the LLMs had been selected, we defined the architecture of the KH by specifying the preprocessing phase that our corpus of documents should undergo to become readable input to the selected models. The format of the input documents should be simple text with punctuation marks allowing the identification of single words, i.e., tokens; of sentences, ending with punctuation marks like full stop or semicolon; and of paragraphs, starting with a new line. So the non-conforming documents, consisting of PDF files, had to be “translated” into text. Furthermore, the processing steps were identified, which implied the selection of the implementation libraries and environment in order to code the whole process.

We experimented with hybridized techniques: for example, the contents of queries and documents were represented by applying different embedding methods, and likewise the documents were ranked using different similarity measures. Finally, we identified the most suitable open software for the implementation of the components of the KH: indexing, retrieval and classification. Considering that there are a number of open source IR libraries, after a review we selected the SentenceTransformers Python framework [15], which makes several Hugging Face pretrained models available for sentence embeddings, and we also exploited the Python library NLTK (Natural Language Toolkit) [16] for managing corpus documents and different tokenization strategies (i.e., the aforementioned subdivisions of documents into chunks: words, sentences, paragraphs or even n-grams). For our purposes we deemed it meaningful to compute different combinations of pretrained LLMs, document representations based on different chunk definitions, and matching functions, either dot product or cosine similarity. Since documents may contain several chunks depending on their length, we experimented with several aggregation functions of the chunk relevance scores to compute the overall document relevance score, i.e., the document ranking score. Specifically, we applied a K-NN aggregation function, increasing the number of the most relevant chunks considered and using as metric the fuzzy document cardinality measure [17].

We have selected the following pre-trained LLMs based on sentence-transformer architectures:

(a) msmarco-distilbert-cos-v5 [18]: it maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500k (query, answer) pairs from the MS MARCO (Microsoft Machine Reading Comprehension) Passages dataset, a large scale dataset focused on machine reading comprehension, question answering, and passage ranking.
(b) all-MiniLM-L6-v2 [19]: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
(c) msmarco-roberta-base-ance-firstp [20]: a port of the ANCE FirstP model, which uses a training mechanism that selects more realistic negative training instances, to the sentence-transformers framework; it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
(d) msmarco-bert-base-dot-v5 [21]: it maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500K (query, answer) pairs from the MS MARCO dataset.
(e) msmarco-distilbert-base-tas-b [22]: a port of the DistilBERT TAS-B model to the sentence-transformers framework; it maps sentences & paragraphs to a 768-dimensional dense vector space and is optimized for the task of semantic search.
2.3. Documents classification into topics

As far as the classification of the document corpus into the topics is concerned, during the use case analysis the topics were first identified by the seven keywords accounted for in the UNESCO thesaurus [23], an RDF SKOS concept scheme without definitions, as reported in Table 1. Then we identified “definitions” of each topic keyword in renowned and authoritative sources, i.e., open domain websites, in the form of textual abstracts, as reported in Table 1. We then enriched the pre-existing thesaurus by adding those definitions in the web of data. The result is available both as linked data and through a SPARQL endpoint [28].

Table 1:
Topics keywords and sources for their definitions as short abstracts

Topics keyword                  | Definitions source
--------------------------------|-------------------------
Climate change                  | United Nations
Marine biodiversity             | UN
Sustainability and blue economy | UN
Pollution                       | National Geographic
Marine spatial planning         | EU Commission
Fishery and aquaculture         | FAO
Governance                      | UN Development Programme

After choosing the best performing model, evaluated as explained in the next section, we applied it to classify the whole collection into the topics, by considering the topics’ definitions as queries. This way a document can be assigned to multiple topics to different extents, where the extent is the relevance score with respect to a topic. The fuzzy intersection of the pair of ranked lists yielded by two topics (computed by their minimum) is the ranked list of documents at the crossroads of both topics. This way a knowledge graph can be built in which the nodes are the ranked lists of the single topics, while the edges are the ranked lists of documents at the crossroads of pairs of topics.

3. User Evaluation Experiment

We set up an evaluation experiment of the different LLMs by randomly selecting a subset of 50 documents of the collection, engaging 3 users with distinct backgrounds (a physicist, an environmental scientist and a biologist) who read these documents, formulated 10-30 queries each and, for each query, identified the list of its relevant documents among the 50. We evaluated some metrics of retrieval effectiveness. For our purposes we deemed it meaningful to compute the mean Average Precision (mAP) [25] of different combinations of the 5 pretrained LLMs, document representations based on different chunk definitions, i.e., sentence, fixed window size and paragraph, and matching functions (cosine similarity and dot product). The mAP results for the tests are reported in the following tables. They differ in the computation of similarity: Table 2 corresponds to cosine similarity, while Table 3 to dot product similarity.

Table 2:
mAP for different LLMs/chunks and cosine similarity

Table 3:
mAP for different LLMs/chunks and dot-product similarity
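The mAP reported in Tables 2 and 3 can be computed from the ranked lists and the users’ relevance judgments in the standard way [25]. A generic sketch follows (the toy ranked lists and judgments are made up, not the experiment’s data):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k over the ranks k holding a relevant document."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """mAP: AP averaged over all (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: two queries over a 5-document collection.
runs = [
    (["d1", "d3", "d2", "d5", "d4"], {"d1", "d2"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d2", "d5", "d1", "d3"], {"d2"}),        # AP = 1/2
]
print(round(mean_average_precision(runs), 3))  # → 0.667
```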
The first column is the pretrained model used (indicated by the letter used in Section 2.2). The second column indicates the chunk type used: sentence, window/n-gram or paragraph; then the size of the input to the model is reported. The other columns report the mAP averaged over all users and all queries, considering different aggregation functions of the chunk relevance scores. Several column names represent the parameters passed to the aggregation function. “#ch:” is the parameter controlling the number of the best chunks considered for computing the document ranking score; when =All, all chunks are taken into account. The second parameter, “avg”, is a Boolean controlling whether the relevance score is defined as the average of the chunks’ scores (in that case the parameter is shown) or as their sum (the parameter does not appear). More in detail:

“#ch: N (sum)” indicates that the sum of the first N best chunks’ scores of each document was computed;
“#ch: N (avg)” indicates that the average of the first N best chunks’ scores of each document was computed;
when N=All, all the chunks in the documents are considered.

Since documents generally consist of long texts with many chunks, we also applied an approach in which the document is represented by a single virtual embedding vector computed as the average of the chunks’ vectors. In this case the mAP results are reported in the column named “Virtual Doc” of Tables 2 and 3. The last column, named “max”, reports the best mAP obtained by any of the document’s chunks for the given setting in the row.

It can easily be noticed that three distinct models produce the maximum mAP = 0.64 for different settings using cosine similarity between pairs of embedding vectors. Nevertheless, the most stable model under different input settings (both window and paragraph) and different matching definitions is (b) all-MiniLM-L6-v2. Table 3 reports the mAP values when changing the similarity metric to the dot product. In this case the best performing model is (e) msmarco-distilbert-base-tas-b which, when fed with chunks defined by sentences, reaches mAP = 0.65 when taking into account from 4 to 6 best chunks’ relevance scores, using either their sum or their average. We thus selected this latter model with the settings: chunks = sentences, number of chunks per document to consider in the matching from 4 to 6, and either sum of scores or their average.

4. Conclusions

The originality of the described experience is manifold: first of all, the experimentation of LLMs to index and retrieve a highly heterogeneous collection of documents, and their comparative evaluation considering different chunk definitions, similarity metrics and, last but not least, different aggregation strategies of the chunk relevance scores to compute the overall rank of documents. This last aspect is important when the documents are long, consisting of many chunks, as in our case.

A second original contribution is the classification of the documents into “fuzzy” overlapping topics, according to a textual description of each topic, which is used as a natural language query to retrieve the ranked list of documents belonging to the topic to a given extent. This approach has been deemed feasible to be applied for the implementation of the KH, in order to provide public authorities with a tool that can aid them in searching all the documentation they need for the UNEP-MAP program.
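The fuzzy topic classification summarized above, where each topic is a ranked list of documents with relevance degrees and each knowledge-graph edge is the minimum-based fuzzy intersection of two topics [5], can be sketched as follows (the topic names come from Table 1, while the membership degrees are made up):

```python
# Each topic maps document ids to a normalized relevance (membership) degree,
# e.g., the ranking score obtained by using the topic's definition as a query.
topics = {
    "Pollution":           {"d1": 0.9, "d2": 0.4, "d3": 0.7},
    "Marine biodiversity": {"d1": 0.6, "d3": 0.8, "d4": 0.5},
}

def fuzzy_intersection(a, b):
    """Membership of a document in both topics = minimum of its two memberships."""
    common = set(a) & set(b)
    return {d: min(a[d], b[d]) for d in common}

# Nodes of the knowledge graph are the topics' ranked lists; each edge between a
# pair of topics is the ranked list of documents at the crossroads of both.
edge = fuzzy_intersection(topics["Pollution"], topics["Marine biodiversity"])
print(sorted(edge.items(), key=lambda kv: -kv[1]))  # → [('d3', 0.7), ('d1', 0.6)]
```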
Acknowledgements

The work has been carried out within the UNEP-MAP Programme of Work 2022-2023, in the framework of the activity of the Information and Communication Regional Activity Centre (INFO/RAC).

References

[1] Bordogna, G., Tagliolato, P., Lotti, A., Minelli, A., Oggioni, A., & Babbini, L. (2023). Report 2 - Semantic Information Retrieval – Knowledge Hub. Zenodo. https://doi.org/10.5281/zenodo.10260195
[2] Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif Intell Rev 52, 273–292 (2019). https://doi.org/10.1007/s10462-018-09677-1
[3] Manning, C.D., Raghavan, P., Schütze, H. An Introduction to Information Retrieval. Online edition (c) 2009 Cambridge UP. URL: https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
[4] Zhou, C., Li, Q., Li, C., Yu, J., Liu, Y., Wang, G., Zhang, K., Ji, C., Yan, Q., He, L., et al. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv preprint arXiv:2302.09419, 2023.
[5] Kraft, D.H., Bordogna, G., Pasi, G. Fuzzy Set Techniques in Information Retrieval (1999). DOI: 10.5281/zenodo.8082923
[6] REMPEC - https://www.rempec.org
[7] SPA/RAC - https://www.rac-spa.org
[8] SCP/RAC - http://www.cprac.org
[9] PAP/RAC - https://paprac.org
[10] https://www.unep.org/unepmap/resources/publications?/resources
[11] https://wedocs.unep.org/discover?filtertype=author&filter_relational_operator=equals&filter=UNEP%2FMAP
[12] Wolf, T., Debut, L., Sanh, V., Chaumond, J., et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing. https://arxiv.org/pdf/1910.03771.pdf
[13] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010, 2017.
[14] Devlin, J., Chang, M.W., Lee, K., Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc. of NAACL-HLT 2019, pp. 4171–4186.
[15] https://github.com/UKPLab/sentence-transformers
[16] https://www.nltk.org/
[17] Yager, R.R. On the fuzzy cardinality of a fuzzy set. International Journal of General Systems, 35(2), 191–206 (2006). https://doi.org/10.1080/03081070500422729
[18] https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5
[19] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[20] https://huggingface.co/sentence-transformers/msmarco-roberta-base-ance-firstp
[21] https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5
[22] https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b
[23] http://vocabularies.unesco.org/thesaurus
[24] http://fuseki1.get-it.it/inforac/sparql
[25] Beitzel, S.M., Jensen, E.C., Frieder, O. MAP. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA, 2009. https://doi.org/10.1007/978-0-387-39940-9_492

A. Online Resources

[26] https://github.com/INFO-RAC/KMP-library-scraping
[27] https://github.com/IREA-CNR-MI/inforac_ground_truth
[28] http://rdfdata.get-it.it/inforac/