=Paper=
{{Paper
|id=Vol-3762/533
|storemode=property
|title=Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities
|pdfUrl=https://ceur-ws.org/Vol-3762/533.pdf
|volume=Vol-3762
|authors=Paolo Tagliolato Acquaviva d'Aragona,Lorenza Babbini,Gloria Bordogna,Alessandro Lotti,Annalisa Minelli,Alessandro Oggioni
|dblpUrl=https://dblp.org/rec/conf/ital-ia/dAragonaBBLMO24
}}
==Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities==
Paolo Tagliolato Acquaviva D’Aragona 1,∗,†, Lorenza Babbini 2,†, Gloria Bordogna 1,†, Alessandro Lotti 2,†, Annalisa Minelli 2,† and Alessandro Oggioni 1,†
1 CNR – IREA, via Corti 12, Milano, 20133, Italy
2 INFO/RAC UNEP-MAP c/o ISPRA, DG-SINA, via Vitaliano Brancati 48, Roma, 00144, Italy
Abstract
This contribution outlines the design of a Knowledge Hub of heterogeneous documents related to the Mediterranean Action Plan (UNEP-MAP) of the United Nations Environment Programme [1]. The Knowledge Hub is intended to serve as a resource that assists public authorities and users with different backgrounds and needs in accessing information efficiently. Users can either formulate natural language queries or navigate an automatically generated knowledge graph to find relevant documents. The Knowledge Hub is designed on top of state-of-the-art Large Language Models (LLMs). A user-evaluation experiment was conducted, testing publicly available models on a subset of documents under distinct LLM settings. This step was aimed at identifying the best-performing model, to be used subsequently to classify the documents with respect to the topics of interest.
Keywords
Knowledge Hub, Large Language Models, Natural Language Queries, Knowledge Graph
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
paolo.tagliolatoacquavivadar@cnr.it (P. Tagliolato)
ORCID: 0000-0002-0261-313X (P. Tagliolato); 0000-0003-3302-6891 (L. Babbini); 0000-0002-6775-753X (G. Bordogna); 0000-0002-4837-4357 (A. Lotti); 0000-0003-1772-0154 (A. Minelli); 0000-0002-7997-219X (A. Oggioni)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

This contribution reports the feasibility study carried out for the design of a Knowledge Hub (KH) for accessing documents. The KH is part of the Knowledge Management Platform (KMaP), the platform constituting the unique access point to all the knowledge heritage of the United Nations Environment Programme for the Mediterranean Action Plan (UNEP-MAP) [1].

The KH is conceived as an access point to highly heterogeneous multimedia documents distributed on the Web across the UNEP-MAP network, concerning marine studies, political and economic directives, environmental studies and, in general, UNEP-MAP protocols and activities. Given the nature of the contents dealt with in the documents, the hub constitutes a knowledge base for the stakeholders of the Mediterranean Action Plan: the interested public authorities comprise users with different background knowledge and needs, including politicians, administrators, environmental scientists, project leaders and citizens, who need to search as well as navigate the distributed archive.

During the use case analysis, carried out by interviewing some potential stakeholders, it was deemed important that the KH support users in performing searches by formulating queries in natural language, and guide them in navigating the collection by providing a view of the documents organized into topics of interest [2].

To this aim, some critical aspects had to be considered in order to provide feasible solutions: the document collection is highly heterogeneous in genre, some documents being minutes of meetings while
others being scientific reports; with highly variable lengths, some documents being one page long while others being reports of hundreds of pages; in different languages; and with varying formats (mostly PDF, others HTML and JPG). Finally, the identification of the topics made during the use case analysis revealed that it is not easy to tell apart which documents belong to a topic, some of them lying at the crossroads of several topics.

The approach that we deemed flexible enough for enabling natural language searches was an Information Retrieval system [3] defined on top of Large Language Models (LLMs), and specifically on open source pre-trained LLMs [4].

To aid the organization of the documents into the topics, we then retrieved natural language descriptions of the topics from simple keywords and conceived these as natural language queries to be submitted to the collection, represented in the continuous bag of words space of a pretrained LLM. This way, each document belongs to the topics with a distinct relevance rank. This allowed building a knowledge graph in which each node represents the ranked list of a topic and each edge between a pair of nodes represents the fuzzy intersection of the two ranked lists [5].

A user-evaluation experiment was conducted, testing publicly available LLMs on a subset of documents using distinct settings. This step aimed to identify the best-performing model, to be further used both for implementing the information retrieval module answering natural language queries and for classifying the documents with respect to the topics. The paper reports the steps of the design of the KH and its evaluation experiment for selecting the best model to be applied in the future for the documents’ classification into topics.

2. Knowledge Hub design

The first activity performed was the harvesting of the documents from several potential sources of interest. To this end we relied on the knowledge of a group of experts of the leading institution, ISPRA.

2.1. Harvesting Documents’ Collection

This step was aimed at identifying the document sources, i.e., the web sites and archives with potentially interesting documents, and at carrying out their characterization with respect to some meaningful dimensions [5]. The documents in these information sources number more than 10,000, mainly files, and most of them are in PDF format. Most information sources (20 out of 24) contain documents, and 3 of these resources also share images and tables, while only 3 out of 24 provide geographical layers. As far as the resources are concerned, they are dedicated to 3 themes: law, regulation and management of the sea (13 out of 24); pollution (7); and biodiversity (2). Finally, 21 of the classified repositories are open to the public, while the remaining 3 are private or have restricted access.

From the Regional Marine Pollution Emergency Response Centre for the Mediterranean Sea [6], the Regional Activity Centre for Specially Protected Areas [7], the Regional Activity Centre for Sustainable Consumption and Production [8], the Priority Actions Programme/Regional Activity Centre [9], the UNEP-MAP library [10] and the UNEP library where the author is marked as UNEP-MAP [11], we harvested all the documents through website scraping.

For document harvesting, code has been developed both by CNR-IREA in the R language and by INFO-RAC in Python [26], freely available under the GNU GPL license. To share the files produced by the harvesting process, a GitHub repository was created [27]. The "scraping" folder contains the R and Python scripts developed for scraping; the output of these scripts is in the "results" folder.

2.2. Strategies for enabling document search

Once the collection was available, the methods for the representation and indexing of its content were selected. It was decided to experiment with an up-to-date solution based on state-of-the-art “semantic” indexing methods using continuous bag of words [4]. By this approach users have complete freedom to formulate natural language queries or keyword queries; the documents are retrieved if their contents are “semantically” close to those of the query.

To this end we experimented with several LLMs publicly available in the Hugging Face library [12]. All these models imply the representation and management of the “semantics” of the information in a document corpus provided as training set. It must be pointed out that, in this context, the term “semantics” is improperly used, since LLMs identify regular patterns in texts based on heuristic statistical inference; thus, instead of “semantics”, the term “relatedness” would be more appropriate. This way they learn how to predict missing words in a sentence, how to continue a sentence, how to answer a query and, finally, how to retrieve relevant documents in an ad
hoc retrieval task activated by a user query. Such “semantics” models are the most effective when one wants a natural language querying interaction, since they can retrieve documents which do not contain the specific query words, but synonymous terms or concepts related to the query concepts. In our context this approach was the most feasible, since we did not have thesauri available for expanding the meaning of terms in the documents, the documents being heterogeneous in both theme and genre. To this end, we have chosen pretrained LLMs that have been set up for the ad hoc retrieval task and that are based on evolutions of BERT, Bidirectional Encoder Representations from Transformers [13][14], the Google state-of-the-art model using a transformer architecture [13]: a deep neural network with self-attention mechanisms that allows the context of words to be taken into account when creating their representation as embeddings, i.e., as vectors of continuous numeric values in a latent semantic space.

Once the LLMs had been selected, we defined the architecture of the KH by specifying the preprocessing phase that our corpus of documents should undergo to become readable input to the selected models. The format of the input documents should be simple text with punctuation marks allowing the identification of single words, i.e., tokens; of sentences, ending with punctuation marks like full stop or semicolon; and of paragraphs, starting with a new line. So the non-conforming documents, consisting of PDF files, had to be “translated” into text. Furthermore, the processing steps were identified, which implied the selection of the implementation libraries and environment in order to code the whole process.

We experimented with hybridized techniques: for example, the contents of queries and documents were represented by applying different embedding methods, and likewise the documents were ranked using different similarity measures. Finally, we identified the most suitable open software for the implementation of the components of the KH: indexing, retrieval and classification. Considering that there are a number of open source IR libraries, after a review we selected the SentenceTransformers Python framework [15], which makes several Hugging Face pretrained models available for sentence embeddings, and we also exploited the Python library NLTK (Natural Language Toolkit) [16] for managing corpus documents and different tokenization strategies (i.e., the aforementioned subdivisions of documents into chunks: words, sentences, paragraphs or even n-grams). For our purposes we deemed it meaningful to compute different combinations of pretrained LLMs, document representations based on different chunk definitions, and matching functions, either dot product or cosine similarity. Since documents may contain several chunks depending on their length, we experimented with several aggregation functions of the chunk relevance scores to compute the overall document relevance score, i.e., the document ranking score. Specifically, we applied a K-NN aggregation function, increasing the number of the most relevant chunks considered and using as metric the fuzzy document cardinality measure [17].

We have selected the following pre-trained LLMs based on sentence-transformer architectures:

(a) msmarco-distilbert-cos-v5 [18]: it maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500k (query, answer) pairs from the MS MARCO (Microsoft Machine Reading Comprehension) Passages dataset, a large scale dataset focused on machine reading comprehension, question answering, and passage ranking.
(b) all-MiniLM-L6-v2 [19]: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
(c) msmarco-roberta-base-ance-firstp [20]: a port of the ANCE FirstP model, which uses a training mechanism that selects more realistic negative training instances, to the sentence-transformers framework; it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
(d) msmarco-bert-base-dot-v5 [21]: it maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500K (query, answer) pairs from the MS MARCO dataset.
(e) msmarco-distilbert-base-tas-b [22]: a port of the DistilBERT TAS-B model to the sentence-transformers framework; it maps sentences & paragraphs to a 768-dimensional dense vector space and is optimized for the task of semantic search.
2.3. Documents classification into topics

As far as the classification of the document corpus into the topics is concerned, during the use case analysis the topics were first identified by the seven keywords accounted for in the UNESCO thesaurus [23], an RDF SKOS concept scheme without definitions, as reported in Table 1. Then we identified “definitions” of each topic keyword in renowned and authoritative sources, i.e., open domain websites, in the form of textual abstracts, as reported in Table 1. We then enriched the pre-existing thesaurus by adding those definitions in the web of data. The result is available both as linked data and through a SPARQL endpoint [28].

Table 1:
Topics keywords and sources for their definitions as short abstracts

Topics keyword                  | Definitions source
--------------------------------|-------------------------
Climate change                  | United Nations
Marine biodiversity             | UN
Sustainability and blue economy | UN
Pollution                       | National Geographic
Marine spatial planning         | EU Commission
Fishery and aquaculture         | FAO
Governance                      | UN Development Programme

After choosing the best performing model, evaluated as explained in the next section, we applied it to classify the whole collection into the topics, by considering the topics’ definitions as queries. This way a document can be assigned to multiple topics to different extents, where the extent is the relevance score with respect to a topic. The fuzzy intersection of the pair of ranked lists yielded by two topics (computed by their minimum) is the ranked list of documents at the crossroads of both topics. This way a knowledge graph can be built in which the nodes are the ranked lists of the single topics, while the edges are the ranked lists of documents at the crossroads of pairs of topics.

3. User Evaluation Experiment

We set up an evaluation experiment of the different LLMs by randomly selecting a subset of 50 documents of the collection, engaging 3 users with distinct backgrounds (a physicist, an environmental scientist and a biologist) who read these documents, formulated 10-30 queries each and, for each query, identified the list of its relevant documents among the 50. We evaluated some metrics of retrieval effectiveness. For our purposes we deemed it meaningful to compute the mean Average Precision (mAP) [25] of different combinations of the 5 pretrained LLMs, document representations based on different chunk definitions, i.e., sentence, fixed window size and paragraph, and matching functions (cosine similarity and dot product). The mAP results for the tests are reported in the following tables. They differ in the computation of similarity: Table 2 corresponds to cosine similarity, while Table 3 to dot product similarity.

Table 2:
mAP for different LLMs/chunks and cosine similarity

Table 3:
mAP for different LLMs/chunks and dot-product similarity
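The mAP reported in Tables 2 and 3 can be computed from the ranked lists and the users’ relevance judgments in the standard way [25]. A generic sketch follows (the toy ranked lists and judgments are made up, not the experiment’s data):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k over the ranks k holding a relevant document."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """mAP: AP averaged over all (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: two queries over a 5-document collection.
runs = [
    (["d1", "d3", "d2", "d5", "d4"], {"d1", "d2"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d2", "d5", "d1", "d3"], {"d2"}),        # AP = 1/2
]
print(round(mean_average_precision(runs), 3))  # → 0.667
```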
The first column is the pretrained model used (indicated by the letter used in Section 2.2). The second column indicates the chunk type used: sentence, window/n-gram or paragraph; then the size of the input to the model is reported. The other columns report the mAP averaged over all users and all queries, considering different aggregation functions of the chunk relevance scores. Several column names represent the parameters passed to the aggregation function. “#ch:” is the parameter controlling the number of the best chunks considered for computing the document ranking score; when =All, all chunks are taken into account. The second parameter, “avg”, is a Boolean controlling whether the relevance score is defined as the average of the chunks’ scores (in that case the parameter is shown) or as their sum (the parameter does not appear). More in detail:

“#ch: N (sum)” indicates that the sum of the first N best chunks’ scores of each document was computed;
“#ch: N (avg)” indicates that the average of the first N best chunks’ scores of each document was computed;
when N=All, all the chunks in the documents are considered.

Since documents generally consist of long texts with many chunks, we also applied an approach in which the document is represented by a single virtual embedding vector computed as the average of the chunks’ vectors. In this case the mAP results are reported in the column named “Virtual Doc” of Tables 2 and 3. The last column, named “max”, reports the best mAP obtained by any of the document’s chunks for the given setting in the row.

It can easily be noticed that three distinct models produce the maximum mAP = 0.64 for different settings using cosine similarity between pairs of embedding vectors. Nevertheless, the most stable model under different input settings (both window and paragraph) and different matching definitions is (b) all-MiniLM-L6-v2. Table 3 reports the mAP values when changing the similarity metric to the dot product. In this case the best performing model is (e) msmarco-distilbert-base-tas-b which, when fed with chunks defined by sentences, reaches mAP = 0.65 when taking into account from 4 to 6 best chunks’ relevance scores, using either their sum or their average. We thus selected this latter model with the settings: chunks = sentences, number of chunks per document to consider in the matching from 4 to 6, and either sum of scores or their average.

4. Conclusions

The originality of the described experience is manifold: first of all, the experimentation of LLMs to index and retrieve a highly heterogeneous collection of documents, and their comparative evaluation considering different chunk definitions, similarity metrics and, last but not least, different aggregation strategies of the chunk relevance scores to compute the overall rank of documents. This last aspect is important when the documents are long, consisting of many chunks, as in our case.

A second original contribution is the classification of the documents into “fuzzy” overlapping topics, according to a textual description of each topic, which is used as a natural language query to retrieve the ranked list of documents belonging to the topic to a given extent. This approach has been deemed feasible to be applied for the implementation of the KH, in order to provide public authorities with a tool that can aid them in searching all the documentation they need for the UNEP-MAP program.
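The fuzzy topic classification summarized above, where each topic is a ranked list of documents with relevance degrees and each knowledge-graph edge is the minimum-based fuzzy intersection of two topics [5], can be sketched as follows (the topic names come from Table 1, while the membership degrees are made up):

```python
# Each topic maps document ids to a normalized relevance (membership) degree,
# e.g., the ranking score obtained by using the topic's definition as a query.
topics = {
    "Pollution":           {"d1": 0.9, "d2": 0.4, "d3": 0.7},
    "Marine biodiversity": {"d1": 0.6, "d3": 0.8, "d4": 0.5},
}

def fuzzy_intersection(a, b):
    """Membership of a document in both topics = minimum of its two memberships."""
    common = set(a) & set(b)
    return {d: min(a[d], b[d]) for d in common}

# Nodes of the knowledge graph are the topics' ranked lists; each edge between a
# pair of topics is the ranked list of documents at the crossroads of both.
edge = fuzzy_intersection(topics["Pollution"], topics["Marine biodiversity"])
print(sorted(edge.items(), key=lambda kv: -kv[1]))  # → [('d3', 0.7), ('d1', 0.6)]
```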
Acknowledgements

The work has been carried out within the UNEP-MAP Programme of Work 2022-2023, in the framework of the activity of the Information and Communication Regional Activity Centre (INFO/RAC).

References

[1] Bordogna, G., Tagliolato, P., Lotti, A., Minelli, A., Oggioni, A., & Babbini, L. (2023). Report 2 - Semantic Information Retrieval – Knowledge Hub. Zenodo. https://doi.org/10.5281/zenodo.10260195
[2] Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif Intell Rev 52, 273–292 (2019). https://doi.org/10.1007/s10462-018-09677-1
[3] Manning, C.D., Raghavan, P., Schütze, H. An Introduction to Information Retrieval. Online edition (c) 2009 Cambridge UP. URL: https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
[4] Zhou, C., Li, Q., Li, C., Yu, J., Liu, Y., Wang, G., Zhang, K., Ji, C., Yan, Q., He, L., et al. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv preprint arXiv:2302.09419, 2023.
[5] Kraft, D.H., Bordogna, G., Pasi, G. Fuzzy Set Techniques in Information Retrieval (1999). DOI: 10.5281/zenodo.8082923
[6] REMPEC - https://www.rempec.org
[7] SPA/RAC - https://www.rac-spa.org
[8] SCP/RAC - http://www.cprac.org
[9] PAP/RAC - https://paprac.org
[10] https://www.unep.org/unepmap/resources/publications?/resources
[11] https://wedocs.unep.org/discover?filtertype=author&filter_relational_operator=equals&filter=UNEP%2FMAP
[12] Wolf, T., Debut, L., Sanh, V., Chaumond, J., et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing. https://arxiv.org/pdf/1910.03771.pdf
[13] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010, 2017.
[14] Devlin, J., Chang, M.W., Lee, K., Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc. of NAACL-HLT 2019, pp. 4171–4186.
[15] https://github.com/UKPLab/sentence-transformers
[16] https://www.nltk.org/
[17] Yager, R.R. On the fuzzy cardinality of a fuzzy set. International Journal of General Systems, 35(2), 191–206 (2006). https://doi.org/10.1080/03081070500422729
[18] https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5
[19] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[20] https://huggingface.co/sentence-transformers/msmarco-roberta-base-ance-firstp
[21] https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5
[22] https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b
[23] http://vocabularies.unesco.org/thesaurus
[24] http://fuseki1.get-it.it/inforac/sparql
[25] Beitzel, S.M., Jensen, E.C., Frieder, O. MAP. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA, 2009. https://doi.org/10.1007/978-0-387-39940-9_492

A. Online Resources

[26] https://github.com/INFO-RAC/KMP-library-scraping
[27] https://github.com/IREA-CNR-MI/inforac_ground_truth
[28] http://rdfdata.get-it.it/inforac/