Design of a Knowledge Hub of Heterogeneous Multisource Documents to support Public Authorities

Paolo Tagliolato Acquaviva D'Aragona (CNR-IREA), Lorenza Babbini (INFO/RAC UNEP-MAP), Gloria Bordogna (CNR-IREA), Alessandro Lotti (INFO/RAC UNEP-MAP), Annalisa Minelli (INFO/RAC UNEP-MAP), Alessandro Oggioni (CNR-IREA)

CNR-IREA, via Corti 12, 20133 Milano, Italy

INFO/RAC UNEP-MAP, c/o ISPRA DG-SINA, via Vitaliano Brancati 48, 00144 Roma, Italy

ISSN 1613-0073

Keywords: Knowledge Hub, Large Language Models, Natural Language Queries, Knowledge graph

This contribution outlines the design of a Knowledge Hub of heterogeneous documents related to the Mediterranean Action Plan (UNEP-MAP) of the United Nations Environment Programme [1]. The Knowledge Hub is intended as a resource that helps public authorities, and users with different backgrounds and needs, to access information efficiently. Users can either formulate natural language queries or navigate an automatically generated knowledge graph to find relevant documents. The Knowledge Hub is designed around state-of-the-art Large Language Models (LLMs). A user-evaluation experiment was conducted, testing publicly available models on a subset of documents under distinct LLM settings. This step aimed to identify the best-performing model, which will then be used to classify the documents with respect to the topics of interest.

Introduction

This contribution reports the feasibility study carried out to design a Knowledge Hub (KH) for accessing documents. The KH is part of the Knowledge Management Platform (KMaP), the single access point to the whole knowledge heritage of the United Nations Environment Programme Mediterranean Action Plan (UNEP-MAP) [1].

The KH is conceived as an access point to highly heterogeneous multimedia documents distributed on the Web across the UNEP-MAP network. The documents concern marine studies, political and economic directives, environmental studies and, more generally, UNEP-MAP protocols and activities. Given the nature of these contents, the hub constitutes a knowledge base for the stakeholders of the Mediterranean Action Plan. The interested public authorities have users with different background knowledge and needs, including politicians, administrators, environmental scientists, project leaders and citizens, who need both to search and to navigate the distributed archive.

During the use case analysis, carried out through interviews with some potential stakeholders, it was deemed important that the KH support users in searching by formulating natural language queries, and guide them in navigating the collection by providing a view of the documents organized into topics of interest [2].

To this aim, several critical aspects had to be considered in order to provide feasible solutions. The document collection is highly heterogeneous in genre, some documents being minutes of meetings and others scientific reports; in length, ranging from one-page documents to hundred-page reports; in language; and in format (mostly PDF, with others in HTML and JPG). Finally, the identification of the topics during the use case analysis revealed that it is not easy to tell which documents belong to a topic, since some of them lie at the crossroads of several topics.

ORCID: 0000-0002-0261-313X (P. Tagliolato); 0000-0003-3302-6891 (L. Babbini); 0000-0002-6775-753X (G. Bordogna); 0000-0002-4837-4357 (A. Lotti); 0000-0003-1772-0154 (A. Minelli); 0000-0002-7997-219X (A. Oggioni)

To enable natural language searches, we adopted as a flexible approach an Information Retrieval system [3] built on Large Language Models (LLMs), and specifically on open source pretrained LLMs [4].

To aid the organization of the documents into topics, we retrieved natural language descriptions of the topics, which were originally identified by simple keywords, and treated these descriptions as natural language queries to be submitted to the collection, represented in the continuous bag-of-words space of a pretrained LLM.

In this way, every document belongs to each topic with a distinct relevance rank. This allowed us to build a knowledge graph in which each node represents the ranked list of a topic and each edge between a pair of nodes represents the fuzzy intersection of the two ranked lists [5].

A user-evaluation experiment was conducted, testing publicly available LLMs on a subset of documents under distinct settings. This step aimed to identify the best-performing model, to be used both for implementing the information retrieval module that answers natural language queries and for classifying documents with respect to the topics. The paper reports the design steps of the KH and the evaluation experiment for selecting the best model to be applied in the future for classifying documents into topics.

Knowledge Hub design

The first activity was the harvesting of documents from several potential sources of interest. To this end we relied on the knowledge of a group of experts from the leading institution, ISPRA.

Harvesting Documents' Collection

This step aimed to identify the document sources, i.e., the websites and archives with potentially interesting documents, and to characterize them with respect to some meaningful dimensions [5]. These information sources hold more than 10000 documents, mainly files, most of them in PDF format. Most information sources (20 out of 24) contain documents, and 3 of these also share images and tables, while only 3 out of 24 provide geographical layers. The resources are dedicated to 3 themes: law, regulation and management of the sea (13 out of 24), pollution (7) and biodiversity (2). Finally, 21 of the classified repositories are open to the public, while the remaining 3 are private or have restricted access.

We harvested all the documents, through website scraping, from the Regional Marine Pollution Emergency Response Centre for the Mediterranean Sea [6], the Regional Activity Centre for Specially Protected Areas [7], the Regional Activity Centre for Sustainable Consumption and Production [8], the Priority Actions Programme/Regional Activity Centre [9], the UNEP-MAP library [10] and the UNEP library, where the author was marked as UNEP-MAP [11]. For document harvesting, code was developed both by CNR-IREA in the R language and by INFO-RAC in the Python language [26], and is freely available under the GNU GPL license. To share the files produced by the harvesting process, a GitHub repository was created [27]. The "scraping" folder contains the R and Python scripts developed for scraping; their output is in the "results" folder.

Strategies for enabling documents search

Once the collection was available, the methods for representing and indexing its content were selected. We decided to experiment with an up-to-date solution based on state-of-the-art "semantic" indexing methods using continuous bag of words [4]. With this approach, users have complete freedom to formulate either natural language queries or keyword queries, and documents are retrieved if their contents are "semantically" close to those of the query. To this end we experimented with several LLMs publicly available in the Hugging Face library [12]. All these models imply the representation and management of the "semantics" of information in the document corpus provided as training set. It must be pointed out that, in this context, the term "semantics" is used improperly, since LLMs identify regular patterns in texts based on heuristic statistical inference; the term "relatedness" would therefore be more appropriate. In this way they learn how to predict missing words in a sentence, how to continue a sentence, how to answer a query and, finally, how to retrieve relevant documents in an ad hoc retrieval task activated by a user query. Such "semantic" models are the most effective when a natural language querying interaction is desired, since they can retrieve documents that do not contain the specific query words but contain synonyms or concepts related to the query concepts. In our context this approach was the most feasible, since we had no thesauri available for expanding the meaning of terms in the documents, the documents being heterogeneous in both theme and genre.
To this end, we chose pretrained LLMs that have been set up for the ad hoc retrieval task and are based on evolutions of BERT (Bidirectional Encoder Representations from Transformers) [13] [14], Google's state-of-the-art model. BERT uses a transformer architecture [13], a deep neural network with self-attention mechanisms that allows the context of words to be taken into account when creating their representation as embeddings, i.e., as vectors of continuous numeric values in a latent semantic space. Once the LLMs had been selected, we defined the architecture of the KH by specifying the preprocessing phase that our corpus of documents must undergo to become readable input for the selected models. The input documents should be plain text with punctuation marks that allow the identification of single words, i.e., tokens; of sentences, ending with punctuation marks such as a full stop or semicolon; and of paragraphs, starting on a new line. The non-conforming documents, consisting of PDF files, therefore had to be "translated" into text. Furthermore, we identified the processing steps, which implied selecting the implementation libraries and environment needed to code the whole process. We experimented with hybridized techniques: for example, the contents of queries and documents were represented by applying different embedding methods, and documents were ranked using different similarity measures. Finally, we identified the most suitable open software for implementing the indexing, retrieval and classification components of the KH.
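
As an illustration of the preprocessing described above, the following minimal Python sketch splits plain text into the three chunk types (paragraphs, sentences and fixed-size token windows). It is not the code used in the study, which relied on NLTK: the function names, the regular expression and the window size of 5 tokens are assumptions made for this example.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs, assuming each paragraph starts on a new line."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def split_sentences(text):
    """Split text into sentences ending with '.', ';', '!' or '?'."""
    parts = re.split(r"(?<=[.;!?])\s+", text.strip())
    return [s for s in parts if s]


def split_windows(text, size=5):
    """Split text into consecutive, non-overlapping windows of `size` tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


# Hypothetical two-paragraph sample text.
sample = (
    "The sea is warming; fish stocks move north. Policy must adapt.\n"
    "A second paragraph follows."
)

paragraphs = split_paragraphs(sample)
sentences = split_sentences(sample)
windows = split_windows(sample, size=5)
```

A regular expression of this kind is only a rough stand-in for a trained sentence tokenizer, since it would also split after abbreviations such as "e.g.".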
Considering that a number of open source IR libraries exist, after a review we selected the SentenceTransformers Python framework [15], which makes several Hugging Face pretrained models available for sentence embeddings. We also exploited the Python library NLTK (Natural Language Toolkit) [16] for managing corpus documents and different tokenization strategies, i.e., the aforementioned subdivisions of documents into chunks such as words, n-grams, sentences or paragraphs. For our purposes we deemed it meaningful to compare different combinations of pretrained LLMs, document representations based on different chunk definitions, and matching functions (either dot product or cosine similarity). Since documents may contain several chunks depending on their length, we experimented with several functions that aggregate the chunks' relevance scores into the overall document relevance score, i.e., the document ranking score. Specifically, we applied a K-NN aggregation function, increasing the number of the most relevant chunks considered and using the fuzzy document cardinality measure [17] as metric. We selected five pretrained LLMs based on sentence-transformer architectures, listed as models (a) to (e).
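
The difference between the two matching functions can be sketched in a few lines of plain Python. The vectors below are toy 3-dimensional values standing in for real sentence embeddings, and the function names are hypothetical; in the actual pipeline the embeddings would come from the selected pretrained models.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product normalized by the two vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def score_chunks(query_vec, chunk_vecs, match=cosine):
    """Return (chunk index, score) pairs sorted by decreasing score."""
    scores = [(i, match(query_vec, c)) for i, c in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed values, not real embeddings).
query = [1.0, 0.0, 0.0]
chunks = [[3.0, 4.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

by_cosine = score_chunks(query, chunks, match=cosine)
by_dot = score_chunks(query, chunks, match=dot)
```

In this toy case the two measures rank the chunks differently: cosine similarity prefers chunk 1, which points in the same direction as the query, while the dot product prefers chunk 0, whose larger norm inflates its score. This is one reason why it is worth evaluating both matching functions for each model.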

Documents classification into topics

For the classification of the document corpus into topics, the topics were first identified during the use case analysis by seven keywords found in the UNESCO thesaurus [23], an RDF SKOS concept scheme without definitions, as reported in Table 1.

We then identified "definitions" of each topic keyword, in the form of textual abstracts, in renowned and authoritative sources, i.e., open domain websites, as reported in Table 1. Finally, we enriched the pre-existing thesaurus by adding those definitions to the web of data. The result is available both as linked data and through a SPARQL endpoint [28].

Table 1: Topics keywords and sources for their definitions as short abstracts

After choosing the best-performing model, evaluated as explained in the next section, we applied it to classify the whole collection into the topics, using the topics' definitions as queries. In this way a document can be assigned to multiple topics to different extents, where the extent is its relevance score with respect to a topic. The fuzzy intersection of the two ranked lists yielded by a pair of topics (computed as the minimum of the two scores) is the ranked list of documents at the crossroads of both topics. A knowledge graph can thus be built in which the nodes are the ranked lists of the single topics and the edges are the ranked lists of documents at the crossroads of pairs of topics.
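
The construction of the knowledge graph described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the KH implementation: the relevance scores are invented values assumed to lie in [0, 1], and the function names and data layout (one dictionary of document scores per topic) are hypothetical.

```python
from itertools import combinations


def ranked(scores):
    """Turn a {document: score} mapping into a list sorted by decreasing score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def fuzzy_intersection(scores_a, scores_b):
    """Fuzzy AND of two topic memberships: the minimum of the two scores."""
    common = scores_a.keys() & scores_b.keys()
    return ranked({doc: min(scores_a[doc], scores_b[doc]) for doc in common})


def build_topic_graph(topic_scores):
    """Nodes hold each topic's ranked list; edges hold pairwise intersections."""
    nodes = {topic: ranked(scores) for topic, scores in topic_scores.items()}
    edges = {
        (a, b): fuzzy_intersection(topic_scores[a], topic_scores[b])
        for a, b in combinations(sorted(topic_scores), 2)
    }
    return nodes, edges


# Invented relevance scores of three documents for two of the topics.
topic_scores = {
    "Pollution": {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1},
    "Governance": {"doc1": 0.3, "doc2": 0.7, "doc3": 0.2},
}

nodes, edges = build_topic_graph(topic_scores)
```

Here "doc2" ranks first on the edge between the two topics because it is the document with the highest relevance to both at once, even though it is not the top document of either topic.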

User Evaluation Experiment

We set up an evaluation experiment of the different LLMs by randomly selecting a subset of 50 documents from the collection and engaging 3 users with distinct backgrounds (a physicist, an environmental scientist and a biologist). The users read these documents, formulated 10-30 queries each and, for each query, identified the list of relevant documents among the 50.

We evaluated some metrics of retrieval effectiveness.

For our purposes we deemed it meaningful to compute the mean Average Precision (mAP) [25] of different combinations of the 5 pretrained LLMs, document representations based on different chunk definitions (sentences, fixed-size windows and paragraphs), and matching functions (cosine similarity and dot product). The mAP results of the tests are reported in Tables 2 and 3, which differ in the computation of similarity: Table 2 corresponds to cosine similarity, Table 3 to dot-product similarity. The second parameter, "avg", is a Boolean controlling whether the relevance score is defined as the average of the chunks' scores (in which case the parameter is shown) or as their sum (no indication of the parameter appears). In more detail, "#ch: N (sum)" indicates that the sum of the N best chunks' scores of each document was computed, and "#ch: N (avg)" indicates that their average was computed; N=All means that all the chunks in the document are considered. Since documents generally consist of long texts with many chunks, we also applied an approach in which the document is represented by a single virtual embedding vector computed as the average of the chunks' vectors. In this case the mAP results are reported in the column named "Virtual Doc" of the results tables.
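
The aggregation settings named above ("#ch: N (sum)", "#ch: N (avg)" and "Virtual Doc") can be illustrated with the short Python sketch below. The function names and sample values are hypothetical, and the handling of documents with fewer than N chunks (averaging over the chunks actually available) is an assumption of this example rather than a detail stated in the text.

```python
def aggregate(chunk_scores, n=None, avg=False):
    """Combine the n best chunk scores of a document (all chunks if n is None)."""
    best = sorted(chunk_scores, reverse=True)
    if n is not None:
        best = best[:n]
    if not best:
        return 0.0
    total = sum(best)
    # With avg=True the sum is divided by the number of chunks actually used.
    return total / len(best) if avg else total


def virtual_document(chunk_vecs):
    """Average the chunk vectors component-wise into one document vector."""
    count = len(chunk_vecs)
    return [sum(component) / count for component in zip(*chunk_vecs)]


# Invented relevance scores of the four chunks of one document.
scores = [0.2, 0.8, 0.5, 0.1]

top2_sum = aggregate(scores, n=2)             # "#ch: 2 (sum)"
top2_avg = aggregate(scores, n=2, avg=True)   # "#ch: 2 (avg)"
all_avg = aggregate(scores, avg=True)         # "#ch: All (avg)"
doc_vec = virtual_document([[1.0, 0.0], [3.0, 2.0]])
```

Summing favours long documents with many moderately relevant chunks, whereas averaging the N best scores makes documents of different lengths more comparable, which is why both variants are worth reporting.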

The last column, named "max", reports the best mAP obtained by any of the documents' chunks for the setting given in the row. It can easily be noticed that, using cosine similarity between pairs of embedding vectors, three distinct models produce the maximum mAP = 0.64 for different settings. Nevertheless, the most stable model under different input settings (both window and paragraph) and different matching definitions is (b) all-MiniLM-L6-v2. Table 3 reports the mAP values obtained when the similarity metric is changed to the dot product. In this case the best-performing model is (e) msmarco-distilbert-base-tas-b which, when fed with chunks defined by sentences, reaches mAP = 0.65 when the 4 to 6 best chunks' relevance scores are taken into account, using either their sum or their average. We thus select this latter model with the following setting: chunks defined by sentences, 4 to 6 chunks per document considered in the matching, and either the sum or the average of the scores.
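
For readers unfamiliar with the metric, the mAP values discussed above follow the standard definition, which can be sketched as below. The rankings and relevance judgments are invented for illustration, and the function names are hypothetical; this is not the evaluation script used in the experiment.

```python
def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant document occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two invented queries with their system rankings and user judgments.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1", "d3", "d4"], {"d1"}),        # AP = 1/2
]

map_score = mean_average_precision(runs)
```

In the experiment the average would be taken over all queries of all three users, with the relevant set of each query given by the documents that the user marked as relevant among the 50.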

Conclusions

The originality of the described experience is manifold. First, LLMs were applied to index and retrieve a highly heterogeneous collection of documents, and they were comparatively evaluated considering different chunk definitions, similarity metrics and, last but not least, different strategies for aggregating the chunks' relevance scores into the overall rank of documents. This last aspect is important when documents are long and consist of many chunks, as in our case.

A second original contribution is the classification of the documents into "fuzzy" overlapping topics, according to a textual description of each topic, which is used as a natural language query to retrieve the ranked list of documents belonging to the topic to a given extent. This approach has been deemed feasible for the implementation of the KH, in order to provide public authorities with a tool that helps them search all the documentation they need for the UNEP-MAP program.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author. † These authors contributed equally.
paolo.tagliolatoacquavivadar@cnr.it (P. Tagliolato)

(a) msmarco-distilbert-cos-v5 [18]: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500k (query, answer) pairs from the MS MARCO (Microsoft Machine Reading Comprehension) Passages dataset, a large-scale dataset focused on machine reading comprehension, question answering and passage ranking.
(b) all-MiniLM-L6-v2 [19]: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
(c) msmarco-roberta-base-ance-firstp [20]: it is a port to sentence-transformers of the ANCE FirstP model, which uses a training mechanism that selects more realistic negative training instances. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
(d) msmarco-bert-base-dot-v5 [21]: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500k (query, answer) pairs from the MS MARCO dataset.
(e) msmarco-distilbert-base-tas-b [22]: it is a port to sentence-transformers of the DistilBert TAS-B model. It maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for the task of semantic search.

Table 2: mAP for different LLMs/chunks and cosine similarity

Contents of Table 1:

Topics keyword                     Definitions source
Climate change                     United Nations
Marine biodiversity                UN
Sustainability and blue economy    UN
Pollution                          National Geographic
Marine spatial planning            EU commission
Fishery and aquaculture            FAO
Governance                         UN Dev. Progr.

Table 3: mAP for different LLMs/chunks and dot-product

The first column is the pretrained model used (indicated by the letter used in section 2.2). The second column indicates the chunk type used (sentence, window/n-gram or paragraph), followed by the size of the input to the model. The other columns report the mAP averaged over all users and all queries, considering different aggregation functions of the chunks' relevance scores. Several column names represent the parameters passed to the aggregation function: "#ch: <number>" is the parameter controlling the number of best chunks considered for computing the document ranking score. When <number>=All, all chunks are taken into account.

Acknowledgements

The work has been carried out within the UNEP-MAP Program of Work 2022-2023 in the framework of the activity of the Information and Communication Regional Activity Centre (INFO/RAC).

References

G. Bordogna, P. Tagliolato, A. Lotti, A. Minelli, A. Oggioni, L. Babbini, Report 2 - Semantic Information Retrieval - Knowledge Hub, 2023. doi:10.5281/zenodo.10260195.
A. I. Kadhim, Survey on supervised machine learning techniques for automatic text classification, Artif Intell Rev 52 (2019). doi:10.1007/s10462-018-09677-1.
C. D. Manning, P. Raghavan, H. Schütze, An Introduction to Information Retrieval, Online edition, Cambridge UP, 2009.
Q. Zhou, C. Li, J. Li, Y. Yu, G. Liu, K. Wang, C. Zhang, Q. Ji, L. Yan, He, A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT, arXiv preprint arXiv:2302.09419, 2023.
D. H. Kraft, G. Bordogna, G. Pasi, Fuzzy Set Techniques in Information Retrieval, 1999. doi:10.5281/zenodo.8082923.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, HuggingFace's Transformers: State-of-the-art Natural Language Processing.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), Curran Associates Inc., Red Hook, NY, USA, 2017.
J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proc. of NAACL-HLT 2019.
R. R. Yager, On the fuzzy cardinality of a fuzzy set, International Journal of General Systems 35 (2) (2006). doi:10.1080/03081070500422729.
S. M. Beitzel, E. C. Jensen, O. Frieder, MAP, in: L. Liu, M. T. Özsu (Eds.), Encyclopedia of Database Systems, Springer, Boston, MA, 2009. doi:10.1007/978-0-387-39940-9_492.
KMP-library-scraping, GitHub repository. https://github.com/INFO-RAC/KMP-library-scraping