Symbolic Knowledge Comparison: Metrics and Methodologies for Multi-Agent Systems⋆

Federico Sabbatini1,*, Christel Sirocchi1 and Roberta Calegari2

1 Department of Pure and Applied Sciences, University of Urbino Carlo Bo
2 Department of Computer Science and Engineering (DISI), Alma Mater Studiorum–University of Bologna

Abstract
In multi-agent systems, understanding the similarities and differences in agents’ knowledge is essential for effective decision-making, coordination, and knowledge sharing. Current similarity metrics like cosine similarity, Jaccard similarity, and BERTScore are often too generic for comparing knowledge bases, overlooking critical aspects such as overlapping and fragmented boundaries, and varying domain densities. This paper introduces new similarity metrics specifically designed for comparing knowledge bases represented as symbolic knowledge. Our method compares local explanations of individual instances, preserving computational resources and providing a comprehensive evaluation of knowledge similarity. This approach addresses the limitations of existing metrics, enhancing the functionality and efficiency of multi-agent systems.

Keywords
Multi-agent systems, Knowledge similarity, Symbolic knowledge

1. Introduction

In multi-agent systems, the knowledge owned by each agent is the key to enabling autonomous decision-making [1]. Agents rely on their individual and collective knowledge to navigate and react to complex environments, making the understanding of similarities and differences in their knowledge a fundamental aspect of their interaction. This understanding is pivotal for different applications within the system, including collaborative decision-making, knowledge sharing, and coordination [2].
WOA 2024: 25th Workshop “From Objects to Agents”, July 8–10, 2024, Forte di Bard (AO), Italy
⋆ Original research paper.
* Corresponding author.
Email: f.sabbatini1@campus.uniurb.it (F. Sabbatini); c.sirocchi2@campus.uniurb.it (C. Sirocchi); roberta.calegari@unibo.it (R. Calegari)
ORCID: 0000-0002-0532-6777 (F. Sabbatini); 0000-0002-5011-3068 (C. Sirocchi); 0000-0003-3794-2942 (R. Calegari)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

To effectively compare the knowledge of different agents, similarity measurement metrics and techniques are needed, allowing agents to assess the extent to which their knowledge overlaps, identify areas of agreement and disagreement, and optimise their interactions accordingly [3]. For example, in collaborative decision-making, agents can use similarity metrics to evaluate the compatibility of their internal knowledge or beliefs. By comparing their knowledge representations, agents can pinpoint areas of alignment and conflict, facilitating negotiations and enabling collaborative decisions that capitalise on the strengths and complementary aspects of each agent’s knowledge. As for knowledge sharing, similarity metrics help determine which pieces of knowledge are most relevant to share amongst agents. High similarity scores indicate significant overlap, suggesting that certain knowledge may be redundant if shared. In resource allocation and task assignment, especially in multi-agent systems with diverse expertise and capabilities, similarity metrics can match agents to tasks based on the similarity between their knowledge and the task requirements. Similarity metrics also guide adaptive learning and knowledge transfer processes by helping agents identify peers with similar knowledge profiles.
Finally, similarity metrics can help identify common ground when agents encounter conflicting beliefs or preferences. By focusing on areas of high similarity, agents can find mutually acceptable solutions and build consensus, facilitating smoother negotiation and collaboration.

Established metrics for measuring the similarity between two objects (e.g., vectors) include cosine similarity [4], Jaccard similarity [5], and semantic similarity based on contextual embeddings, such as BERTScore [6]. However, these metrics are generic distances that are not specific to comparing knowledge bases. As a result, their measurements do not take into account some fundamental aspects, such as overlapping or fragmented boundaries, or regions with varying domain densities. These aspects should be considered to effectively measure the similarity between knowledge bases.

Accordingly, in this paper, we define new similarity metrics designed to overcome these limitations. Building on the premise that symbolic knowledge can be expressed through logical rules, e.g., approximated as hypercubes [7, 8], we propose similarity metrics for comparing two symbolic knowledge bases in this form. Instead of comparing each rule with all possible others, which would waste computational resources, our approach compares the local explanations provided for the same instances by two specific knowledge pieces.¹ Since our approach considers explanations at a local level and aggregates similarity measurements across instances, it offers a comprehensive evaluation of knowledge similarity, addressing several limitations of existing metrics.

2. Explanation-Based Similarity Metrics

In this section, we formally present four formulations of metrics designed to express the similarity between pairs of knowledge pieces, highlighting their differences.
These metrics are based on the following assumption: the similarity between two distinct knowledge pieces can be assessed by measuring the pairwise similarity of their local explanations when queried with the same instance. The core idea is to gather these similarity measurements across a sufficiently large set of instances and then aggregate them, for example, by averaging.

Leveraging local explanations simplifies the similarity assessment for several reasons. For instance, when comparing knowledge bases composed of predictive rules, it is necessary to develop a strategy for comparing these rules. Comparing all possible pairs of rules is computationally infeasible and often uninformative, as a rule may be similar to one other rule but very different from all the others in the knowledge base. This issue can be mitigated by considering only pairs of overlapping/adjacent/close rules, though defining formal notions for these concepts is complex. Exploiting local explanations bypasses these challenges and provides reliable proxies for knowledge similarity, enabling the creation of computationally feasible metrics.

¹ We use the terms “instance”, “sample” and “individual” as synonyms representing a single entry of the data set at hand.

2.1. Notation

Let us briefly introduce the notation used in the following sections. Let $\mathcal{S}$ represent a data set consisting of $n$ pairs $(x_i, y_i)$, where $x_i$ is an instance described by $k$ input attributes $(x_i^1, x_i^2, \dots, x_i^k)$, and $y_i$ is the corresponding outcome. Formally:

$$\mathcal{S} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}, \qquad x_i = (x_i^1, x_i^2, \dots, x_i^k).$$

Let $D_x$ and $D_y$ be the domains of the instances’ inputs and outputs, respectively:

$$(x_i \in D_x) \wedge (y_i \in D_y), \qquad \forall i = 1, 2, \dots, n.$$

Let us assume the existence of a predictive function $f$ defined as follows:

$$f : D_x \to D_y, \qquad f(x_i) = \hat{y}_i,$$

where $\hat{y}_i$ is the value predicted by $f$ for the instance $x_i$.
Any entity with predictive capabilities, such as a machine learning model or symbolically represented knowledge, can be modelled with $f$. In this paper, we consider without loss of generality the comparison between two symbolic knowledge bases, e.g., knowledge encoded by domain experts or obtained with symbolic knowledge-extraction (SKE) techniques applied to black-box machine learning predictors [9]. Nonetheless, our approach can be employed to compare any pair of objects providing local explanations.

Finally, let us assume the existence of an explaining function $e_f$, which maps data set instances $x_i$ to their corresponding local explanations $\xi_i^f$, derived by analysing $f$:

$$e_f : D_x \to D_x, \qquad e_f(x_i) = \xi_i^f, \qquad x_i \in \xi_i^f \subseteq D_x, \qquad \forall i = 1, 2, \dots, n.$$

Specifically, $\xi_i^f$ defines a subregion within the domain $D_x$ that encloses $x_i$ and provides a local explanation for the prediction $f(x_i)$.

The similarity metrics introduced in the following subsections are represented by the operators $\overset{\mathcal{S}}{\approx}$, $\overset{\mathcal{S}}{\sim}$, $\overset{\mathcal{S}}{\approx}_{\subset}$, and $\overset{\mathcal{S}}{\sim}_{\subset}$. These metrics are functions evaluated on the instances of a data set $\mathcal{S}$, mapping pairs of predictive functions $f_1$ and $f_2$ to the $[0, 1]$ interval:

$$\overset{\mathcal{S}}{\approx},\ \overset{\mathcal{S}}{\sim},\ \overset{\mathcal{S}}{\approx}_{\subset},\ \overset{\mathcal{S}}{\sim}_{\subset} \;:\; (D_x \to D_y) \times (D_x \to D_y) \to [0, 1].$$

The symbol $\approx$ signifies approximate equality between operands, while $\sim$ denotes similarity that is not necessarily bound to equality. Additionally, the subscript $\subset$ indicates asymmetric relations, while its absence indicates symmetric ones. These symbols are used as infix binary operators (e.g., $f_1 \overset{\mathcal{S}}{\sim} f_2$).

The specific task at hand determines the nuances of similarity assessment, which must satisfy specific sets of properties. For example, when comparing a knowledge base encoded by human experts with symbolic knowledge extracted via SKE, one may desire the latter to precisely align with the expert knowledge, focusing on the exact coincidence between corresponding decision boundaries.
Conversely, a different scenario may involve replacing a knowledge base with a newer, more accurate one without altering the decision boundaries that led to correct predictions in the past (backward compatibility; [10]). In this case, the similarity assessment should only consider the subset of data set instances that received the expected outcomes from the current knowledge piece.

Accordingly, we define two subsets of $\mathcal{S}$ suitable for similarity assessment based on user needs: $\mathcal{S}^{f_1=f_2}$ and $\mathcal{S}^{f}$.

Definition 1 (Congruence set). The congruence set $\mathcal{S}^{f_1=f_2}$ is defined as the subset of $\mathcal{S}$ instances receiving the same predictions from both $f_1$ and $f_2$, regardless of the prediction correctness:

$$\mathcal{S}^{f_1=f_2} = \{x_i \mid (x_i, y_i) \in \mathcal{S} \wedge f_1(x_i) = f_2(x_i)\}.$$

Definition 2 (Backward compatibility set). The backward compatibility set $\mathcal{S}^{f}$ is defined as the subset of $\mathcal{S}$ instances receiving correct predictions from $f$:

$$\mathcal{S}^{f} = \{x_i \mid (x_i, y_i) \in \mathcal{S} \wedge f(x_i) = y_i\}.$$

If necessary, the intersection of the two sets can be taken to fulfill both definitions:

Definition 3 (Backward-compatible congruence set). The backward-compatible congruence set $\mathcal{S}^{f_1,f_2}$ is defined as the subset of $\mathcal{S}$ instances receiving correct predictions from both $f_1$ and $f_2$:

$$\mathcal{S}^{f_1,f_2} = \{x_i \mid (x_i, y_i) \in \mathcal{S} \wedge f_1(x_i) = f_2(x_i) = y_i\}.$$

In the following we use the symbol $\mathcal{S}^{\star}$ as a generic notation for any of these subsets.

2.2. Perfect Overlapping

Two knowledge bases are considered perfectly overlapping if they provide identical local explanations for the same queries. This definition can be extended to produce not just a Boolean indicator of similarity, but also a numerical evaluation of the degree of overlap between the explanations provided for a set of instances.

Definition 4 (Perfect overlapping). Let $f_1$, $f_2$ be two functions representing the predictive process of two symbolic knowledge bases and $e_{f_1}$, $e_{f_2}$ be the explaining functions for $f_1$ and $f_2$, respectively.
The knowledge bases are perfectly overlapping if the explanations provided by them for each instance $x_i$ of $\mathcal{S}^{\star}$ are perfectly overlapping, i.e., if $f_1 \overset{\mathcal{S}^{\star}}{\approx} f_2 = 1$:

$$f_1 \overset{\mathcal{S}^{\star}}{\approx} f_2 = \frac{1}{|\mathcal{S}^{\star}|} \sum_{x_i \in \mathcal{S}^{\star}} \frac{volume\left(e_{f_1}(x_i) \cap e_{f_2}(x_i)\right)}{volume\left(e_{f_1}(x_i) \cup e_{f_2}(x_i)\right)},$$

where $volume(\cdot)$ is a function expressing the volume of an input space subregion.

[Figure 1: Examples of the proposed similarity metrics associated with high scores (left) and low scores (right). Panels: (a) Perfect overlapping ($\overset{\mathcal{S}}{\approx}$); (b) Sensitive overlapping ($\overset{\mathcal{S}}{\sim}$); (c) Perfect fragmentation ($\overset{\mathcal{S}}{\approx}_{\subset}$); (d) Sensitive fragmentation ($\overset{\mathcal{S}}{\sim}_{\subset}$). Axes: $d_1$ (horizontal), $d_2$ (vertical).]

[Figure 2: Different scenarios of overlapping explanations. Panels: (a) Explanations with non-uniform density; (b) Fragmented explanations. Axes: $d_1$ (horizontal), $d_2$ (vertical).]

Definition 4 essentially represents an intersection-over-union operation, also known as Jaccard similarity, applied to pairs of local explanations for each instance in a given data set. This metric reaches a score of exactly 1 only if the explanations are perfectly equivalent, meaning the intersection of the explanations matches their union. The calculation of $\overset{\mathcal{S}}{\approx}$ is illustrated in Figure 1a.

While this definition serves as an ideal proxy for detecting perfect overlap between distinct knowledge bases, it presents two potential issues.
The first issue arises when applying the knowledge bases to a data set with a non-homogeneous sample distribution, as illustrated in Figure 2a. In the figure, explanations $\xi_i^{f_1}$ and $\xi_i^{f_2}$ for the instance $x_i \in D_x$ overlap but are not identical. The intersection-over-union measurement between $\xi_i^{f_1}$ and $\xi_i^{f_2}$ would yield a low score, indicating low perfect overlap for $f_1$ and $f_2$. However, the intersection of the two explanations includes the most relevant portion of the explanation union, specifically the region with the highest density of samples $x_j \in D_x$. Thus, even if the explanations are not identical, the regions of the input space within the union but outside the intersection may be considered less significant, and therefore less penalising in the similarity assessment. This concept is formalised into a scoring metric in Definition 5.

The second issue arises when comparing a knowledge base providing many distinct explanations with small coverage ($f_1$) to another generating fewer explanations with larger coverage ($f_2$), as illustrated in Figure 2b. In the figure, two instances $x_i$ and $x_j \in D_x$ have different local explanations $\xi_i^{f_1}$ and $\xi_j^{f_1}$, respectively, according to knowledge base $f_1$, while they share the same explanation $\xi_\star^{f_2}$ according to knowledge base $f_2$. However, $\xi_\star^{f_2}$ essentially represents the union of $\xi_i^{f_1}$ and $\xi_j^{f_1}$. Thus, the perfect overlap score for $f_1$ and $f_2$ would be low, even though the knowledge bases are nearly identical except for the fragmentation in the explanations provided by $f_1$. Since this fragmentation may not be a significant factor in assessing knowledge similarity in some applications, we formalise a scoring metric accordingly in Definition 6.

2.3. Sensitive Overlapping

The definition of perfect overlap can be relaxed to a notion of sensitive overlap, which measures the extent to which a pair of knowledge bases produce overlapping explanations in the most relevant subregions of the input space when queried with a set of instances.
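As an informal illustration of this relaxation, the non-uniform density scenario described above can be reproduced with a toy computation. The sketch below is only an assumption-laden example: the encoding of explanations as axis-aligned boxes, the helper names, the box coordinates, and the hand-placed samples are all hypothetical and are not taken from the experiments. It contrasts a ratio based on volumes with a ratio based on the number of enclosed samples for a single pair of explanations.

```python
from math import prod


def volume(box):
    """Volume of an axis-aligned box given as ((low, high), ...) per dimension."""
    return prod(max(0.0, high - low) for low, high in box)


def intersection(a, b):
    """Axis-aligned intersection; an empty overlap yields high < low (zero volume)."""
    return tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))


def contains(box, point):
    """True if the point lies inside the (closed) box."""
    return all(low <= v <= high for (low, high), v in zip(box, point))


# Hypothetical explanations of the same instance under f1 and f2.
xi_f1 = ((0.0, 80.0), (0.0, 20.0))
xi_f2 = ((40.0, 120.0), (0.0, 20.0))

# Hypothetical samples: nine clustered where the boxes overlap, one outlier.
samples = [(50, 5), (55, 10), (60, 8), (65, 12), (70, 15),
           (45, 18), (75, 3), (58, 16), (62, 4), (10, 10)]

overlap = intersection(xi_f1, xi_f2)
v_overlap = volume(overlap)
# Union volume via inclusion-exclusion.
volume_ratio = v_overlap / (volume(xi_f1) + volume(xi_f2) - v_overlap)

in_overlap = sum(contains(overlap, s) for s in samples)
in_union = sum(contains(xi_f1, s) or contains(xi_f2, s) for s in samples)
count_ratio = in_overlap / in_union

print(f"volume-based: {volume_ratio:.2f}, count-based: {count_ratio:.2f}")
# prints "volume-based: 0.33, count-based: 0.90"
```

In this toy setting the two boxes share only a third of their joint volume, yet nine of the ten samples fall in the shared region, which is the situation the count-based relaxation is meant to reward.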
Definition 5 (Sensitive overlapping). Let $f_1$, $f_2$ be two functions representing the predictive process of two symbolic knowledge bases and $e_{f_1}$, $e_{f_2}$ be the explaining functions for $f_1$ and $f_2$, respectively. The knowledge bases are sensitively overlapping if the explanations provided by them for each instance $x_i$ of $\mathcal{S}^{\star}$ are overlapping in the input space regions characterised by high sample density, i.e., if $f_1 \overset{\mathcal{S}^{\star}}{\sim} f_2 = 1$:

$$f_1 \overset{\mathcal{S}^{\star}}{\sim} f_2 = \frac{1}{|\mathcal{S}^{\star}|} \sum_{x_i \in \mathcal{S}^{\star}} \frac{\left|\{x_j \mid x_j \in e_{f_1}(x_i) \cap e_{f_2}(x_i)\}\right|}{\left|\{x_j \mid x_j \in e_{f_1}(x_i) \cup e_{f_2}(x_i)\}\right|}.$$

Definition 5 employs an intersection-over-union operation applied to pairs of local explanations, evaluated based on the number of enclosed data set samples rather than the explanation volumes. The metric reaches a score of 1 only if, for every instance considered, all the data set samples enclosed within the union of the examined explanations are also enclosed within their intersection, even if the union covers a larger region. The calculation of $\overset{\mathcal{S}}{\sim}$ is illustrated in Figure 1b.

2.4. Perfect Fragmentation

The definition of perfect overlap can also be relaxed into a notion of perfect fragmentation, representing the extent to which a knowledge base provides explanations that are perfectly enclosed within broader explanations generated by a different knowledge base when both are queried with the same instances.

Definition 6 (Perfect fragmentation). Let $f_1$, $f_2$ be two functions representing the predictive process of two symbolic knowledge bases and $e_{f_1}$, $e_{f_2}$ be the explaining functions for $f_1$ and $f_2$, respectively. The knowledge base $f_1$ is perfectly fragmented with respect to the knowledge base $f_2$ if the explanation provided by $f_1$ for each instance $x_i$ of $\mathcal{S}^{\star}$ is entirely enclosed within the explanation of $f_2$ for the same instance, i.e., if $f_1 \overset{\mathcal{S}^{\star}}{\approx}_{\subset} f_2 = 1$:

$$f_1 \overset{\mathcal{S}^{\star}}{\approx}_{\subset} f_2 = \frac{1}{|\mathcal{S}^{\star}|} \sum_{x_i \in \mathcal{S}^{\star}} \frac{volume\left(e_{f_1}(x_i) \cap e_{f_2}(x_i)\right)}{volume\left(e_{f_1}(x_i)\right)},$$

where $volume(\cdot)$ has the same meaning as in Definition 4.
Definition 6 is consequently less strict than Definition 4, as it compares the intersection between explanations with a single explanation rather than with their union. The computation of $\overset{\mathcal{S}}{\approx}_{\subset}$, depicted in Figure 1c, is asymmetrical and results in a score of 1 only if all the data set samples under consideration receive an explanation from the first operand that is entirely enclosed within the broader explanation generated by the second operand.

2.5. Sensitive Fragmentation

The concepts of sensitive overlap and perfect fragmentation can be combined into the definition of sensitive fragmentation, which assesses similarity by considering the extent to which a knowledge base generates explanations whose most relevant subregion is perfectly enclosed within the broader explanation provided by a different knowledge base.

Definition 7 (Sensitive fragmentation). Let $f_1$, $f_2$ be two functions representing the predictive process of two symbolic knowledge bases and $e_{f_1}$, $e_{f_2}$ be the explaining functions for $f_1$ and $f_2$, respectively. The knowledge base $f_1$ is sensitively fragmented with respect to the knowledge base $f_2$ if the explanation provided by $f_1$ for each instance $x_i$ of $\mathcal{S}^{\star}$ is overlapping with the corresponding explanation provided by $f_2$ in the input space regions characterised by high sample density, i.e., if $f_1 \overset{\mathcal{S}^{\star}}{\sim}_{\subset} f_2 = 1$:

$$f_1 \overset{\mathcal{S}^{\star}}{\sim}_{\subset} f_2 = \frac{1}{|\mathcal{S}^{\star}|} \sum_{x_i \in \mathcal{S}^{\star}} \frac{\left|\{x_j \mid x_j \in e_{f_1}(x_i) \cap e_{f_2}(x_i)\}\right|}{\left|\{x_j \mid x_j \in e_{f_1}(x_i)\}\right|}.$$

Definition 7 represents the least rigid metric amongst those proposed in this study, as it compares the intersection of explanations with a single explanation (rather than their union) and evaluates based on the number of enclosed data instances (rather than volumes).
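The four definitions can be sketched together in a few lines. The following is a minimal illustration under several assumptions that are not part of the formal definitions: explanations are closed axis-aligned boxes, a knowledge base is a hypothetical list of (box, label) rules in which the first matching rule provides both prediction and explanation, the samples $x_j$ counted in Definitions 5 and 7 are passed explicitly, and all function names are invented for this example. Degenerate explanations with zero volume are not handled.

```python
from math import prod


def volume(box):
    """Volume of an axis-aligned box given as ((low, high), ...) per dimension."""
    return prod(max(0.0, high - low) for low, high in box)


def intersection(a, b):
    """Axis-aligned intersection; an empty overlap yields high < low."""
    return tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))


def contains(box, point):
    return all(low <= v <= high for (low, high), v in zip(box, point))


def first_match(rules, x):
    """Return the (box, label) of the first rule covering x (assumed policy)."""
    for box, label in rules:
        if contains(box, x):
            return box, label
    raise ValueError(f"no rule covers {x}")


def backward_compatible_congruence_set(data, rules1, rules2):
    """Instances predicted correctly by both knowledge bases (Definition 3)."""
    return [x for x, y in data
            if first_match(rules1, x)[1] == first_match(rules2, x)[1] == y]


def similarity(rules1, rules2, instances, samples, *, by_count, asymmetric):
    """Average ratio over instances; the flags select Definitions 4 to 7."""
    total = 0.0
    for x in instances:
        a, b = first_match(rules1, x)[0], first_match(rules2, x)[0]
        inter = intersection(a, b)
        if by_count:  # Definitions 5 and 7: count enclosed samples
            n_a = sum(contains(a, s) for s in samples)
            n_b = sum(contains(b, s) for s in samples)
            num = sum(contains(inter, s) for s in samples)
            den = n_a if asymmetric else n_a + n_b - num
        else:         # Definitions 4 and 6: compare volumes
            num = volume(inter)
            den = volume(a) if asymmetric else volume(a) + volume(b) - num
        total += num / den
    return total / len(instances)


# Toy fragmentation scenario: f1 splits the square that f2 covers with one rule.
f1 = [(((0.0, 5.0), (0.0, 10.0)), "A"), (((5.0, 10.0), (0.0, 10.0)), "A")]
f2 = [(((0.0, 10.0), (0.0, 10.0)), "A")]
data = [((2.0, 3.0), "A"), ((7.0, 4.0), "A"), ((4.0, 8.0), "A"), ((9.0, 9.0), "A")]
samples = [x for x, _ in data]
s_star = backward_compatible_congruence_set(data, f1, f2)

perfect_overlap = similarity(f1, f2, s_star, samples, by_count=False, asymmetric=False)
sensitive_overlap = similarity(f1, f2, s_star, samples, by_count=True, asymmetric=False)
perfect_frag = similarity(f1, f2, s_star, samples, by_count=False, asymmetric=True)
sensitive_frag = similarity(f1, f2, s_star, samples, by_count=True, asymmetric=True)
reverse_frag = similarity(f2, f1, s_star, samples, by_count=False, asymmetric=True)

print(perfect_overlap, sensitive_overlap, perfect_frag, sensitive_frag, reverse_frag)
# prints "0.5 0.5 1.0 1.0 0.5"
```

In this toy case the symmetric scores are penalised by the fragmentation of the first knowledge base, whereas both fragmentation scores reach 1; swapping the operands lowers the perfect fragmentation score, which reflects the asymmetry of the relation. Since the denominator of each fragmentation score is taken over a subset of the union, a fragmentation score computed this way cannot be lower than the corresponding overlap score.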
The sensitive fragmentation metric is asymmetrical and yields a score of 1 only when all the data set samples enclosed within the explanation provided by the first operand are also enclosed within the explanation provided by the second operand, for every pair of explanations produced for the considered data set instances. The calculation of $\overset{\mathcal{S}}{\sim}_{\subset}$ is illustrated in Figure 1d.

3. Experiments

We conducted experiments using the presented similarity metrics on the original Wisconsin Breast Cancer (WBC) data set [11], which comprises nine input features ranging from 1 to 10 and represents cases of breast cancer. The data set’s binary output class indicates whether the tumour is benign or malignant.

Our objective was to compare existing knowledge bases in the literature with alternatives obtained by applying SKE techniques. We compared these based on accuracy, readability, and completeness of the knowledge pieces against the data set ground truth, through both individual evaluations and the FiRe aggregating scoring metric [12]. Additionally, we measured the similarity between the extracted knowledge bases and those existing in the literature, taken as reference, as described in this study.

The knowledge bases used as reference are from the works of Duch et al. [13] and Hayashi and Nakano [14]. The corresponding classification rules are displayed in Listings 1 and 2, respectively. Notably, both sets of rules base predictions on two input features: “Bare Nuclei” (BN) and “Clump Thickness” (CT). Decision boundaries identified by these knowledge bases are illustrated in Figure 3 as coloured rectangles.

[Figure 3: Comparison of decision boundaries of the knowledge rules proposed by Duch, Adamczak, and Grabczewski (2001) and Hayashi and Nakano (2015) for the original WBC data set (highlighted with coloured backgrounds) with the knowledge extracted using multiple GridEx instances applied to a gradient boosting classifier trained on the same data set (represented by hatched regions). Only the two input features used to make predictions on the data set are displayed. Samples are depicted as circles indicating the quantity of positive and negative classes, with the radius of the circles corresponding to the count of instances across both classes. Notably, denser background colours signify overlapping rules, only present in the knowledge base by Duch, Adamczak, and Grabczewski (left panels), while small uncovered regions are considered negligible due to the absence of training samples in the knowledge extracted with GridEx12 (bottom panels). Panels: (a) Duch, Adamczak, and Grabczewski + GridEx3; (b) Hayashi and Nakano + GridEx3; (c) Duch, Adamczak, and Grabczewski + GridEx6; (d) Hayashi and Nakano + GridEx6; (e) Duch, Adamczak, and Grabczewski + GridEx12; (f) Hayashi and Nakano + GridEx12. Axes: Bare nuclei (horizontal), Clump thickness (vertical).]

Listing 1: Knowledge base adapted from Duch, Adamczak, and Grabczewski (2001) for the WBC data set.
    Malignant if BN > 5.5
    Malignant if CT > 6.5
    Benign if BN <= 5.5 ∧ CT <= 6.5

Listing 2: Knowledge base adapted from Hayashi and Nakano (2015) for the WBC data set.
    Benign if BN <= 1.5
    Malignant if BN > 1.5 ∧ CT > 4.5
    Malignant if BN > 6.5 ∧ CT <= 4.5
    Benign if 1.5 < BN <= 6.5 ∧ CT <= 4.5

Listing 3: Knowledge base extracted with GridEx3.
    Benign if BN <= 4.0 ∧ CT <= 5.5
    Malignant if BN > 4.0 ∧ CT <= 5.5
    Malignant if CT > 5.5

Listing 4: Knowledge base extracted with GridEx6.
    Benign if BN <= 4.0 ∧ 4.0 < CT <= 7.0
    Malignant if BN <= 4.0 ∧ CT > 7.0
    Malignant if BN > 7.0 ∧ CT > 7.0
    Benign if BN <= 7.0 ∧ CT <= 4.0
    Malignant if BN > 7.0 ∧ CT <= 7.0
    Malignant if 4.0 < BN <= 7.0 ∧ CT > 4.0

The GridEx SKE algorithm [15] was utilised to extract symbolic knowledge from a gradient boosting (GB) classifier trained on the WBC data set. The hyper-parameters of the GB, namely the number of estimators and learning rate, were tuned via a grid search and set to 133 and 0.2033, respectively. The GB was trained using 75% of the WBC data instances as the training set, while the remaining samples were reserved for the test set, resulting in an observed test accuracy of 0.99 for the trained model. Several GridEx instances with varying parameters were applied to the trained GB. We leveraged the implementation provided within the PSyKE framework [16, 17, 18].
Three instances, employing an adaptive splitting strategy with a recursion depth of 1, were chosen to generate knowledge bases containing 3, 6, and 12 items, respectively, for subsequent analysis and comparison with reference knowledge pieces. These instances are henceforth referred to as GridEx3, GridEx6, and GridEx12, respectively. The corresponding classification rules are summarised in Listings 3 to 5, and their decision boundaries are depicted in Figure 3 as hatched regions superimposed onto the reference knowledge boundaries.

Listing 5: Knowledge base extracted with GridEx12.
    Benign if BN <= 2.3 ∧ 4.6 < CT <= 6.4
    Benign if 6.1 < BN <= 7.4 ∧ 6.4 < CT <= 8.2
    Malignant if 6.1 < BN <= 7.4 ∧ CT > 8.2
    Malignant if BN > 8.7 ∧ CT > 8.2
    Malignant if 6.1 < BN <= 8.7 ∧ CT <= 2.8
    Benign if BN <= 3.6 ∧ 2.8 < CT <= 4.6
    Malignant if 3.6 < BN <= 6.1 ∧ 2.8 < CT <= 4.6
    Benign if BN <= 6.1 ∧ CT <= 2.8
    Malignant if BN > 8.7 ∧ CT <= 8.2
    Malignant if 7.4 < BN <= 8.7 ∧ CT > 2.8
    Malignant if 2.3 < BN <= 7.4 ∧ 4.6 < CT <= 6.4
    Malignant if BN <= 6.1 ∧ CT > 6.4

Table 1: Quality assessments for the knowledge bases adopted in our experiments.

    Knowledge        Accuracy   Size   Coverage   6-FiRe
    Duch et al.      0.94       3      1.00       0.061
    Hayashi et al.   0.94       4      1.00       0.062
    GridEx3          0.93       3      1.00       0.070
    GridEx6          0.96       6      1.00       0.045
    GridEx12         0.94       12     1.00       0.131

Quality evaluations regarding classification accuracy, human readability (quantified as knowledge size), and completeness (represented by the percentage of covered test samples; [19]) for all knowledge pieces utilised in the experiments are detailed in Table 1 for the test set. Additionally, we report the FiRe metric with a fidelity/readability trade-off parameter $\psi = 6$ (6-FiRe) to offer a more concise quality assessment.
It is worth noting that high-quality knowledge corresponds to low FiRe scores, and higher $\psi$ values prioritise predictive accuracy while disregarding readability. Furthermore, similarity assessments between the extracted and reference knowledge bases, evaluated on the backward-compatible congruence set (Definition 3), are summarised in Table 2. It is important to mention that the robustness of the results presented in Tables 1 and 2 has been confirmed through 10-fold cross-validation repeated 10 times. Given the low variability of the results, we provide here only the average values without their corresponding standard deviations.

Amongst the reference knowledge bases considered, the one presented by Duch et al. exhibits identical accuracy and coverage compared to that offered by Hayashi and Nakano. However, the former comprises fewer items, leading to knowledge that is more readable and consequently of higher quality. This is evidenced by a lower 6-FiRe score, as anticipated.

Similarly, amongst the extracted knowledge bases, the one provided by GridEx6 emerges as the best according to the FiRe score, boasting the highest accuracy and completeness, despite its suboptimal size. Indeed, any loss in human readability is offset by the gain in accuracy. This knowledge piece also outperforms the reference ones for the same reasons.

Table 2: Similarity measurements between the GridEx outputs ($f_1$) and the reference knowledge bases ($f_2$), evaluated on $\mathcal{S}^{f_1,f_2}$.

    f1         Metric                          f2 = Duch et al.   f2 = Hayashi et al.
    GridEx3    Perfect overlapping (≈)         0.48               0.34
               Sensitive overlapping (∼)       0.82               0.86
               Perfect fragmentation (≈⊂)      0.83               0.46
               Sensitive fragmentation (∼⊂)    0.92               0.92
    GridEx6    Perfect overlapping (≈)         0.34               0.14
               Sensitive overlapping (∼)       0.50               0.44
               Perfect fragmentation (≈⊂)      0.84               0.42
               Sensitive fragmentation (∼⊂)    0.99               0.87
    GridEx12   Perfect overlapping (≈)         0.23               0.08
               Sensitive overlapping (∼)       0.40               0.32
               Perfect fragmentation (≈⊂)      0.97               0.42
               Sensitive fragmentation (∼⊂)    1.00               0.90

Upon comparing the knowledge bases of GridEx6 ($f_1$) and Duch et al. ($f_2$), it becomes apparent that their decision boundaries are not identical. This enables GridEx to achieve superior predictive performance. However, the diversity in boundaries is reflected in a low perfect-overlapping score: $f_1 \overset{\mathcal{S}^{f_1,f_2}}{\approx} f_2 = 0.34$. Given the distribution of data set samples across the classification rules, the sensitive-overlapping score is also low: $f_1 \overset{\mathcal{S}^{f_1,f_2}}{\sim} f_2 = 0.50$. From these perspectives, the knowledge bases are dissimilar, and substituting the reference knowledge with the extracted one does not ensure coherence and continuity.

A more thorough examination of the disparities between the two knowledge bases reveals that their boundaries differ because GridEx6 comprises more, albeit smaller, classification rules. Consequently, the perfect-fragmentation score is elevated, as the GridEx6 rules are predominantly enclosed within the reference ones: $f_1 \overset{\mathcal{S}^{f_1,f_2}}{\approx}_{\subset} f_2 = 0.84$. Given that non-overlapping rule regions typically exhibit a sparse sample density, it follows that the sensitive-fragmentation score is even higher: $f_1 \overset{\mathcal{S}^{f_1,f_2}}{\sim}_{\subset} f_2 = 0.99$. Hence, continuity and coherence are maintained if preserving boundaries is not imperative.
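To give a feel for how the volume-based terms behind such scores arise, they can be worked through for one pair of rules. The sketch below considers a hypothetical instance with BN = 2 and CT = 3, which is covered by the benign rule of Duch et al. (Listing 1) and by the fourth GridEx6 rule (Listing 4). It assumes that open-ended conditions are clipped to the feature range [1, 10] of the data set; how unbounded rule sides are treated in the actual experiments is not stated here, so the clipping, the box encoding, and the helper names are assumptions of this example. The resulting values are single-instance terms and are not expected to match the averaged entries of Table 2.

```python
from math import prod


def volume(box):
    """Area of a (BN, CT) box given as ((low, high), (low, high))."""
    return prod(max(0.0, high - low) for low, high in box)


def intersection(a, b):
    """Axis-aligned intersection of two boxes."""
    return tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))


# Rules covering the hypothetical instance (BN = 2, CT = 3), clipped to [1, 10].
gridex6_rule = ((1.0, 7.0), (1.0, 4.0))   # Benign if BN <= 7.0 and CT <= 4.0
duch_rule = ((1.0, 5.5), (1.0, 6.5))      # Benign if BN <= 5.5 and CT <= 6.5

shared = volume(intersection(gridex6_rule, duch_rule))       # 4.5 * 3.0 = 13.5
union = volume(gridex6_rule) + volume(duch_rule) - shared    # 18 + 24.75 - 13.5

overlap_term = shared / union                        # term of Definition 4
fragmentation_term = shared / volume(gridex6_rule)   # term of Definition 6

print(f"overlap: {overlap_term:.2f}, fragmentation: {fragmentation_term:.2f}")
# prints "overlap: 0.46, fragmentation: 0.75"
```

For this instance the fragmentation term exceeds the overlap term because the part of the reference rule lying outside the extracted rule enlarges the union but does not enter the denominator of Definition 6.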
In summary, in line with the visual examination, it can be concluded that the knowledge base of GridEx6 does not exhibit perfect or sensitive overlap with the reference provided by Duch et al. However, it is largely fragmented with respect to the latter in the perfect sense (0.84), and it earns a near-optimal score (0.99) when assessing its degree of sensitive fragmentation.

4. Discussion and conclusion

Our study introduces novel similarity metrics tailored for comparing knowledge bases within multi-agent systems, addressing the critical need for effective knowledge assessment in autonomous decision-making environments. Through our experiments and analyses, we have demonstrated the practical applicability and significance of these metrics in enhancing various aspects of multi-agent systems’ functionality and efficiency.

The results of our experiments highlight the utility of our proposed similarity metrics in facilitating collaborative decision-making, knowledge sharing, coordination, and resource allocation within multi-agent systems. Specifically, our metrics enable agents to assess the compatibility of their internal knowledge, identify areas of agreement and conflict, and optimise their interactions accordingly. Moreover, they aid in determining which pieces of knowledge are most relevant to share amongst agents and in matching agents to tasks based on their knowledge profiles and task requirements.

In conclusion, our study underscores the importance of effective knowledge assessment in multi-agent systems and introduces novel similarity metrics designed to meet this need. By leveraging these metrics, agents can better understand the landscape of their collective knowledge, leading to more effective collaboration, decision-making, and coordination within the system. We believe that our proposed metrics represent a significant step forward in advancing the capabilities of multi-agent systems and pave the way for further research and development in this field.
Acknowledgments

This work has been supported by the EU ICT-48 2020 project TAILOR (No. 952215) and the European Union’s Horizon Europe AEQUITAS research and innovation programme under grant number 101070363.

References

[1] V. Marik, M. Pechoucek, O. Štepánková, Social knowledge in multi-agent systems, Multi-Agent Systems and Applications: 9th ECCAI Advanced Course, ACAI 2001 and Agent Link’s 3rd European Agent Systems Summer School, EASSS 2001 Prague, Czech Republic, July 2–13, 2001 Selected Tutorial Papers 9 (2001) 211–245.
[2] P. Panzarasa, N. R. Jennings, T. J. Norman, Formalizing collaborative decision-making and practical reasoning in multi-agent systems, Journal of Logic and Computation 12 (2002) 55–117.
[3] S. Harispe, S. Ranwez, S. Janaqi, J. Montmain, Semantic measures for the comparison of units of language, concepts or instances from text and knowledge base analysis, arXiv preprint arXiv:1310.1285 (2013).
[4] C. D. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Cambridge University Press, 2008. URL: https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf. doi:10.1017/CBO9780511809071.
[5] A. H. Murphy, The Finley affair: A signal event in the history of forecast verification, Weather and Forecasting 11 (1996) 3–20.
[6] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[7] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, Hypercube-based methods for symbolic knowledge extraction: Towards a unified model, in: A. Ferrando, V. Mascardi (Eds.), WOA 2022 – 23rd Workshop “From Objects to Agents”, volume 3261 of CEUR Workshop Proceedings, Sun SITE Central Europe, RWTH Aachen University, 2022, pp. 48–60. URL: http://ceur-ws.org/Vol-3261/paper4.pdf.
[8] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, Towards a unified model for symbolic knowledge extraction with hypercube-based methods, Intelligenza Artificiale 17 (2023) 63–75. URL: https://doi.org/10.3233/IA-230001. doi:10.3233/IA-230001.
[9] G. Ciatto, F. Sabbatini, A. Agiollo, M. Magnini, A. Omicini, Symbolic knowledge extraction and injection with sub-symbolic predictors: A systematic literature review, ACM Computing Surveys 56 (2024) 161:1–161:35. URL: https://doi.org/10.1145/3645103. doi:10.1145/3645103.
[10] J. L. Haggerty, R. J. Reid, G. K. Freeman, B. H. Starfield, C. E. Adair, R. McKendry, Continuity of care: A multidisciplinary review, BMJ 327 (2003) 1219–1221.
[11] W. H. Wolberg, O. L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proceedings of the National Academy of Sciences 87 (1990) 9193–9196.
[12] F. Sabbatini, R. Calegari, Symbolic knowledge-extraction evaluation metrics: The FiRe score, in: K. Gal, A. Nowé, G. J. Nalepa, R. Fairstein, R. Rădulescu (Eds.), Proceedings of the 26th European Conference on Artificial Intelligence, ECAI 2023, Kraków, Poland. September 30 – October 4, 2023, 2023. URL: https://ebooks.iospress.nl/doi/10.3233/FAIA230496. doi:10.3233/FAIA230496.
[13] W. Duch, R. Adamczak, K. Grabczewski, A new methodology of extraction, optimization and application of crisp and fuzzy logical rules, IEEE Transactions on Neural Networks 12 (2001) 277–306. URL: https://doi.org/10.1109/72.914524. doi:10.1109/72.914524.
[14] Y. Hayashi, S. Nakano, Use of a recursive-rule extraction algorithm with J48graft to achieve highly accurate and concise rule extraction from a large breast cancer dataset, Informatics in Medicine Unlocked 1 (2015) 9–16.
[15] F. Sabbatini, G. Ciatto, A. Omicini, GridEx: An algorithm for knowledge extraction from black-box regressors, in: D. Calvaresi, A. Najjar, M. Winikoff, K. Främling (Eds.), Explainable and Transparent AI and Multi-Agent Systems. Third International Workshop, EXTRAAMAS 2021, Virtual Event, May 3–7, 2021, Revised Selected Papers, volume 12688 of LNCS, Springer Nature, Basel, Switzerland, 2021, pp. 18–38. doi:10.1007/978-3-030-82017-6_2.
[16] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, On the design of PSyKE: A platform for symbolic knowledge extraction, in: R. Calegari, G. Ciatto, E. Denti, A. Omicini, G. Sartor (Eds.), WOA 2021 – 22nd Workshop “From Objects to Agents”, volume 2963 of CEUR Workshop Proceedings, Sun SITE Central Europe, RWTH Aachen University, 2021, pp. 29–48. 22nd Workshop “From Objects to Agents” (WOA 2021), Bologna, Italy, 1–3 September 2021. Proceedings.
[17] R. Calegari, F. Sabbatini, The PSyKE technology for trustworthy artificial intelligence 13796 (2023) 3–16. URL: https://doi.org/10.1007/978-3-031-27181-6_1. doi:10.1007/978-3-031-27181-6_1, XXI International Conference of the Italian Association for Artificial Intelligence, AIxIA 2022, Udine, Italy, November 28 – December 2, 2022, Proceedings.
[18] F. Sabbatini, G. Ciatto, R. Calegari, A. Omicini, Symbolic knowledge extraction from opaque ML predictors in PSyKE: Platform design & experiments, Intelligenza Artificiale 16 (2022) 27–48. URL: https://doi.org/10.3233/IA-210120. doi:10.3233/IA-210120.
[19] F. Sabbatini, R. Calegari, On the evaluation of the symbolic knowledge extracted from black boxes, AI and Ethics 4 (2024) 65–74. doi:10.1007/s43681-023-00406-1.