Case Study on the Development of a Recommender for Apple Disease Diagnosis with a Knowledge-based Bayesian Network

Gabriele Sottocornola¹, Sanja Baric¹, Fabio Stella² and Markus Zanker¹

¹ Free University of Bozen-Bolzano, Piazza Università, 1, 39100 Bolzano, Italy
² University of Milano-Bicocca, Piazza dell’Ateneo Nuovo, 1, 20126 Milano, Italy

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, September 27 to October 1, 2021, Amsterdam, Netherlands
Email: gsottocornola@unibz.it (G. Sottocornola); sanja.baric@unibz.it (S. Baric); fabio.stella@unimib.it (F. Stella); markus.zanker@unibz.it (M. Zanker)
ORCID: 0000-0001-9983-2330 (G. Sottocornola)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract

This paper presents a case study of a knowledge-based recommender system capable of diagnosing post-harvest diseases of apples. It describes the process of knowledge elicitation and construction of a Bayesian Network reasoning system as well as its evaluation with three different types of studies involving diseased apples. The ground truth of diseased instances has been established by genome sequencing in a lab. The paper demonstrates the performance differences of knowledge-based reasoning mechanisms due to different users interacting with the system under different conditions and proposes methods for boosting the performance by likelihood evidence learned from the estimated consensus of users’ and expert’s interactions.

Keywords

Case Study in Agriculture, Knowledge-based Recommendation, Bayesian Network, Likelihood Evidence

1. Introduction

Apple trees are the most common temperate fruit tree species, since their fruits can be stored for prolonged periods of time under controlled atmosphere conditions. However, physiological disorders and pathogenic microorganisms can deteriorate the quality and quantity of the production during storage, and lead to considerable economic losses [1]. For instance, in Northern Europe, storage losses due to pathogenic microorganisms were estimated to reach up to 10% in integrated production and up to 30% in organic production [2]. Therefore, an effective knowledge-based recommender system, able to suggest a correct diagnosis of diseases manifested on stored apples in a timely manner, is of crucial importance. For instance, the right strategy for immediate damage containment, and the plant protection scheme to recommend for the following year, depend on the exact pathogen species. In order to reliably determine the nature of the disease, several macroscopic symptoms, such as appearance, color, texture and consistency of the rot, need to be considered by the system. Hence, we should provide a practical interface to elicit user feedback on the symptoms manifested on a diseased apple in order to guide the reasoning towards a recommended diagnosis. Thus, we propose BN-DSSApple, a decision support system based on the framework of Bayesian Networks (BN), a graphical probabilistic method to reason about uncertain relationships among symptoms, signs, and diseases. The user observation (i.e., the evidence) is elicited incrementally through an adaptive question-answering interface, illustrated by a visual explanation of the requested information in order to facilitate user understanding. Furthermore, we illustrate the process adopted to build the diagnostic knowledge base with the help of a domain expert in the field of post-harvest apple diseases. We analyse and address the problem of transferability of such an expert model to a larger cohort of users with different expertise levels. We thoroughly tested BN-DSSApple under different experimental conditions, simulated in 3 user studies, to prove the effectiveness of the system and its transferability across different environments.

The methodological contribution of this case study is organized according to this pipeline: a) in Section 3.1, we describe the application domain and the implemented BN-DSSApple system; b) in Section 3.2, we illustrate the process of knowledge elicitation from a domain expert for crafting the knowledge base of the BN; c) in Section 3.3, we formalize the recommendation mechanism responsible for the suggestion of a suitable diagnosis given the user feedback; d) in Section 3.4, we define the transportability problem of a knowledge-based model and we propose a possible solution, exploiting the so-called likelihood evidence.

2. Background

A Bayesian Network (BN) [3, 4] is defined by its two main components: the qualitative part represented by its graphical structure and the quantitative part consisting of the conditional probabilities. More formally, a BN is graphically represented as a directed acyclic graph (DAG) $\mathcal{G} = (N, E)$, where $N = \{n_1, n_2, \dots, n_l\}$ denotes the set of $l$ nodes and $E \subseteq N \times N$ the set of directed edges between pairs of nodes. Each node $n_i \in N$ in the DAG $\mathcal{G}$ is mapped one-to-one to a random variable $X_i \in \mathcal{X}$, where $\mathcal{X}$ denotes the set of random variables involved in the model. A random variable $X_i \in \mathcal{X}$ is represented by a set of mutually exclusive values (or states) in which the variable might be observed, $Val(X_i) = \{x_i^1, x_i^2, \dots, x_i^m\}$, where $x_i^j \in Val(X_i)$ denotes the $j$-th value of variable $X_i$. We use the notation $X_i = x_i^j$ for an observed event, to express that variable $X_i \in \mathcal{X}$ is observed (or instantiated) in the state $x_i^j \in Val(X_i)$. A conditional probability table (CPT) is associated with each random variable $X_i \in \mathcal{X}$. The CPT specifies the conditional probability distribution $P(X_i \mid pa(X_i)) \in \mathcal{P}$ over the states of $X_i$, where $\mathcal{P}$ represents the set of conditional probabilities in the model, and $pa(X_i) \subset \mathcal{X}$ denotes the set of the so-called parents of the variable $X_i$ associated with the node $n_i$ in the DAG $\mathcal{G}$. Specifically, the parent set of $X_i$ is composed of every variable $X_j \in \mathcal{X}$ associated with a node $n_j$ in the DAG $\mathcal{G}$ that is connected by a directed edge to $n_i$ (the so-called child node). More formally, $pa(X_i) = \{X_j \in \mathcal{X} : (n_j, n_i) \in E\}$. We can further define an ancestor variable $an(X_i)$ of the variable $X_i$, and a descendant variable $de(X_i)$ of variable $X_i$, if there exists a directed path (i.e., a set of directed edges) connecting node $n_a$ (associated with variable $an(X_i)$) to $n_i$ (associated with variable $X_i$), and $n_i$ to $n_d$ (associated with variable $de(X_i)$); namely $\{(n_a, n_j), (n_j, n_i), (n_i, n_h), \dots, (n_g, n_d)\} \subset E$.

It is important to mention that the DAG $\mathcal{G}$ of the BN specifies a set of probabilistic relationships among variables in the model. Namely, if an edge $(n_j, n_i) \in E$ exists in the graph, this generally implies that a causal relation holds between the variables $X_j$ and $X_i$, associated with nodes $n_j$ and $n_i$. Specifically, we typically assume that the parent $X_j$ represents the cause and the child $X_i$ represents the effect in the domain. Thus, a fundamental assumption of conditional (in)dependence between variables can be derived. This assumption is the Local Markov Assumption (or Local Independence Assumption), and it states that, given its parents $pa(X_i) \subset \mathcal{X}$ defined in the DAG $\mathcal{G}$, a variable $X_i$ is conditionally independent of all its non-descendant variables. More formally, for each variable $X_i$: $(X_i \perp X_j \mid pa(X_i))$, where $X_j \notin de(X_i)$, the set of descendants of $X_i$. This property allows us to specify the joint distribution over the space of the variables $\mathcal{X}$ in the BN model through the probability factorization $P(\mathcal{X}) = \prod_{i=1}^{l} P(X_i \mid pa(X_i))$, usually referred to as the chain rule for Bayesian networks.

3. Methodology

3.1. System Description

The presented knowledge-based decision support system, named BN-DSSApple, is conceptualized as an interactive, easy-to-use web application that allows users with different levels of domain expertise in the area of apple production (e.g., farmers, quality controllers, and storage workers) to perform in-field diagnosis of post-harvest diseases of apple fruit, relying solely on the macroscopic symptoms observed on the stored fruit. The system is designed as a recommender engine which collects the feedback of the user (i.e., the evidence) on a specific apple fruit (i.e., the target apple), in order to suggest a suitable diagnosis (i.e., a set of recommended diseases). The reasoning mechanism is performed by a Bayesian Network (BN) based on an ad-hoc knowledge base, constructed with the help of a domain expert (as described in Section 3.2).

Specifically, the system collects the user’s feedback about the target apple by asking a set of dynamic multiple-choice questions related to the macroscopic features of the observed symptoms (e.g., the shape of the rot, the origin of the infection, etc.). Each question is illustrated with exemplary pictures, which also helps non-expert users understand it. Each question is mapped to a specific variable in the BN model. This part of the system is dynamic, since the system incrementally adapts the question path based on the previous answers given by the user. For instance, when the system gets the information that spores are visible on the infected apple, it asks the user about further features of those spores (i.e., their mass distribution, colour, and origin). Furthermore, the system provides full flexibility to the user, i.e., it allows her to navigate the question path back and forth in order to revise previous answers, to provide multiple answers, or to skip questions when she lacks confidence.

3.2. Knowledge Elicitation for Bayesian Network

In order to build a diagnostic reasoning system based on a Bayesian network (i.e., both the network structure and the CPTs), two options are available: learn from data, or elicit the knowledge from the domain literature or the experts, or any combination of the above. To the best of our knowledge, no datasets are publicly available to learn significant relationships among apple diseases and macroscopic symptoms. Thus, we started by analysing a large OWL ontology which captures the entire life cycle of apple cultivation, production, handling, and storage, presented in [5]. Hence, we extracted a smaller quantitative part of the presented ontology suitable for our goal, which allows a simple reasoning mechanism connecting symptoms to diseases, thanks to a set of SWRL rules [6]. The graphical structure of this ontology is represented in Figure 1. To the best of our knowledge, the difficult task of (semi-)automatically constructing a BN from a domain ontology is still under-explored in the literature. A few practical, heuristic solutions can be found [7, 8], which can hardly be applied to our case. The main limitation of such an effort lies in the fact that the two frameworks differ in the purpose they are used for. An ontology is more suitable to describe concepts and qualitative relationships (of different nature), while the BN requires quantitative (i.e., probabilistic) definitions of correlation relationships related to the reasoning mechanism of phenomena [9].

Figure 1: The initial ontology for BN-DSSApple.

We overcame this problem by directly interviewing a domain expert for the construction of the knowledge base. Specifically, we divided this task into two distinct phases: during the first phase, we identified the random variables (i.e., the macroscopic symptoms) which are relevant in the diagnostic process; during the second phase, we determined the probability values (i.e., the CPTs) quantitatively linking the diseases to the symptoms. We first asked the domain expert to review the available ontology, and to enrich and adapt it in order to obtain an effective tool for the diagnosis of post-harvest diseases of apple based on visible macroscopic symptoms. After a few rounds of interaction, we agreed on a set of 27 discrete random variables (12 boolean and 15 categorical) related to macroscopic symptoms and signs that could be observed on the infected apple skin and pulp, together with two hidden (target) variables, namely Disease and Stage. We assumed that a target apple could be infected by one and only one disease and thus, the random variable Disease encodes the whole set of fungal diseases of our study, namely the 7 diseases Val(Disease) = {alternaria_rot, alternaria_spot, bitter_rot, botrytis, mucor_rot, neofabraea, penicillium}. The Stage random variable was introduced to facilitate the expert’s probability elicitation task. The variable represents three discrete and symbolic stages of advancement of the post-harvest infection, namely Val(Stage) = {early, medium, late}. This workaround allows the expert to visualize a specific condition of the disease and thus specify a more reliable likelihood of the symptoms.

The final BN-DSSApple graph is reported in Figure 2. The central nodes in the network, bolded and empty, represent the two hidden diagnosis variables, namely Disease and Stage. In the top part of the network, coloured in grey, are the nodes related to the lesion properties. In the right-most part, coloured in yellow, are the rot properties, while in the left-most part, coloured in green, are the lesion origin nodes. Finally, in the central-bottom part, coloured in orange, are the nodes related to the lesion type and other symptoms and, under those, coloured in cyan, the nodes representing the properties of the other symptoms.

Figure 2: The graph of the Bayesian Network for DSSApple.

In the second phase, we interviewed the domain expert in order to define the quantitative probabilistic dependencies among variables. For simplicity, we decided to start from a situation where all the symptom variables are conditionally independent of each other, given the states of Disease and Stage. Furthermore, they all depend on the two hidden variables responsible for the assessment of the diagnosis (i.e., Disease and Stage). We indicate the Disease variable as $D \in \mathcal{D}$, where $\mathcal{D}$ defines the set of hidden variables for the model. $Val(D) = \{d_1, d_2, \dots, d_n\}$ represents the set of states of the variable $D$, where $d_i$ is the $i$-th state of the Disease variable (i.e., the $i$-th disease in our pool). The Stage variable is referred to as $T \in \mathcal{D}$ and $Val(T) = \{t_1, t_2, \dots, t_m\}$ represents the set of states of variable $T$, where $t_i$ is the $i$-th state of the Stage variable. All other (observed) variables in the model are referred to as symptom variables and they belong to the set $\mathcal{S}$. A generic symptom variable $S_i \in \mathcal{S}$ is represented by a set of states $Val(S_i) = \{s_i^1, s_i^2, \dots, s_i^q\}$, where $s_i^j$ is the $j$-th state of the symptom variable $S_i$.

Moreover, we adapted the procedures described in [10] for eliciting the expert probabilities of our network. Specifically, we adopted a mixed symbolic questionnaire to help the expert express the conditional probability of each event. In more detail, two techniques were applied depending on the support of the variable. For boolean variables (i.e., each symptom variable $S_i \in \mathcal{S}$ such that $Val(S_i) = \{true, false\}$), the expert was invited to answer the question: “How frequently do you observe symptom $S_i = true$, given that you have an apple infected by disease $D = d_l$ at stage $T = t_j$?”. We allowed her to select one option on a pre-defined 6-point scale, including Always (A), Very often (V), Often (O), Sometimes (S), Rarely (R), and Never (N). The expert had to fill in a form, providing the answer for each combination $(d_l, t_j) \in Val(D) \times Val(T)$. The symbolic scale is converted into an actual probability $P(S_i = true \mid D = d_l, T = t_j)$ according to the scheme reported in Table 1. The complementary probability is consequently defined as $P(S_i = false \mid D = d_l, T = t_j) = 1 - P(S_i = true \mid D = d_l, T = t_j)$.

| answer | $P(S_i = true \mid d_l, t_j)$ |
| Always (A) | 0.999 |
| Very often (V) | 0.8 |
| Often (O) | 0.6 |
| Sometimes (S) | 0.3 |
| Rarely (R) | 0.01 |
| Never (N) | 0.001 |

Table 1: Scale to convert expert knowledge into actual probabilities. How frequently do you observe symptom $S_i = true$, given that you have apples infected by disease $d_l$ at stage $t_j$?

For categorical variables (i.e., each symptom variable $S_i \in \mathcal{S}$ such that $Val(S_i) = \{s_i^1, s_i^2, \dots, s_i^m\}$, where $m > 2$), such a process would have been too burdensome for the expert. Thus, we decided to adopt a lighter, yet effective, approach. For each categorical symptom variable $S_i \in \mathcal{S}$, given a specific disease $D = d_l$ at stage $T = t_j$, the expert was invited to simply indicate which values of $Val(S_i)$ are likely to be observed. Furthermore, we agreed on a 3-point symbolic annotation to denote the likelihood of each reported value, namely, common (no parentheses), less common (one pair of parentheses), and rare (two pairs of parentheses). The assumption underneath this choice is that many symptom values are never observed under some conditions (i.e., the resulting CPTs are sparse) and could be ignored to speed up the elicitation process. In order to convert likelihood annotations into actual probability distribution values we adopted the following heuristic. Consider a random variable $R$ with $Val(R) = \{a, b, c, d\}$, which is annotated as follows by the expert: $a$: common, $b$: less common, $c$: rare, and $d$ is ignored; then $P(a) = 2P(b) = 4P(c) = 1.0$ and $P(d) = 0.0$, i.e., the unnormalized weights are 1.0, 0.5, 0.25 and 0.0. Furthermore, a small value $\epsilon = 0.001$ is added to each probability value in order to avoid null probabilities, then the values are normalized such that $\sum_{r \in Val(R)} P(r) = 1.0$. This process completely defines a probability distribution for the categorical random variable $R$.

3.3. Recommendation Mechanism

In this section, we detail how a ranked list of recommended diseases (i.e., a diagnosis) is computed after the user provides the feedback on a target apple, answering the questions asked by the system.

The reasoning mechanism of the BN allows us to perform inference, namely, to estimate the posterior probability distribution on a target unobserved variable (i.e., the Disease variable $D$), given any set $\mathbf{S} \subseteq \mathcal{S}$ of observed variables as provided by the user (i.e., the evidence $\mathbf{E}$). The evidence set $\mathbf{E}$ is constructed incrementally by the application. At each step, the application requests the user to answer a multiple-choice question, related to a symptom variable $S_i \in \mathcal{S}$. When the user submits the observed state $s_i^j \in Val(S_i)$ as her answer, BN-DSSApple includes the new information into the evidence set, $\mathbf{E} \leftarrow \mathbf{E} \cup \{S_i = s_i^j\}$. At the end of this elicitation process, the application has access to the complete information provided by the user on the infected target apple she wants to diagnose. It is important to mention that the BN inference mechanism is robust to missing values; hence, the user is not forced to provide observations for every symptom variable $S_i \in \mathcal{S}$ in the model. Thus, if the user skips the question related to variable $S_m \in \mathcal{S}$, the evidence set $\mathbf{E}$ will not include an observation for that variable, $S_m \notin \mathbf{E}$.

Thus, the goal of the reasoning system is to provide a probability over the set of candidate diseases (i.e., the possible diagnoses). We estimate the posterior probability distribution $P(D \mid \mathbf{E})$ through an algorithm called loopy belief propagation [11]. Loopy belief propagation is an approximate message-passing method to perform inference on graphical models. In a few words, the algorithm iteratively updates the marginal distribution $P(N)$ of a node $N \in \mathcal{G}$, by updating the outgoing message, at the current iteration, from the node $N$ to each of its neighbors $V \in \mathcal{G}$ in terms of the previous iteration’s incoming messages from its neighbors.

In our recommendation engine, after completing the evidence collection process for a target apple $a$, the posterior probability computed by the BN when evidence $\mathbf{E}$ is provided is considered as a diagnosis score $s(d_i)_a$ for each disease $d_i \in Val(D)$. Namely, this probability distribution represents the confidence of the system in each disease $d_i \in Val(D)$ being the correct diagnosis for the target apple $a$. More formally, given the provided evidence set $\mathbf{E} = \{S_1 = s_1^o, S_2 = s_2^p, \dots, S_l = s_l^q\}$, defined as the set of each observed state $s_i^j \in Val(S_i)$ for each random variable $S_i \in \mathcal{S}$, the diagnosis score related to target apple $a$ for disease $d_i \in Val(D)$ is computed as:

$$s(d_i)_a = P(D = d_i \mid \mathbf{E}) \tag{1}$$

The ranked list of the $k$ suggested diseases $R_k = \{d^1, d^2, \dots, d^k\}$ shown to the user is then based on the score for each disease, such that $s(d^i) \geq s(d^{i+1})$. The parameter $k$ controls the flexibility of the system to show more or fewer recommended diseases to the user. In our evaluation, the parameter is fixed to $k = 3$.

3.4. Transferability and Likelihood Evidence

In knowledge-based modeling, but also with standard supervised learning, we often face the problem of transferring a model to a different environment (i.e., providing external validity). This type of situation is referred to as the transferability problem [12, 13]. For instance, it might be difficult to allow a vast set of users, with different expertise levels, to effectively exploit a diagnostic expert model based on domain-specific knowledge. In our application, the knowledge base of BN-DSSApple has been built with the information derived from domain literature and the empirical knowledge of a domain expert. Nevertheless, different sets of users, with less experience in the field, might perceive the same attributes (i.e., the symptoms) in a different way. In fact, the user perception is mediated by her personal experience and specific knowledge biases. This mismatch undermines the effectiveness and hence the diagnostic performance of BN-DSSApple. In this section, we formalize the problem of transferability and we propose a practical solution to bridge the gap between the expert model and the user perception.

In our scenario, the transferability problem is defined as the mismatch between the BN probability distributions (CPTs) defined by the expert, and the probability distributions derived from the usage of the system. Formally, the expert during the knowledge elicitation phase (as described in Section 3.2) implicitly defined a complete set of probabilities $\mathcal{P}^{exp} = \{P(\mathbf{S} \mid D = d_1), P(\mathbf{S} \mid D = d_2), \dots, P(\mathbf{S} \mid D = d_n)\} \subseteq \mathcal{P}$, for each set of symptom random variables $\mathbf{S}$, given the target disease $D = d_i$. At testing time, the users of our application produced a set of $u$ observations $\mathcal{E} = \{(\mathbf{E}_1, d_1), (\mathbf{E}_2, d_2), \dots, (\mathbf{E}_u, d_u)\} \subseteq \mathcal{S} \times \mathcal{D}$, where $\mathbf{E}_i = \{S_1 = s_1^o, S_2 = s_2^p, \dots, S_l = s_l^q\}$ represents the evidence provided by a user during the $i$-th diagnosis session, as a set of instantiations of symptom variables, and $d_i$ is the corresponding ground-truth disease. This set of user observations defines a different set of probabilities $\mathcal{P}^{usr} = \{P(\mathbf{S} \mid D = d_1), P(\mathbf{S} \mid D = d_2), \dots, P(\mathbf{S} \mid D = d_n)\} \subseteq \mathcal{P}$, which is generally different from the one defined by the expert, $\mathcal{P}^{usr} \neq \mathcal{P}^{exp}$. The problem becomes that of finding a transferability function $T(\cdot)$ to be applied to the expert model such that $\mathcal{P}^{usr} = T(\mathcal{P}^{exp})$.

The problem of transferability is long-standing in machine learning and statistics and it has been addressed in causal terms, referred to as transportability [12, 14], as well as in statistical terms, in the context of supervised learning, where it is also known as covariate shift or sample selection bias [15, 16]. One of the most common approaches applies a direct correction to the learned probability distribution based on the estimates on the testing set [13]. Specifically inspired by the work presented in [17], we proposed a methodology, referred to as likelihood evidence and tailored to our BN-based application, to correct the expert-defined distribution $\mathcal{P}^{exp}$ towards the one derived from users, $\mathcal{P}^{usr}$. We define the likelihood evidence (or likelihood finding) for each random symptom variable $S_i \in \mathcal{S}$ of our BN-DSSApple. Specifically, when a symptom variable $S_i$ is observed and thus instantiated by a user, we assume that a certain degree of uncertainty is associated with it (i.e., the difference of knowledge and expertise between the user and the expert). We define the actual user observation with another random variable $O_i$, such that $Val(O_i) = Val(S_i)$, to distinguish it from the variable $S_i$ as it should be observed by an expert. We represent the uncertainty degree with a likelihood ratio $L(S_i)$, formally defined as:

$$L(S_i = s_i^j) = P(O_i = o_i^l \mid S_i = s_i^j) \tag{2}$$

which represents the probability of a user observing value $o_i^l \in Val(O_i)$ given that, in the same situation, the expert would have observed $s_i^j \in Val(S_i)$. Thus, we enrich our BN by adding, for each symptom variable $S_i$, a virtual likelihood evidence node $O_i$ that encodes the likelihood ratio $L(S_i)$, with $pa(O_i) = \{S_i\}$. The added set of random variables $\mathcal{O} = \{O_1, O_2, \dots, O_t\}$ is now the one observed by the user while providing the evidence $\mathbf{E}$ on the questions asked by the application, while the random variables in $\mathcal{S}$ become hidden. We finally need to define a new set of conditional probability tables $P(O_i \mid S_i)$ for each pair $(S_i, O_i) \in \mathcal{S} \times \mathcal{O}$. We adopt a direct estimation of these probabilities from the observed interactions of users with a set of apples $\mathcal{A}$ for which we know the value actually observed by the expert. Namely, for each state $s_i^j \in Val(S_i)$ of each variable $S_i \in \mathcal{S}$ we define a subset $\mathcal{A}_{s_i^j} \subseteq \mathcal{A}$ for which the value of the symptom variable $S_i$ observed by the expert is $S_i = s_i^j$. Thus, the conditional probability of the value $O_i = o_i^l$ observed by the users is defined as:

$$P(O_i = o_i^l \mid S_i = s_i^j) = \frac{1}{|\mathcal{A}_{s_i^j}|} \sum_{a \in \mathcal{A}_{s_i^j}} \mathbb{1}_{a}(o_i^l) \tag{3}$$

where $\mathbb{1}_{a}(o_i^l)$ is an indicator function which is equal to 1 if the user observed $O_i = o_i^l$ in apple $a$, and 0 otherwise. The conditional probability defined for the likelihood ratio is also referred to as the consensus among expert and users.

4. Experiments

4.1. User Study Evaluation

We conducted a large user study to evaluate the effectiveness of BN-DSSApple in recommending the correct diagnosis. Specifically, we divided the user study into three distinct phases to test the system behaviour under different circumstances. The task submitted to the users involved in our study was the same in all cases. The user received a “bucket” of infected apples, for which she had to find the correct diagnosis leveraging BN-DSSApple. Each target apple was simulated as a set of two high-definition photos depicting an internal and an external view of the target apple, for which the ground-truth disease was determined in the lab by genome sequencing. In each diagnostic round, the user had to carefully inspect the target apple and interact with the system by providing information (i.e., the evidence) about the symptoms and signs she was able to identify on the apple. At the end, BN-DSSApple returned a ranked list of three suggested diagnoses, i.e., the three diseases with the highest posterior given the available evidence, as computed by the BN. The three phases of the presented study differed in the number of users, their expertise level, and the number of distinct target apples involved. In detail, we performed:

• Single Expert Study (SES): a domain expert (the one who collaborated in the construction of the BN) interacted with the system to diagnose 21 target apples in a time-span of around 2 weeks.

• Single User Study (SUS): a single user (an MSc student in Biology) interacted with the system during the course of an internship, lasting around 3 months, to diagnose 131 target apples.

• Multiple Users Study (MUS): a group of 11 students of a Phytopathology class interacted with the system to diagnose a bucket of 7 target apples each. The apples were randomly sampled from the same set of 21 apples used for SES. The activity lasted for a total of 4 hours.

In Table 2 we summarize the different characteristics of the three user studies performed.

| study | # users | expertise | # apples | time-span |
| SES | 1 | high | 21 | 2 weeks |
| SUS | 1 | high-medium | 131 | 3 months |
| MUS | 11 | medium-low | 21 | 4 hours |

Table 2: Characteristics of the three user studies: Single Expert Study (SES), Single User Study (SUS), and Multiple Users Study (MUS).

4.2. Results

In Table 3 we report the results of the three user studies in terms of recall@k. To better formalize this metric, consider a situation in which a set $N$ of $n$ diagnoses is performed by BN-DSSApple. The set $N$ is composed of $n$ ranked lists of recommended diagnoses, namely $N = \{R^k_{a_1}, R^k_{a_2}, \dots, R^k_{a_n}\}$, where $a_i$ represents the $i$-th apple processed by the system. A generic $R^k_{a_i} = \{d^1_{a_i}, d^2_{a_i}, \dots, d^k_{a_i}\}$ is a ranked list of $k$ suggested diagnoses $d^j_{a_i}$ for apple $a_i$ with a specific ground-truth disease $t_{a_i}$. Thus, we formally define recall@k as:

$$recall@k = \frac{1}{n} \sum_{R^k_{a_i} \in N} \mathbb{1}_{R^k_{a_i}}(t_{a_i}) \tag{4}$$

where $\mathbb{1}_{R^k_{a_i}}(t_{a_i})$ is an indicator function which is equal to 1 if $t_{a_i} \in R^k_{a_i}$ and 0 otherwise.

| metric | SES | SUS | MUS | ZeroR |
| recall@1 | .905 | .489 | .286 | .143 |
| recall@2 | 1. | .656 | .403 | .286 |
| recall@3 | 1. | .763 | .571 | .429 |

Table 3: Recall@k for the three user studies performed, Single Expert Study (SES), Single User Study (SUS), and Multiple Users Study (MUS). The ZeroR benchmark is also reported.

The results presented in Table 3 show that the theoretical effectiveness of the BN-DSSApple model is very high. Specifically, an expert user (SES), with strong knowledge in the domain of post-harvest diseases of apples and a good capability of correctly identifying symptoms on a diseased apple, is able to reach a recall@1 above 90%. The performance of the system increases up to 100% recall when evaluated at a larger cut-off of suggested diseases. Of course, we have to consider that in the SES evaluation we are in the ideal situation where the expert user knows exactly how to look at and evaluate the symptoms requested by BN-DSSApple. A more realistic situation is depicted by the SUS evaluation. In this situation, a single user with a medium-high level of expertise had months of time to interact with the system by evaluating a very large set of apples (131). Recall@1 is still convincing (49%), i.e., the top-ranked disease is correct in about half of all diagnoses. The other metrics show that the system does not scale up as well at larger cut-offs, achieving 66% recall@2 and 76% recall@3 (the correct disease is within the first 3 recommendations in about three quarters of the cases). Finally, BN-DSSApple showed some limits in the situation where the users have limited expertise and training, and a limited amount of time (a few hours) to use the system, as in the MUS evaluation. In addition to the time and skill aspects, a lower intrinsic motivation to interact as accurately as possible with the system could be a partial explanation for the deviation. In this case, the measured recall of the system is significantly lower than in the two previous evaluations. In particular, recall@1 does not reach 30%, while the best result is achieved by recall@3 with a value of 57% (slightly more than half of the diagnoses include the correct disease in the top-3 recommendations). Nevertheless, despite the poor performance of BN-DSSApple in MUS, the collected results are still superior to the ZeroR benchmark, namely, a classifier which always suggests the class with the highest a priori probability. It is important to notice that the reported results for ZeroR relate to the situation in which the class (ground-truth disease) distribution is perfectly balanced, as for SES and MUS. In the comparison with ZeroR, the MUS evaluation for BN-DSSApple shows double the recall@1 (28.6% against 14.3%), while recall@2 and recall@3 are closer but still significantly better (+12 and +14 percentage points, respectively). The main cause of this mismatch of performance between the expert and average users can be identified in the problem of transferability of a knowledge-aware model. In the remainder of this section, we empirically analyze and explain this phenomenon, and test possible solutions to correct and alleviate it.

Foremost, we want to understand the impact of each expert-defined attribute in the model. In Table 4 we report the ranked list of attributes, based on the likelihood ratio (i.e., consensus) computed between the users of MUS and the expert of SES (which we consider as ground truth) in identifying the symptoms on the same set of 21 target apples. It is interesting to notice how effective the users are in identifying the principal symptoms and signs, presented by the application as boolean variables. Namely, Sclerotia (99%), Rot (96%), and Spot (95%) present a very high level of agreement with the domain expert, while Mycelium_spore (81%) and Halo (78%) receive a high consensus. Vice versa, some qualitative attributes related to the appearance or the consistency of the lesion and the rot are among the hardest to be correctly recognized by the users (i.e., they show a poor consensus with the expert). For example, Lesion_appearance and Rot_texture_pressure achieve a consensus below 50%, while Lesion_margin, Lesion_area, and Rot_texture_opaque are below 65%. Nevertheless, other categorical variables more related to quantitative aspects of the lesion are easier for the users to spot. This is the case of the variables Lesion_size, Lesion_surface, Lesion_form, and Lesion_crack, which show a consensus between 84% and 79%. Finally, it is interesting to notice the behavior of the variables of the Lesion origin category. Most of them are quite easy for the user to identify, with a consensus above 90% with the expert. Nevertheless, two of them, namely Wound and Lenticel, are equally difficult to recognize, with a consensus of around 59%. This is probably due to the fact that the two origins might be perceived as quite similar and could be confused without a careful inspection of the apple skin.

| rank | attribute | consensus |
| 1 | Sclerotia | 0.988 |
| 2 | Calyx | 0.985 |
| 3 | Rot | 0.964 |
| 4 | Spot | 0.950 |
| 5 | Stalk | 0.926 |
| 6 | Core | 0.917 |
| 7 | Spore_distribution | 0.872 |
| 8 | Lesion_size | 0.837 |
| 9 | Lesion_surface | 0.837 |
| 10 | Number_lesions | 0.817 |
| 11 | Mycelium_spore | 0.809 |
| 12 | Lesion_form | 0.792 |
| 13 | Lesion_crack | 0.790 |
| 14 | Halo | 0.782 |
| 15 | Rot_shape | 0.760 |
| 16 | Rot_texture_dry | 0.755 |
| 17 | Halo_colour | 0.750 |
| 18 | Rot_margin | 0.740 |
| 19 | Spore_colour | 0.731 |
| 20 | Spore_origin | 0.694 |
| 21 | Lesion_margin | 0.636 |
| 22 | Lesion_area | 0.623 |
| 23 | Rot_texture_opaque | 0.607 |
| 24 | Wound | 0.594 |
| 25 | Lenticel | 0.588 |
| 26 | Lesion_appearance | 0.417 |
| 27 | Rot_texture_pressure | 0.321 |

Table 4: Attributes ranking based on the rate of agreement (i.e., consensus) of the users of MUS with the domain expert of SES.

In Figure 3 we plot the recall@k achieved by BN-DSSApple for MUS and SES, by incrementally selecting the attributes based on the consensus ranking reported in Table 4. On the x-axis, we report the number of attributes in each model configuration. Namely, the $i$-th value represents the BN model built with the attribute set $\mathcal{A}_i = \{a_1, a_2, \dots, a_{i-1}, a_i\}$, where the rank $j$ of attribute $a_j$ is defined by expert consensus, as reported in Table 4. From the graph in Figure 3a for the MUS evaluation, we immediately notice that the model achieves the best performance for recall@1 and recall@2 with around 8-9 attributes. A larger set of attributes is detrimental, causing a drop of recall of at least 10% in both situations. It is interesting to notice that these performances seem to recover with the models based on 21-22 attributes, without reaching the optimal level. In fact, for the recall@3 metric the global optimum is achieved by the model with 20 attributes, with a significant improvement of around 10% on the smaller attribute set configurations. Opposite considerations emerge from the graph in Figure 3b for the SES evaluation. In this case, the recall@k metrics are lin- […]

Figure 3: Recall@k by incremental selection of attributes based on ranking of Table 4 for MUS (a) and SES (b).

[…] data with the Maximum Likelihood Estimation (MLE) algorithm. The recall@1 improvement is marginal (around +2.5%), while recall@2 shows a +6.5% with respect to the plain BN model.
We already commented the large im- early correlated to the number of attributes, and the best provements achieved by selecting the optimal attribute performances are always achieved with the full set of set (BEST-ATTR model), whereas the gain in recall is be- attributes. This means that the expert is able to correctly tween +14% and +21%. Of course, this analysis is derived instantiate even the harder variables, by understanding a posteriori, where the optimal number of attributes is the status of an infected apple. Furthermore, this “hard- fixed after the evaluation. For this reason, the achieve- to-recognize” attributes are necessary to significantly ment of the model equipped with likelihood evidence improve the diagnostic effectiveness of the model and (LH-EV, methodology detailed in Section 3.4, where ex- reach the highest performances in term of recall@k. For pert ground-truth data are derived from SES) is even instance, in both recall@2 and recall@3 the BN model greater. For recall@1 the LH-EV outperforms TRAIN-BN registers around +20% improvement by considering the of around +4%, while being inferior to BEST-ATTR by full set of 27 attributes instead of just considering 21 around -8%. For recall@2, instead, the likelihood evi- attributes (i.e., by discarding the 6 “hardest” attributes, dence achieves the best result outperforming also BEST- with lowest consensus). ATTR by a +2.5%. Finally, for recall@3 the LH-EV model significantly outscores TRAIN-BN (+13%), while being BN TRAIN-BN BEST-ATTR LH-EV comparable with the results of BEST-ATTR. recall@1 .286 .312 .429 (8) .351 recall@2 recall@3 .403 .571 .468 .636 .597 (9) .779 (20) .623 .766 5. 
Conclusions Table 5 This case study focused on knowledge elicitation and Recall@k for MUS when applying the plain BN-DSSApple construction as well as discussed the application of likeli- (BN), the trained BN-DSSApple on MUS data (TRAIN-BN), hood evidence to enhance performance and transferabil- the incremental best attribute selection (BEST-ATTR), and ity of the knowledge-based recommendation system BN- the BN-DSSApple with likelihood evidence (LH-EV). In BEST- DSSApple. Major limitations of the presented approach concern the fact that the knowledge base is fully based ATTR column, we report the results for the optimal attribute set, with the number of selected attributes in parenthesis.on qualitatively probability elicitation from a single hu- man expert. Furthermore, transferability problem of the Finally, in Table 5 we compare the recall@k results for crafted BN must be additionally investigated. Further the MUS evaluation of the improved versions of the BN development of the method to other domains as well as model, in order to cope with the transferability problem additional testing is required. Currently, deployment for discussed in Section 3.4. Firstly, the smallest improve- real-life evaluation is ongoing. In future work, the inte- ment is provided by the trained BN model (dubbed as gration of additional evidence like microscopic images TRAIN-BN), where the parameters are fine-tuned on MUS of fungal spores will be considered. References in: Proceedings of the 2011 IEEE 11th Interna- tional Conference on Data Mining Workshops, [1] T. B. Sutton, H. S. Aldwinckle, A. Agnello, J. F. Wal- ICDMW ’11, IEEE Computer Society, USA, 2011, genbach (Eds.), Compendium of apple and pear dis- p. 540–547. URL: https://doi.org/10.1109/ICDMW. eases and pests, 2 ed., APS press, 2014. 2011.169. doi:1 0 . 1 1 0 9 / I C D M W . 2 0 1 1 . 1 6 9 . [2] P. Maxin, M. Williams, R. W. Weber, Control of fun- [13] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, G. 
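As an illustration of the recall@k figures discussed in Section 4.2, the following Python sketch computes recall@k under the assumption that it is the fraction of diagnoses whose top-k ranked list contains the ground-truth disease, which is how the text reads the 76% recall@3 ("within the first 3 recommendations in 3/4 of the cases"). The function names `recall_at_k` and `zero_r_ranking` and the disease labels `d1` to `d7` are hypothetical. The choice of seven equally represented diseases with three apples each is also an assumption; it is merely what a 14.3% ZeroR recall@1 on 21 balanced apples would imply.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def recall_at_k(ranked_lists: Sequence[Sequence[Hashable]],
                ground_truth: Sequence[Hashable],
                k: int) -> float:
    """Fraction of diagnoses whose top-k list contains the true disease."""
    if len(ranked_lists) != len(ground_truth):
        raise ValueError("one ground-truth label is needed per ranked list")
    if not ranked_lists:
        raise ValueError("at least one diagnosis is required")
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_lists, ground_truth))
    return hits / len(ranked_lists)


def zero_r_ranking(ground_truth: Sequence[Hashable]) -> List[Hashable]:
    """ZeroR-style baseline: rank classes by prior frequency only.

    The same ranking is returned for every apple; ties keep the order in
    which the labels were first seen.
    """
    return [label for label, _ in Counter(ground_truth).most_common()]


# Hypothetical balanced setting: 7 diseases, 3 apples each (21 apples).
labels = [f"d{i}" for i in range(1, 8) for _ in range(3)]
baseline = [zero_r_ranking(labels)] * len(labels)

for k in (1, 2, 3):
    print(f"ZeroR recall@{k}: {recall_at_k(baseline, labels, k):.3f}")
# prints 0.143, 0.286 and 0.429, i.e. k/7 for a balanced 7-class problem
```

Under these assumptions the baseline values are consistent with the margins reported for MUS: .403 against .286 for recall@2 (about +12%) and .571 against .429 for recall@3 (about +14%).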
Zhang, gal storage rots of apples by hot-water treatments: Transfer learning using computational intelligence: A northern european perspective, Erwerbs-Obstbau A survey, Knowl. Based Syst. 80 (2015) 14–23. 56 (2014) 25–34. [14] A. Subbaswamy, S. Saria, Counterfactual normal- [3] D. Koller, N. Friedman, Probabilistic Graphical ization: Proactively addressing dataset shift using Models: Principles and Techniques, Adaptive causal mechanisms, in: R. Silva, A. Globerson, computation and machine learning, MIT Press, A. Globerson (Eds.), 34th Conference on Uncer- 2009. URL: https://books.google.co.in/books?id= tainty in Artificial Intelligence 2018, UAI 2018, vol- 7dzpHCHzNQ4C. ume 2, Association For Uncertainty in Artificial [4] U. B. Kjaerulff, A. L. Madsen, Bayesian Networks Intelligence (AUAI), 2018, pp. 947–957. 34th Confer- and Influence Diagrams: A Guide to Construction ence on Uncertainty in Artificial Intelligence 2018, and Analysis, 1st ed., Springer Publishing Company, UAI 2018 ; Conference date: 06-08-2018 Through Incorporated, 2010. 10-08-2018. [5] A. Niederkofler, S. Baric, G. Guizzardi, G. Sotto- [15] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, cornola, M. Zanker, Knowledge models for diag- B. Scholkopf, Correcting sample selection bias by nosing postharvest diseases of apples, in: Proceed- unlabeled data, in: Proceedings of the 19th Interna- ings of the Joint Ontology Workshops 2019 Episode tional Conference on Neural Information Process- V: The Styrian Autumn of Ontology, Graz, Aus- ing Systems, NIPS’06, MIT Press, Cambridge, MA, tria, September 23-25, 2019, volume 2518 of CEUR USA, 2006, p. 601–608. Workshop Proceedings, CEUR-WS.org, 2019. URL: [16] M. Sugiyama, S. Nakajima, H. Kashima, P. v. Bü- http://ceur-ws.org/Vol-2518/paper-ODLS6.pdf. nau, M. Kawanabe, Direct importance estimation [6] M. Zanker, M. Jessenitschnig, W. 
Schmid, Prefer- with model selection and its application to covari- ence reasoning with soft constraints in constraint- ate shift adaptation, in: Proceedings of the 20th based recommender systems, Constraints 15 (2010) International Conference on Neural Information 574–595. Processing Systems, NIPS’07, Curran Associates [7] M. B. Messaoud, P. Leray, N. B. Amor, Sem- Inc., Red Hook, NY, USA, 2007, p. 1433–1440. cado: A serendipitous strategy for causal discov- [17] A. B. Mrad, V. Delcroix, S. Piechowiak, P. Leicester, ery and ontology evolution., Knowl.-Based Syst. M. Abid, An explication of uncertain evidence in 76 (2015) 79–95. URL: http://dblp.uni-trier.de/db/ bayesian networks: likelihood evidence and proba- journals/kbs/kbs76.html#MessaoudLA15. bilistic evidence - uncertain evidence in bayesian [8] A. M. Kalet, J. N. Doctor, J. H. Gennari, M. H. networks, Appl. Intell. 43 (2015) 802–824. URL: Phillips, Developing bayesian networks from a https://doi.org/10.1007/s10489-015-0678-6. doi:1 0 . dependency‐layered ontology: A proof‐of‐concept 1007/s10489- 015- 0678- 6. in radiation oncology, Medical Physics 44 (2017) 4350–4359. doi:1 0 . 1 0 0 2 / m p . 1 2 3 4 0 . [9] S. Fenz, An ontology-based approach for construct- ing bayesian networks, Data Knowl. Eng. 73 (2012) 73–88. URL: http://dx.doi.org/10.1016/j.datak.2011. 12.001. doi:1 0 . 1 0 1 6 / j . d a t a k . 2 0 1 1 . 1 2 . 0 0 1 . [10] L. C. van der Gaag, S. Renooij, C. L. M. Witteman, B. M. P. Aleman, B. G. Taal, How to elicit many probabilities, in: Proceedings of the Fifteenth Con- ference on Uncertainty in Artificial Intelligence, UAI’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, p. 647–654. [11] A. T. Ihler, J. W. Fischer III, A. S. Willsky, Loopy belief propagation: Convergence and effects of mes- sage errors, J. Mach. Learn. Res. 6 (2005) 905–936. [12] J. Pearl, E. Bareinboim, Transportability of causal and statistical relations: A formal approach,
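The consensus ranking of Table 4 and the nested attribute sets $\mathcal{A}_i$ used for Figure 3 can be sketched as follows. This is a minimal sketch, assuming that consensus is the plain rate of agreement between user and expert answers on the same apple, as the caption of Table 4 suggests; the exact computation is not spelled out in this section. The function names, the data layout, and the attribute values ("yes", "no", "sunken", "flat") are hypothetical; only the attribute names are taken from Table 4.

```python
from typing import Dict, List, Tuple

Observation = Dict[str, str]  # attribute name -> observed value


def consensus_by_attribute(user_obs: List[Tuple[str, Observation]],
                           expert_obs: Dict[str, Observation]) -> Dict[str, float]:
    """Share of user answers that match the expert's answer, per attribute."""
    agree: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for apple_id, answers in user_obs:
        expert_answers = expert_obs[apple_id]
        for attribute, value in answers.items():
            if attribute not in expert_answers:
                continue  # the expert did not instantiate this attribute
            total[attribute] = total.get(attribute, 0) + 1
            agree[attribute] = agree.get(attribute, 0) + int(value == expert_answers[attribute])
    return {attribute: agree[attribute] / total[attribute] for attribute in total}


def rank_attributes(consensus: Dict[str, float]) -> List[str]:
    """Attributes sorted by decreasing consensus (name breaks ties)."""
    return sorted(consensus, key=lambda a: (-consensus[a], a))


def nested_attribute_sets(ranking: List[str]) -> List[List[str]]:
    """A_1, A_2, ..., A_m: the top-i attributes for every i."""
    return [ranking[:i] for i in range(1, len(ranking) + 1)]


# Toy data: one expert answer sheet per apple, several user answer sheets.
expert = {
    "apple_1": {"Rot": "yes", "Halo": "no", "Lesion_appearance": "sunken"},
    "apple_2": {"Rot": "yes", "Halo": "yes", "Lesion_appearance": "flat"},
}
users = [
    ("apple_1", {"Rot": "yes", "Halo": "no", "Lesion_appearance": "flat"}),
    ("apple_1", {"Rot": "yes", "Halo": "yes", "Lesion_appearance": "flat"}),
    ("apple_2", {"Rot": "yes", "Halo": "yes", "Lesion_appearance": "sunken"}),
    ("apple_2", {"Rot": "yes", "Halo": "yes", "Lesion_appearance": "flat"}),
]

ranking = rank_attributes(consensus_by_attribute(users, expert))
print(ranking)  # ['Rot', 'Halo', 'Lesion_appearance']
for attribute_set in nested_attribute_sets(ranking):
    print(len(attribute_set), attribute_set)
```

Each nested set would then be used to rebuild and re-evaluate the BN model, which is the step that produces one point per attribute count on the x-axis of Figure 3; that evaluation step is left out here because it depends on the BN implementation.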