Knowledge Acquisition in the construction of ontologies: a case study in the domain of hematology

Fabrício M. Mendonça1,*, Kátia C. Coelho1, André Q. Andrade1,2 and Mauricio B. Almeida3
1 Graduate Program in Information Science, Federal University of Minas Gerais, Brazil
2 Institute for Medical Informatics, Medical University of Graz, Austria
3 Department of Information Theory and Management, Federal University of Minas Gerais, Brazil
* To whom correspondence should be addressed: fabriciommendonca@gmail.com

ABSTRACT

The activities of organizing knowledge recorded in texts and obtaining knowledge from human experts (the knowledge acquisition process) are essential for scientific development. In this article, we propose methodological steps for knowledge acquisition, which have been applied to the construction of biomedical ontologies. The methodological steps are tested in a real case of knowledge acquisition in the domain of human blood. We hope to contribute to the improvement of knowledge acquisition for the representation of scientific knowledge in ontologies.

1 INTRODUCTION

Ontologies have been proposed as an alternative for creating representations of reality suitable for computers. At least four activities are essential in the development of ontologies: specification, knowledge acquisition, conceptualization and formalization. In knowledge acquisition (KA), the experience available in the literature of diverse fields mentions difficulties in communication between experts and professionals who deal with information (Boose, 1990).

This article investigates the activity of KA within the scope of biomedicine. In order to explore the activity, we propose procedures for KA employing the best practices referenced in the literature. We systematize these procedures in a list of methodological steps with the aim of testing their feasibility in a real case.

The empirical research is conducted within the scope of a biomedical project focused on human blood. The knowledge acquisition results have been used in the development of a knowledge base for scientific and educational applications related to human blood. Descriptions of different stages of the research are provided as examples throughout the article. The main contributions are the aforementioned list of steps and observations made in real situations with the aim of improving KA performance.

The remainder of this paper is organized as follows: section 2 reviews the literature on KA. Section 3 explains the theoretical rationale, the systematization and the tools that compose the KA methodology. Section 4 presents the results. Section 5 discusses observations of interest for the next phases of the research. Finally, section 6 puts forward our final remarks.

2 BACKGROUND

2.1 An overview of Knowledge Acquisition

The KA activity generally includes the collection, analysis, structuring and validation of knowledge for representation purposes (Hua, 2008). It is an activity composed of a set of tasks that employ computer-based and manual techniques (Gaines, 2003; Boose & Gaines, 1989; Shadbolt, 2005). A multitude of definitions for KA can be found (Shaw & Gaines, 1996; Scott & Clayton, 1991; Payne et al, 2007), and the theories and methods that support KA activities rely on diverse academic research fields. Ways of acquiring and representing knowledge come from Computer Science (Compton & Jansen, 1989), Cognitive Science (Hawkins, 1983), Linguistics (Campbell et al, 1998) and Psychology (Harris, 1976).

2.2 Classification of KA techniques

KA techniques can be classified into manual techniques and computer-based techniques (Boose, 1990). In general, the manual techniques are rooted in Psychology (Kelly, 1955) and computer-based techniques are classified as automatic or semi-automatic. KA can also be classified according to the knowledge obtained in the process. The assumption that different methods of elicitation result in different types of knowledge is known as the differential access hypothesis (Hoffman et al, 1995). In addition, KA can be classified according to application methods such as protocol-generation techniques, protocol-analysis techniques, matrix-based techniques and sorting techniques (Shadbolt & Swallow, 1993).

Protocol-generation techniques include interviews. The most well-known technique for interviews is the teachback technique (Hua, 2008; Shadbolt, 2005). Protocol-analysis techniques are used in the transcription of interviews in order to identify different knowledge types. Matrix-based techniques involve the diagrammatic organization of problems. The most well-known technique is the repertory grid (Hua, 2008; Shadbolt, 2005). Sorting techniques are techniques in which the domain entities are classified in order to check how an expert classifies the knowledge. The most well-known technique is card sorting (Hua, 2008; Hoffman et al, 1995). The diagram-based technique consists of the creation and use of network representations, such as conceptual maps (Corbridge et al, 1994). A methodology for KA that combines card sorting and laddering can be employed in the construction of ontologies (Wang et al, 2006).

2.3 KA in Biomedicine

Natural Language Processing (NLP) techniques are common in the biomedical domain (Hersh, 2009; Verspoor et al, 2006). These techniques can be divided into two main streams: the rule-based approach (Friedman et al, 2004; Hahn, Romacker & Schulz, 2002) and the statistical approach (Taira & Soderland, 1999; Sebastiani, 2002). A comparison between the two methods involved the testing of systems using both approaches to the automatic categorization of MEDLINE abstracts (Humphrey et al, 2009) and found comparable results for most evaluated items. The results favored the statistical approach, though the authors suggested the combination of both approaches.

3 METHODS

3.1 Case study: knowledge context and domain

This work explores the best practices in an ongoing KA scenario applied within the scope of the Blood Project (Almeida, Proietti & Smith, 2011), an information organization initiative in hematology. The project is taking place in a medical institution that is responsible for hematology and blood transfusion research and that offers healthcare services for a population of around 20 million people.

3.2 Methodological steps

In this section, we describe the list of steps for KA. Then, we present a synoptic table summarizing the tasks involved and systematizing the steps in the list, which was divided into four main phases: extraction, elicitation, validation and refinement.

In the extraction phase we applied NLP techniques and tools in order to obtain candidate terms for the ontology. KA from texts consists of three main activities: construction of a corpus related to blood transfusion, codification of this corpus and information retrieval from the corpus.

The subset of the corpus related to blood transfusion uses the manual of the American Association of Blood Banks (AABB) as a source. From the AABB website [1] we downloaded the thirty-two chapters that comprise the seventeenth edition of the manual. From this material, twenty-seven chapters were processed by the tool used for codification. This material was selected as a sample according to the stage of the research underway when writing this paper. In future work, disease processes and clinical findings will also be considered.

In the activity of codification we employed Sketch Engine [2], an online tool for the creation and analysis of linguistic corpora. The fragmentation of the text into morphemes and the identification of the grammatical classes are automatically performed.

After the codification activity, we proceeded with the information retrieval from the corpus with the aim of identifying terms used to describe blood transfusion procedures. In order to do so, we used word suffixes common in medical terms (Lovis, Baud & Rassinoux, 1998), such as -apheresis, -centesis, -desis, -ectomy and -opsy, to mention but a few. Then, we built regular expressions using the Sketch Engine corpus query language in order to retrieve terms related to procedures, as well as the absolute frequencies with which they occur in the corpus.

As a final task of the extraction phase, we analyzed the morphological productivity of the terms obtained using the British National Corpus (BNC) [3] as a reference. The analysis consisted of comparing the frequency of each term in the corpus with its frequency in the reference corpus. In order to proceed with the morphological productivity analysis we used the AntConc [4] tool.

[1] Available at: . Accessed: July 23, 2010
[2] Available at: . Accessed: Dec. 15, 2010
[3] Available at: . Accessed: Nov. 30, 2011
[4] Available at: . Accessed: July 23, 2011

In the elicitation phase, we made use of the terms obtained in the extraction phase, which were employed as guidelines to start the contact with experts. This phase consisted of holding interviews and applying KA techniques with experts: doctors, biologists and researchers. During the course of the interviews, sorting and matrix techniques were applied. The cycle that characterizes the clinical process, ranging from the development of an infectious disease through its treatment, was adopted to guide the approach taken with the experts. For modeling the domain, we adopted the disease as disposition approach, as proposed by Scheuermann, Ceusters & Smith (2009). The three major stages that comprise that cycle are: etiological process, course of disease and therapeutic response. In order to apply the reasoning described so far, a template was created in Protégé-Frames.

In the stage called etiological process, there is a healthy human body with characteristics that are normal according to medical parameters. In the pre-clinical manifestation of the disease, the body develops disorders, which are bearers of dispositions. Such dispositions are naturally associated with the entities' existence, for example, the disposition of the human body to get sick (Smith, 2008). There are already changes in the patient, but they have not been noticed. The etiological process stage can be represented as follows:

ETIOLOGICAL PROCESS => produces => DISORDER => bears => DISPOSITION.

The course of disease stage starts with the clinical manifestation of the disease (disposition). At this moment, the disorder manifests itself through symptoms, which the patient is able to identify. Then, a doctor identifies the disease signs through a physical exam or through a report of the patient. In this stage, it is possible to determine the clinical phenotype, that is, the principal observable characteristic of that disease. The course of disease stage can be represented as follows:

DISPOSITION => realized in => PATHOLOGICAL PROCESS => produces => ABNORMAL BODY FEATURES.

In the therapeutic response stage, a sample is taken from the infected part of the body in order to perform laboratory tests. At this point, it is possible to establish a treatment plan so that the body may return to normality. The plan is the result of a diagnosis founded on the interpretative process of a clinical framework. The clinical framework is composed of symptom representation records as well as physical and laboratory exam results. The therapeutic response stage can be represented as follows:

ABNORMAL BODY CONDITION => recognized as => SIGN AND SYMPTOM => used in => INTERPRETATIVE PROCESS.

The third phase of the proposed list of steps for KA, called the validation phase, uses wiki science tools for collaborative validation of candidate terms for an ontology. After the elicitation phase, according to the knowledge obtained, candidate terms are transferred to a wiki to then be validated by experts online.

The fourth phase of the proposed list of steps, called the refinement phase, uses a second template, also created using Protégé-Frames. The goal was to record information about how to integrate the different levels of granularity required to understand a disease and its manifestations. This integration involves obtaining the relations between the parts of the body that a certain disease affects, the related genes and the related proteins.

Finally, the steps put forward so far are gathered together, thus creating the list of steps for KA.

Phase | Task | Description | Resources and people involved
(1) Extraction | 1.1 build a corpus | Create a corpus from texts | Medical texts; K. engineer
(1) Extraction | 1.2 codification | Automatically fragment texts | Sketch Engine tool; K. engineer
(1) Extraction | 1.3 information retrieval | Obtain terms through suffixes | Sketch Engine tool; K. engineer
(2) Contact | 2.1 obtain knowledge | Hold interviews with experts | Template Protégé and teachback; K. engineer, experts
(2) Contact | 2.2 know the terminology | Identify experts' rationale | Matrix techniques; K. engineer and expert
(2) Contact | 2.3 see ad-hoc organization | Understand how experts sort concepts | Sorting techniques; Experts
(3) Validation | 3.1 validate knowledge | Obtain approval of terms acquired | Wiki page; Expert
(3) Validation | 3.2 updating | Update data after each validation | Wiki page; K. engineer
(4) Refinement | 4.1 integration between granularities | Characterize related genes, proteins, etc | Template Protégé; K. engineer
(4) Refinement | 4.2 connection with top-level | Connect data with other ontologies | Template Protégé; K. engineer

Table 1: KA list of steps proposed

4 RESULTS

One evident result is the methodological list of steps described in the previous section, which has been tested and improved over the course of the research (Table 1).

In the codification activity (extraction phase), 369,741 tokens were automatically identified in the selected texts and related to parts-of-speech. Subsequently, in the information retrieval phase, 57 terms related to blood transfusion procedures were identified. Table 2 depicts the top-five terms from the set of 57 terms retrieved, which were used as a basis for starting interviews with experts:

Term | Frequency
apheresis | 124
phlebotomy | 32
cytometry | 20
cordocentesis | 16
plasmapheresis | 15

Table 2: top-five terms retrieved and absolute frequency

The rationale applied in the elicitation phase made it possible to understand the major stages of the disease manifestation. Table 3 presents an example of blood disease analysis following this rationale for Bernard-Soulier Syndrome:

Etiological process | inheritance of a defect in the platelet membrane receptor that affects the hemostasis
Disorder | platelets with a glycoprotein Ib complex (GP Ib) abnormality, either quantitative (absence of GP Ib) or qualitative (mutation of GP1BA, GP1BB, GP9)
Disposition | Bernard-Soulier Syndrome (A, B or C)
Pathological process | abnormal platelet adhesion to the extracellular matrix during the initial phase of plug formation
Symptoms | bleeding, hematomas
Signs | excessive bleeding, gingival bleeding, menorrhagia, purpura, epistaxis, gastrointestinal bleeding

Table 3: KA reasoning applied to a blood disease

An example of a Protégé-Frames template related to Bernard-Soulier syndrome is depicted in Fig. 4.

Fig. 4. Protégé-Frames template with example about blood disease

Finally, it is worth mentioning that at the time this article was being written, the ontology developed in OWL had more than 300 classes and 50 properties, and practically all the methodological steps were up and running, providing data for different ontology parts.

5 DISCUSSION

In each stage of the KA process, as depicted in Table 1, it is possible to identify issues to be discussed:

i) The extraction stage was undertaken mainly by a knowledge engineer using NLP tools applied to sources suggested by experts. As a means of producing a list of relevant terms in a domain, the extraction was useful in preventing the knowledge engineers from having to start interviews from scratch. In general, the terms selected were useful for describing the domain according to the opinions of experts.

ii) The contact stage is the heart of KA processes, since it is within this stage that experts share their knowledge. This stage was conducted as a cycle that involved interviews interspersed with attempts to understand the rationale used by experts to understand the phenomena in the domain. As part of this attempt, the knowledge engineer employed sorting and matrix techniques. Regarding the interview based on an ontological disease model, it is worth reporting that the results were very reasonable, insofar as the experts approved of the framework organized into the etiological process, course of disease and therapeutic response proposed by Scheuermann, Ceusters & Smith (2009).

iii) The validation stage was conducted, in many cases, during the interviews, mainly at the beginning of the process when experts did not have experience with wiki pages. In general, the validation confirmed the interviews and the teachback technique performed previously. It is worth noting that the difficulties in the validation stage did not occur among experts validating their own prior knowledge. Rather, the majority of cases of non-validity occurred when an expert evaluated the knowledge provided by another expert. However, the differences did not seem irreconcilable. In many cases, experts suggested referring to their own scientific publications to resolve outstanding issues.

iv) The refinement stage was conducted in the same way as the contact stage. Indeed, it was conducted as an interview merged with work to understand the rationale behind, and the organization of, the experts' concepts. When analyzing the results, one can conclude that this stage provides useful insights into the building of ontologies in terms of interoperability. This is because the refinement stage is based on the premise of connection to top-level ontologies.

Observations made over the course of all these stages allowed us to identify problems that occur in the KA process, for which solutions have been sought as the research has continued. These problems are the result of the influence of the following factors:

i) factors related to the expert profile, such as: training, experience and previous participation in similar projects, limitations in expertise;

ii) contextual factors, such as: cultural, geographical, political and financial issues, lack of access to information sources and deficiency in organizational structure;

iii) factors related to the interaction between expert and knowledge engineer, such as: short-term outlook (KA is seen as "additional work") and domain complexity;

iv) factors that make recording results difficult, such as: non-approval by the expert of the results of the activity and constant advancement in the scientific field.

Concerning the proposed elicitation technique (section 3.2), which is based on Scheuermann, Ceusters & Smith (2009), one can argue that there is a methodological pitfall when using a formal disease model to acquire knowledge. It could be argued that relevant domain knowledge could be missed by doing so, because what would be acquired is something of a pre-conceived frame of meaning. However, we observed that some sort of structure was required to conduct the activity and save time, mainly considering the limited availability of the experts. According to our experience in this case study, knowledge missed for this reason may be dealt with using complementary techniques. The interviewees were not constrained when talking, and teachback techniques were employed to give them the chance to clear up misunderstandings and flaws. In addition, the ontological disease model was used only to organize the interview and to make notes, not in an attempt to formalize knowledge directly.

The NLP techniques applied aimed at collecting candidate terms for the ontology, instead of trying to populate it directly. In this sense, the use of those techniques was important to obtain a first list of candidate terms. Even though NLP is not considered a good source for ontological knowledge, it may be useful when dealing with a large volume of material. Another issue when using NLP was the size of our sample: in order to build a significant corpus, one should have at least 10 million words, which were not available to us.

6 CONCLUSION

This article has proposed a list of steps for KA, which are based on techniques found in the literature. The steps in the list have been tested, proving their viability. The work described includes a project in which research was conducted to identify the best practices for, and difficulties in, performing the KA activities with hematology experts within the scope of creating an ontology. The list of steps is a partial result that has been improved based on direct observation.

One conclusion we could draw from the overall experience is that KA is a very time-consuming and expensive process. This may explain why it is neglected in many cases. In future work, we intend to further clarify in which context each technique is most suitable. This could be done with assistance from experts, taking into account their time limitations. Regardless, in this case study, some techniques were chosen, as mentioned in the last column of Table 1. The list of steps has been successfully applied in other related domains. It appears to be a systematized alternative for creating ontologies using a rational means of approaching experts.

ACKNOWLEDGEMENTS

This work is partially supported by Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG), Governo do Estado de Minas Gerais, Brazil, Rua Raul Pompéia, nº101 - São Pedro, Belo Horizonte, MG, 30.330-080, Brazil. This work is partially supported by CNPQ (Conselho Nacional de Pesquisa e Desenvolvimento).

REFERENCES

Almeida, M.B., Proietti, A.B., AI J., Smith, B. The Blood Ontology: an ontology in the domain of hematology. (2011) Proceedings of ICBO.
Boose, J.H. Knowledge acquisition tools, methods, and mediating representations. In: Japanese Knowledge Acq. for KBS, JKAW, 1. 1990.
Boose, J.H and Gaines, B.R. Knowledge Acquisition for Knowledge-Based Systems. 1989. Machine Learning, v. 4, p. 377-394.
Campbell, K.E. et al. Representing thoughts, words, and things in the UMLS. J Am Med Inform Assoc., v. 5, n. 5, p. 421-31, 1998.
Compton, P. and Jansen, R. A philosophical basis for knowledge acquisition. Knowledge Acquisition. European KA for KBS. 1989.
Corbridge C, Rugg G, Major N, Shadbolt N.R, Burton A. Laddering: technique and tool use in knowledge acquisition. J. of Knowledge Acquisition, 1994, p. 315-341. Available from: https://blog.itu.dk/SLR-F2010/files/2010/07/paper-1-pages-1-12-15.pdf
Friedman, C, Shagina, L., Lussier, Y., Hripcsak, G. Automated encoding of clinical documents based on NLP. Journal of the American Medical Informatics Association. 2004 Sep-Oct;11:392-402.
Gaines, B.R. Organizational Knowledge Acquisition. In: Handbook on knowledge management. Birkhäuser: Springer. 2003, 700 p.
Hahn, U., Romacker M., Schulz, S. Medsyndikate - a natural language system for the extraction of medical information from findings reports. International Journal of Medical Informatics. 2002;67:63-74.
Harris Z. On a theory of Language. The Journal of Philosophy, v. 73, n. 10, p. 253-276, 1976.
Hawkins, D. An analysis of expert thinking. International Journal of Man-Machine Studies. v. 18, p. 1-47, Jan. 1983.
Hersh, W. Information Retrieval: A Health and Biomedical Perspective. 3ed: Springer 2009.
Hoffman, R.R., Shadbolt, N.R., Burton, A.M., Klein, G. Eliciting knowledge from experts. Organizational Behavior and Decision Processes. v. 62, n. 2, 1995. pg 129-158.
Hua, J. Study on Knowledge Acquisition Techniques. 2nd Inter. Symp. on Intelligent Information Technology App. 2008.
Humphrey, S.M., Neveol, A., Browne, A., Gobeil, J., Ruch, P., Darmoni, SJ. Comparing a Rule-Based Versus Statistical System for Automatic Categorization of MEDLINE Documents According to Biomedical Specialty. J Amer Soc for Inf Sci and Tech. 2009 Dec; 60:2530-9.
Kelly, G.A. The psychology of personal constructs. New York: Norton, 1955.
Lovis, C., Baud, R., Rassinoux, A.M., Michel, P. A., Scherrer, J.R. (1998). Medical dictionaries for patient encoding systems: a methodology. Art. Int. in Medicine. Vol. 14, Issue 1, pp. 201-214.
Milton, N., Clarke, D., Shadbolt, N. Knowledge engineering and psychology. Int. J. of Human-Computer St., v. 64, n. 12, p. 1214-1229. 2006.
Payne P.R, Mendonça E.A, Johnson S.B, Starren J.B. Conceptual knowledge acquisition in biomedicine: a methodological review. J Biomed Inform. 2007. v 40, n. 5, p. 82-602.
Scheuermann, R.H., Ceusters, W., Smith, B. Toward an ontological treatment of disease and diagnosis. Proceedings 2009 Summit on Transl. Bioinf., San Francisco, CA, pp. 116-120.
Scott, A.C., Clayton, J.E., Gibson, E.L. A practical guide to knowledge acquisition. Addison-Wesley, 1991, 509 p.
Sebastiani, F. Machine learning in automated text categorization. ACM Comput Surv. 2002 Mar; 34:1-47.
Shadbolt, N. Eliciting Expertise. In: Evaluation of Human Work. Ed. Taylor & Francis. 2005.
Shadbolt, N., Swallow, S. Epistemics: Knowledge Acquisition.
Shaw, M.L.G. and Gaines, B.R. Soft. Engineering Journal, 1996. p. 149-165.
Smith, B. New Desiderata for Biomedical Terminologies. In Munn, K.; Smith, B. (Ed.). Applied Ontology. Frankfurt: Verlag. 2008. pp. 21-39.
Taira, R.K and Soderland, S.G. Statistical natural language processor for medical reports. J. Am. Med. Infor. Ass. 1999:970-4.
Verspoor, K., Bretonnel, C. K., Goertzel, K., Mani, I. Linking natural language processing and biology: towards deeper biological literature analysis. BioNLP '06: 2006; Morristown USA: ACM; 2006. p. iii-iv.
Wang Y, Sure Y, Stevens R, Rector, A. Knowledge elicitation plug-in for Protégé: Card sorting and laddering. In: The Semantic Web, ASWC, v. 4185, p. 552-565, 2006. Available from: http://dx.doi.org/10.1007/11836025_53
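The suffix-based information retrieval step of the extraction phase (section 3.2) can be illustrated with a minimal sketch. It uses Python's standard `re` module as a stand-in for the Sketch Engine corpus query language, which is what the study actually used; the suffix list is only the subset quoted in the text, and the function name `candidate_terms` and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

# Suffixes quoted in section 3.2; the study used more ("to mention but a few").
SUFFIXES = ["apheresis", "centesis", "desis", "ectomy", "opsy"]

# A whole word made of letters that ends in one of the suffixes.
# The prefix may be empty so that "apheresis" itself is retrieved.
SUFFIX_PATTERN = re.compile(
    r"\b[a-z]*(?:" + "|".join(SUFFIXES) + r")\b", re.IGNORECASE
)

def candidate_terms(text: str) -> Counter:
    """Return absolute frequencies of lowercased words ending in a listed suffix."""
    return Counter(m.group(0).lower() for m in SUFFIX_PATTERN.finditer(text))

# Illustrative sentence, not taken from the AABB manual.
sample = (
    "Plasmapheresis and apheresis are collection procedures. "
    "Apheresis donors may also undergo cordocentesis or a biopsy."
)
counts = candidate_terms(sample)
for term, freq in counts.most_common():
    print(term, freq)
```

A plain regular expression ignores part-of-speech tags, so a real corpus query would likely restrict matches to nouns; that filter is left out here to keep the sketch self-contained.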
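The comparison of term frequencies against a reference corpus (section 3.2) can be sketched in the same spirit. The text does not state which formula was used, so the smoothed ratio of per-million frequencies below is an assumption, not the authors' method. The study corpus size and the term counts come from section 4; `REFERENCE_SIZE` and every value in `REFERENCE_COUNTS` are made-up placeholders, not real BNC figures.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize an absolute frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def frequency_ratio(study_count: int, study_size: int,
                    ref_count: int, ref_size: int,
                    smoothing: float = 1.0) -> float:
    """Ratio of normalized frequencies; smoothing avoids division by zero."""
    study = per_million(study_count, study_size)
    ref = per_million(ref_count, ref_size)
    return (study + smoothing) / (ref + smoothing)

STUDY_SIZE = 369_741  # tokens identified in the codification activity
STUDY_COUNTS = {"apheresis": 124, "phlebotomy": 32, "cytometry": 20}

# Hypothetical placeholders: NOT actual BNC data.
REFERENCE_SIZE = 100_000_000
REFERENCE_COUNTS = {"apheresis": 10, "phlebotomy": 40, "cytometry": 60}

ranked = sorted(
    STUDY_COUNTS,
    key=lambda t: frequency_ratio(
        STUDY_COUNTS[t], STUDY_SIZE, REFERENCE_COUNTS[t], REFERENCE_SIZE
    ),
    reverse=True,
)
print(ranked)
```

A ratio well above 1 marks a term as characteristic of the domain corpus rather than of general language, which is the property that makes it a useful starting point for interviews.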
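The disease as disposition rationale used to organize the interviews (section 3.2) lends itself to a small data-structure sketch. The class below is a hypothetical note-taking record written for illustration; it is not the Protégé-Frames template used in the study, and the names `DiseaseRecord`, `missing_slots` and `relation_chain` are assumptions. The slot values reproduce the Bernard-Soulier example from section 4.

```python
from dataclasses import dataclass, field
from typing import List

# Relations quoted from the stage descriptions in section 3.2.
RELATIONS = [
    ("etiological_process", "produces", "disorder"),
    ("disorder", "bears", "disposition"),
    ("disposition", "realized in", "pathological_process"),
]

@dataclass
class DiseaseRecord:
    """Hypothetical interview record, one slot per stage of the rationale."""
    etiological_process: str
    disorder: str
    disposition: str
    pathological_process: str
    symptoms: List[str] = field(default_factory=list)
    signs: List[str] = field(default_factory=list)

    def missing_slots(self) -> List[str]:
        """Names of empty slots, i.e. questions still open for the expert."""
        return [name for name, value in vars(self).items() if not value]

    def relation_chain(self) -> List[str]:
        """Render each relation as 'SOURCE (value) => relation => TARGET (value)'."""
        def label(slot: str) -> str:
            return f"{slot.replace('_', ' ').upper()} ({getattr(self, slot)})"
        return [f"{label(s)} => {rel} => {label(t)}" for s, rel, t in RELATIONS]

bernard_soulier = DiseaseRecord(
    etiological_process="inheritance of a defect in the platelet membrane receptor",
    disorder="platelets with a glycoprotein Ib complex (GP Ib) abnormality",
    disposition="Bernard-Soulier Syndrome (A, B or C)",
    pathological_process="abnormal platelet adhesion to the extracellular matrix",
    symptoms=["bleeding", "hematomas"],
    signs=["excessive bleeding", "gingival bleeding", "menorrhagia",
           "purpura", "epistaxis", "gastrointestinal bleeding"],
)
for line in bernard_soulier.relation_chain():
    print(line)
```

Listing the empty slots after each interview gives the knowledge engineer a concrete agenda for the next session, which fits the limited availability of experts noted in the discussion.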