Natural Language Definitions for the Leukemia Knowledge Domain

Amanda Damasceno de Souza
Federal University of Minas Gerais, UFMG
Belo Horizonte, Minas Gerais, Brazil
amanda@ufmg.br

Mauricio Barcellos Almeida
Department of Theory and Management of Information
Federal University of Minas Gerais, UFMG
Belo Horizonte, Minas Gerais, Brazil
mba@eci.ufmg.br

This work is partially supported by Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG), Governo do Estado de Minas Gerais, Brazil, Rua Raul Pompéia, nº101 - São Pedro, Belo Horizonte, MG, 30.330-080, Brazil.

Abstract: The creation of natural language definitions is a phase of any methodology to build formal ontologies. In order to reach formal definitions, one should first create natural language definitions according to sound principles. We gather a set of principles available in the literature and organize them into a list of stages that one can use to create good definitions in natural language. In order to test the set of principles, we conducted a case study in which we created definitions in the domain of cancer, more specifically, definitions for acute myeloid leukemia. After creating and validating the definition of this specific kind of leukemia, we offer remarks about the experiment.

Keywords: Natural Language Definitions; Biomedical ontologies; Leukemia.

I. INTRODUCTION

In Information Science, ontologies have attracted the interest of researchers working on knowledge organization systems. Ontologies are employed to organize information and knowledge for purposes of information retrieval [1-2]. Ontologies also provide terminological standardization with the aim of organizing scientific knowledge and building repositories that are designed to foster interoperability between electronic information systems.

When building ontologies, an important activity is the formulation of definitions for domain-specific terms. The OBO Foundry [3] principles suggest that one should first provide natural language definitions and then provide the logical formulation [4]. Existing methodologies for the construction of ontologies do not present satisfactory guidelines to create definitions.

The goal of this paper is to propose methodological principles in order to systematize the process of definition creation in biomedical ontologies. We describe a case study on creating natural language definitions for leukemia, a disease which includes as subtypes acute myeloid leukemia (AML), the myelodysplastic syndromes and the myeloproliferative neoplasms. Here, we focus on AML.

The present paper arises out of our work on the Blood Ontology (BLO) project, a resource that allows the exploration of relevant information for research in hematology [5]. The BLO encompasses terms representing hematologic neoplasia, including leukemia and lymphoma.

II. BACKGROUND: DEFINITIONS IN ONTOLOGIES

The formulation of definitions has been studied by philosophers and linguists since ancient times. It connects philosophical notions (for example, those relating to the natures or essences of things) with other constructs such as terms, concepts and meanings [6].

The search for a proper definition of terms used to represent biomedical entities is connected to the process of learning. In order to define a term, one needs to have previous knowledge about the subject, and to know both the context in which the term is used and the associated technical jargon [7].

The difficulty in creating definitions in ontologies lies in the lack of people trained in the construction of ontologies. Indeed, this type of task gives rise to several theoretical and practical issues [8]. Definitions for terms in ontologies should be formulated according to familiar logical principles [9, 10, 11].

III. METHODOLOGY

In order to formulate proper definitions for terms related to hematologic neoplasms, we devised a three-step procedure as follows.

A. The Sample Collection

First, the set of terms relating to AML was selected by identifying which terms in BLO deal with hematologic neoplasia [5], amounting to 43 classes: 25 AML terms, 6 myelodysplastic syndrome terms and 12 myeloproliferative neoplasm terms.

B. Knowledge Acquisition

Second, we retrieved definitions in natural language for leukemia from traditional information sources such as the NCI Thesaurus [12]; the NCI Dictionary of Cancer Terms [13]; the Medical Subject Headings (MeSH) [14]; Medscape [15]; Ontobee [16]; the Disease Ontology [17]; the Gene Ontology [18]; and a textbook [19].

C. List of Stages to Create Definitions in Natural Language

Finally, we utilized the results of A and B to formulate a natural language definition, following well-established practices identified in the literature, such as: genus-differentia and essence [20]; formal relations [21,22]; necessary and sufficient conditions [11]; inheritance, intelligibility and circularity [9]; logical definitions [8]; issues in definitions and logical definitions [4]; biomedical definitions [32]; textual and formal definitions [23]. The stages are:

1) To elect the candidate term to be defined;
2) To obtain a preliminary definition;
3) To establish the superior genus;
4) To establish the essential characteristic;
5) To formulate the definition;
6) To check necessary and sufficient conditions;
7) To check non-circularity;
8) To check multiple inheritance.

IV. RESULTS

We describe the list of stages, explaining how to proceed in each stage to create natural language definitions of AML, using the example of the first class, AML.

1) To elect the candidate term to be defined

First, one should choose the candidate term to be defined according to techniques of knowledge acquisition [24]. In our case study, we followed the list of stages presented in section III.C for the term “Acute Myeloid Leukemia”.

2) To obtain a preliminary definition

In this stage one should perform a search in the specialized literature to obtain information about the term. The sources may be textbooks, papers, dictionaries, encyclopedias, thesauri and ontologies. In our case study, a librarian helped to select a suitable bibliography and to extract the information required to establish the genus and differentia of the term. The main source used to obtain a preliminary definition of AML was the NCI Thesaurus [12], from which we obtained the following definition:

“AML is an aggressive (fast-growing) disease in which too many myeloblasts (immature white blood cells that are not lymphoblasts) are found in the bone marrow and blood. Also called acute myeloblastic leukemia, acute myelogenous leukemia, acute nonlymphocytic leukemia, AML, and ANLL”.

3) To establish the superior genus

We determined the genus by seeking to identify a common feature of the selected term. In the case of leukemia, the common feature is the existence of an abnormal derivation of the myeloid lineage that occurs in each AML. So, we established the basic relation: Acute Myeloid Leukemia is-a Hematopoietic Neoplasm.

4) To establish the essential characteristic

We used the notion of differentia in order to define the essential characteristics that mark the distinction of the entity under definition from other entities in the hierarchy. These essential characteristics are often difficult to find. So, we first studied the domain, then we sought support from a cancer expert. Thus, we analyzed the characteristic that best represented that type of pathology.

The differentia between the classes of leukemia was obtained from a diagnosis based on morphological criteria (cell type), immunological criteria (CD13, CD33, etc.) and cytogenetic criteria (abnormalities such as t(8;21)(q22;q22) and PML-RARA), as well as on the lineage and the degree of maturation.

In the domain of cancer, these three diagnostic aspects (morphological, cytogenetic and immunophenotypic) were essential to perform the task of defining. However, we learned from experts that morphology still represents the central criterion in distinguishing leukemia types.

In cases where several features were defined by the diagnosis, another criterion was required in order to find the essence, for example, a criterion based on prognosis. We reviewed which characteristics determined the prognosis of the disease, and the most important one was considered the essential characteristic. The participation of an expert was crucial to confirm the essential feature. The resulting essential characteristic was:

“AML derives from an uncontrolled proliferation of the myeloid lineage and its precursors”

5) To formulate the definition in the form S = Def. a G which Ds

In this stage, the information gathered in stages 1 to 4 was applied to formulate a preliminary version of the definition. The definition has the form: S = Def. a G which Ds, where “S” (species) is the term under definition, “G” (genus) is the superior term of “S”, and “D” is an essential characteristic, that is, the differentia. So, “S” is the class of leukemia to be defined, “G” is the general class and “D” is the differentia that distinguishes the instances of S from the other instances of G in the context of leukemia.

The formulation of the definition was initiated by writing the term to be defined followed by its genus (stage 3) and its differentia (stage 4). Then, we corrected the first version of the definition from a grammatical point of view, adding or removing parts of the text obtained in the first stages. We also chose preferred terms and eliminated redundant words. After some changes to the first versions, the result of stage 5 was:

“An acute myeloid leukemia is-a hematopoietic neoplasm that derives from an uncontrolled proliferation of the myeloid lineage and its precursors”.

6) To check necessary and sufficient conditions

The verification was performed through the following expression: if to be an A is a necessary condition to be a B, then each B is an A; if to be an A is a sufficient condition to be a B, then each A is a B [11]. Here, “A” represents the essential characteristic of the definition, and “B” represents the term under definition.

To derive from an uncontrolled proliferation of a myeloid lineage and its precursors is a necessary condition to be an AML; that is, each AML derives from an uncontrolled proliferation of a myeloid lineage and its precursors.

To derive from an uncontrolled proliferation of a myeloid lineage and its precursors is a sufficient condition to be an AML; that is, everything that derives from an uncontrolled proliferation of a myeloid lineage and its precursors is an AML.

7) To check the principle of non-circularity

In this stage, we scrutinized the definition for circularity. Roughly, circularity is a situation in which a term is employed to define the very same term.

8) To check the principle of multiple inheritance

All kinds of AML descend from the myeloid lineage because we are dealing with a clonal disease. So, we could define AML without multiple inheritance.

Once the eight steps have been presented, brief considerations regarding the data validation by experts are in order. In our case study, the validation was performed by a pediatric oncologist specializing in leukemia. We first asked the expert whether the derivation from an uncontrolled proliferation of a myeloid lineage (and its precursors) is the main characteristic of an AML. The oncologist reported:

“To define leukemia we need three characteristics: morphological, cytogenetic and immunophenotyping. This specific case has a unique parent, namely, the myeloid lineage since it is a clonal disease. In cases that a leukemia is biphenotypic or bi-lineage, it has two origins since it presents several clonal cell populations. However, this kind we are evaluating presents only the myeloid lineage. Even if the descent from the myeloid lineage was minimal, it presents only one differentiation.”

V. DISCUSSION

In this section we present some remarks about our experience of creating definitions of AMLs.

After our search for definitions in healthcare and biology information sources, we analyzed the definitions found according to a set of criteria that considers: multiple definitions, lack of proper characterization, intangible definitions, circular definitions and presence of technical terms [4, 9].

In analyzing the definitions of leukemia presented in the literature, we recognized that the definitions found were not satisfactory. For example, definitions found in the Disease Ontology [17] are too general, and Ontobee [16] lacks natural language definitions for several leukemia types. In the NCI Thesaurus [12] we found definitions in natural language for all classes of our sample. It meets the criteria of clarity, consistency, coherence, and extensibility [25,26]. Some classes were so complex to define that our first attempt to classify them resulted in circularity. With respect to technical terms, it was necessary to clarify their meaning. For example, to understand how cells undergo changes, one should consider three types of mutation: translocation, deletion and inversion. Establishing the genus and the differentia therefore requires a deep knowledge of the domain.

As leukemia is a clonal disease, meaning the lineage could be either myeloid or lymphoid, we defined the myeloid inheritance according to well-known classification criteria [27,28,29]. However, there are types of acute leukemia that have both myeloid and lymphoid lineages. These cases are considered mixed, hybrid or biphenotypic. Nevertheless, we classified these cases as subtypes of AML derived from a myeloid lineage, following the FAB classification.

We applied ontological principles in the formulation of definitions in order to test the proposed method. After testing, we noticed issues regarding the methodological steps and the validation by the domain expert. For example, we realized that obtaining definitions of leukemia from only one dictionary was not enough. So, we resorted to additional sources such as pathology and hematology textbooks [19,30], leukemia classifications [27,28,29] and scientific papers.

All selected sources had general definitions of leukemia, but we detected the categorization issues described by Seppälä and Ruttenberg [4]: circular and intangible definitions, use of technical terms and multiple definitions for the same term. In order to mitigate these issues, we decided in some cases to explain the meaning of other terms, for example, genetic mutations, thereby creating more definitions. We conclude that, in order to understand leukemia, one must explore its relationships to areas such as pathology, cancer diagnosis, etiology, and so forth.

As mentioned previously, the need for guidance from a domain expert is noteworthy. The definitions presented to the expert for validation and her observations were used both to amend the definitions and to review the method. The crucial aspect of the validation process was finding the essence of the acute myeloid leukemia classes. The process was conducted through personal interviews with a pediatric oncologist, who employed her experience to confirm the essential characteristics and to determine necessary and sufficient conditions. Our case study relied on only one expert, but we acknowledge that a thorough validation should involve several specialists.

The essence of entities was based on the diagnostic criteria of the FAB classification [27] and the WHO [28,29]. The essence was mainly defined by morphological characteristics, except in the case of the class “acute myeloid leukemia with recurrent genetic abnormalities”. In this particular case, the essence was based on cytogenetic abnormalities. The WHO [29] decided to maintain cytogenetic abnormalities as the main characteristic for this class. Therefore, when reviewing the necessary and sufficient conditions, we recognized the influence of leukemia classification issues. For example, in the just mentioned class “acute myeloid leukemia with recurrent genetic abnormalities”, the requisite characteristic is that every AML of this class carries a genetic mutation.

Cancers, especially leukemia, are complex diseases. A single morphological characteristic (for example, the presence of a given percentage of blasts) is not enough for disease diagnosis and treatment. This fact confirms the relevance of defining a well-founded and robust formal vocabulary to represent entities in the leukemia field.

VI. FINAL REMARKS

This research gathered some principles already present in the literature of formal ontologies to propose a method with the aim of systematizing the process of creating definitions. After our practical case study, we recommend that the method be reviewed and improved upon; the terminological complexity of leukemia made the work difficult and laborious.

As previously stated, the main challenge in our case was determining the essence of the classes of leukemia, since this domain is among those that exhibit the greatest diversity of phenotypic and genetic changes at diagnosis in cancer studies. Even the FAB classification [27] and the WHO [29] have difficulty in categorizing, defining and diagnosing subtypes of AML. Efforts to better categorize myeloid neoplasms exist, as pointed out by Vardiman, Harris and Brunning [28]. Vardiman et al. [29] published an update to the WHO classification proposing the use of both morphology and immunophenotyping information to define leukemia, as well as cytochemical, genetic and clinical characteristics.

With this experience in mind, we intend future research to further contribute to a theory of formulating definitions in ontologies, as well as to the standardization of definitions in the field of cancer. In doing so, we contribute to the field of biomedical ontologies and healthcare. The complete results and findings of this work can be found in the thesis [33].

ACKNOWLEDGMENT

A. D. S. thanks Dr. Joaquim Caetano Aguirre Neto, MD, who provided support in the validation of the textual definitions, and Dr. Dagobert Soergel, PhD, for his useful advice.

REFERENCES

[1] D. Soergel. “The rise of ontologies or the reinvention of classification”. J Am Soc Inf Sci., vol. 50, n. 12, pp. 1119-20. 1999.
[2] B.C. Vickery. “Ontologies”. Journal of Information Science, vol. 23, n. 4, pp. 277-286. 1997. http://mba.eci.ufmg.br/downloads/recol/277.pdf
[3] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L.J. Goldberg, K. Eilbeck, A. Ireland, C.J. Mungall, N. Leontis, P. Rocca-Serra, A. Ruttenberg, S.-A. Sansone, R.H. Scheuermann, N. Shah, P.L. Whetzel and S. Lewis. “The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration”. Nat Biotechnol., vol. 25, n. 11, pp. 1251-5. Nov. 2007.
[4] S. Seppälä and A. Ruttenberg. “Survey on Defining Practices in Ontologies: Report. In preparation of the International Workshop on Definitions in Ontologies”. International Conference on Biomedical Ontology (ICBO 2013), Montreal, Canada. http://definitionsinontologies.weebly.com/
[5] M.B. Almeida, A.B. Proietti, B. Smith, and J. Ai. “The Blood Ontology: an ontology in the domain of hematology”. [ICBO 2011; Buffalo, USA].
[6] A. Gupta. “Definitions”. The Stanford Encyclopedia of Philosophy (Winter 2008 Edition), Edward N. Zalta, Ed. The Metaphysics Research Lab: Stanford. http://plato.stanford.edu/entries/definitions/.
[7] N. Swartz. “Definitions, Dictionaries, and Meanings”. This revision: November 8, 2010. http://www.sfu.ca/~swartz/definitions.htm.
[8] G. Tsatsaronis, A. Petrova, M. Kissa, Y. Ma, F. Distel, F. Baader, and M. Schroeder. “Learning Formal Definitions for Biomedical Concepts”. [Srinivas, K; Jupp, S. (eds) Proceedings of the 10th OWL: Experiences and Directions Workshop (OWLED 2013), May 2013]. https://ddll.inf.tu-dresden.de/web/LATPub509/en.
[9] J. Köhler, K. Munn, A. Ruegg, A. Skusa, and B. Smith. “Quality control for terms and definitions in ontologies and taxonomies”. BMC Bioinformatics, vol. 7, pp. 212. Apr. 2006.
[10] T.R. Gruber. “Toward Principles for the Design of Ontologies Used for Knowledge Sharing”. 1993. Stanford Knowledge Systems Laboratory. http://citeseerx.ist.psu.edu.
[11] B. Smith. “Introduction to the Logic of Definitions”. [International Workshop on Definitions in Ontologies, DO 2013, July 7, Montreal, 2013]. http://ceur-ws.org/Vol-1061/Paper5_DO2013.pdf.
[12] NCI Thesaurus. http://ncit.nci.nih.gov/
[13] National Cancer Institute. “NCI Dictionary of Cancer Terms: acute myeloid leukemia”. http://www.cancer.gov/
[14] MeSH. http://www.ncbi.nlm.nih.gov/mesh
[15] Medscape. http://www.medscape.com/
[16] Ontobee. “Leukemias”. 2014. http://www.ontobee.org.
[17] Disease Ontology (DO). “Leukemia”. 2014. http://disease-ontology.org/.
[18] The Gene Ontology. http://www.geneontology.org/.
[19] K. Reichard, Wilson, Czuchlewski, Vasef, Zhang, and Hunt. “Myeloid neoplasm”. In: Diagnostic pathology blood and bone marrow. Manitoba: Amirsys, 2012. ch. 9, pp. 2-208.
[20] J. Michael, J.L. Mejino Junior, and C. Rosse. “The role of definitions in biomedical concept representation”. Proc AMIA Symp., pp. 463-7. 2001.
[21] B. Smith, W. Ceusters, B. Klagges, J. Kohler, A. Kumar, J. Lomax, C. Mungall, F. Neuhaus, A.L. Rector, and C. Rosse. “Relations in biomedical ontologies”. Genome Biology, vol. 6, n. 5, pp. R46. 2005.
[22] D. Soergel. “Knowledge Organization Systems: Overview”. 2014. http://www.dsoergel.com/SoergelKOSOverview.pdf.
[23] A. Petrova, Y. Ma, G. Tsatsaronis, M. Kissa, F. Distel, F. Baader, and M. Schroeder. “Formalizing biomedical concepts from textual definitions”. J Biomed Semantics, vol. 6, pp. 22. April 2015.
[24] F.M. Mendonça, K.C. Cardoso, A.Q. Andrade and M.B. Almeida. “Knowledge Acquisition in the construction of ontologies: a case study in the domain of hematology”. Proceedings of the International Conference of Biomedical Ontologies (ICBO), 2012, Austria.
[25] M. Uschold. “Building Ontologies: Towards a Unified Methodology (1996)”. http://citeseerx.ist.psu.edu/
[26] T.R. Gruber. “A translation approach to portable ontologies”. Knowledge Acquisition, vol. 5, n. 2, pp. 199-220. 1993. http://www.dbis.informatik.hu-berlin.de/.
[27] National Cancer Institute. “FAB”. http://www.cancer.org/cancer/.
[28] J.W. Vardiman, N.L. Harris, and R.D. Brunning. “The World Health Organization (WHO) classification of the myeloid neoplasms”. Blood, vol. 100, n. 7, pp. 2292-302. October 2002.
[29] J.W. Vardiman, J. Thiele, D.A. Arber, R.D. Brunning, M.J. Borowitz, A. Porwit, N.L. Harris, M.M. Le Beau, E. Hellstrom-Lindberg, A. Tefferi, and C.D. Bloomfield. “The 2008 revision of the World Health Organization (WHO) classification of myeloid neoplasms and acute leukemia: rationale and important changes”. Blood, vol. 114, n. 5, pp. 937-51. July 2009.
[30] R. Hoffman, E.J. Benz Jr, L.E. Silberstein, H. Heslop, J. Weitz, and J. Anastasi. “Hematology: Basic Principles and Practice”. 5th Ed. New York: Churchill Livingstone, 2008. pp. 2560.
[31] C. Rosse and J.L.V. Mejino Junior. “A reference ontology for biomedical informatics: the foundational model for anatomy”. Journal of Biomedical Informatics, vol. 36, pp. 478-500.
[32] S. Seppälä, Y. Schreiber and A. Ruttenberg. “Textual and logical definitions in ontologies”. CEUR Workshop Proceedings, vol. 1309, Houston, TX, USA, October 6-7, pp. 35-41.
[33] A.D. Souza. “Systematizing of the Methodology of Creating Formal Definitions in Biomedical Ontologies: an Investigation in Acute Myeloid Leukemia Domain”. [Thesis] Master in Information Science. School of Information Science, Federal University of Minas Gerais, Brazil, 2016. (in Portuguese)
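
As a supplementary illustration of stages 5 to 8 of the list of stages, the checks can be sketched mechanically. The Python sketch below is not part of the method reported in the case study, where the checks were carried out by hand with a domain expert. The class `Definition`, the helper functions, and the toy case records are hypothetical names introduced here for illustration only. The circularity test is a naive substring match, and the necessary and sufficient check merely evaluates "each B is an A" and "each A is a B" over whatever records it is given.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Definition:
    """Genus-differentia definition of the form: S = Def. a G which Ds."""
    species: str      # S, the term under definition
    genus: str        # G, the immediate superior term (stage 3)
    differentia: str  # D, the essential characteristic (stage 4)

    def render(self) -> str:
        # Stage 5: write the term, then its genus, then its differentia.
        article = "An" if self.species[0].lower() in "aeiou" else "A"
        return f"{article} {self.species} is-a {self.genus} that {self.differentia}"


def is_circular(d: Definition) -> bool:
    # Stage 7 (naive): the defined term must not reappear in its definiens.
    definiens = f"{d.genus} {d.differentia}".lower()
    return d.species.lower() in definiens


def parents_of(term: str, is_a: Iterable[Tuple[str, str]]) -> Set[str]:
    # Collect every genus asserted for the term as (child, parent) pairs.
    return {parent for child, parent in is_a if child == term}


def has_multiple_inheritance(term: str, is_a: Iterable[Tuple[str, str]]) -> bool:
    # Stage 8: more than one asserted parent means multiple inheritance.
    return len(parents_of(term, is_a)) > 1


def check_conditions(cases: List[dict],
                     is_b: Callable[[dict], bool],
                     has_a: Callable[[dict], bool]) -> Tuple[bool, bool]:
    # Stage 6: A is necessary for B when each B is an A;
    #          A is sufficient for B when each A is a B.
    necessary = all(has_a(c) for c in cases if is_b(c))
    sufficient = all(is_b(c) for c in cases if has_a(c))
    return necessary, sufficient


aml = Definition(
    species="acute myeloid leukemia",
    genus="hematopoietic neoplasm",
    differentia=("derives from an uncontrolled proliferation of the "
                 "myeloid lineage and its precursors"),
)
is_a = [("acute myeloid leukemia", "hematopoietic neoplasm")]

# Invented toy records, for illustration only (not clinical data).
cases = [
    {"aml": True, "myeloid_proliferation": True},
    {"aml": False, "myeloid_proliferation": False},
]

print(aml.render())
print("circular:", is_circular(aml))
print("multiple inheritance:", has_multiple_inheritance(aml.species, is_a))
print("necessary, sufficient:",
      check_conditions(cases, lambda c: c["aml"],
                       lambda c: c["myeloid_proliferation"]))
```

Under these assumptions, a class asserted under two parents (for instance, a biphenotypic leukemia placed under both a myeloid and a lymphoid parent) would be flagged by `has_multiple_inheritance`, whereas the AML class defined in the case study has a single genus. A flag from any of these checks is a prompt for expert review, not a substitute for it.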