An Ontology for TNM Clinical Stage Inference

Felipe Massicano², Ariane Sasso¹, Henrique Amaral-Silva¹, Michel Oleynik³, Calebe Nobrega¹, Diogo F. C. Patrão¹

¹ CIPE - A. C. Camargo Cancer Center
² IPEN - USP

{djogo,ariane.sasso,henrique.silva,cnobrega,michel}@cipe.accamargo.org.br
massicano@gmail.com

Abstract. TNM is a classification system for assessing the progression stage of malignant tumors. The physician, upon patient examination, classifies a tumor using three variables: T, N and M. The definitions of the values for T, N and M depend on the tumor topography (or body part), specified as ICD-O codes. These values are then used to infer the Clinical Stage (CS), which reflects the disease progression and can be 0 (no malignant tumor), IS (in situ), I, II, III, or IV. The rules for inference are different for each topography and may depend on other factors such as age. With the objective of evaluating missing CS information in the A. C. Camargo Cancer Center databases, we developed an open ontology to represent TNM concepts and rules for CS inference. It was designed to be easily extensible and fast to compute.

1. Introduction

Originally developed in 1958 and since then maintained by the Union for International Cancer Control (UICC), the TNM staging system is a cancer classification scheme used mainly to predict survival rates given the disease severity. Based on the fact that patients with localized tumors present higher survival rates than patients with distant metastasis, the TNM staging system aims to help doctors with treatment planning, disease prognosis and interpretation of treatment results, and also to facilitate information sharing and improve cancer research [Sobin and Wittekind C 2002].
The classification is based on three main discrete variables: T (0-4), for the evaluation of the primary tumor extension; N (0-3), for the appraisal of the presence and extension of metastasis in regional lymph nodes; and M (0-1), to annotate the absence or presence of distant metastasis. Some topographies include an additional character in the range a-d for specifying subcategories. Additional characters can also be included to define the information source (clinical exam or pathology biopsy); the diagnosis stage (before/after treatment, after recurrence or through autopsy); and the existence of multiple tumors in the same site. Moreover, other symbols describe optional lymphatic and venous invasion, the histological grade, the metastasis site, the presence of isolated tumor cells, the sentinel lymph node invasion status, the degree of certainty and the presence of residual tumor after treatment [Sobin and Wittekind C 2002].

Additionally, each topography has rules for mapping the TNM staging into a single variable called clinical stage. The clinical stage ranges from 0 to IV, with an additional character for some sites. Although the rules differ for each topography, higher clinical stages correlate with worse prognosis. Therefore, its determination is a central point in the cancer diagnostic process.

The rules for clinical stage inference, standardized by the TNM staging system, should be applied by physicians during the medical appointment. However, several factors keep them from being widely adopted, such as resistance by physicians to extra paperwork, physicians' uncertainty concerning the current staging system and a lack of regulatory processes to enforce compliance with the standard [Schmoll 2003]. Many efforts have been made lately to increase adoption, including its recommendation by specialized medical societies and its use as a mandatory prerequisite for quality accreditation in oncology care [Neuss et al. 2005].
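As a rough illustration of the T, N and M value ranges described above, the following sketch validates a compact TNM string. It is only a minimal example under stated assumptions: the function name `parse_tnm` and the compact "T1bN0M0" notation are hypothetical, and the pattern covers only the ranges named in the text (T 0-4, N 0-3, M 0-1, an optional a-d subcategory) plus Tis (in situ). The other prefixes and modifiers mentioned above are deliberately left out.

```python
import re

# Assumption: codes are written compactly, e.g. "T1bN0M0" or "Tis N0 M0".
# Only the ranges stated in the text are accepted; real TNM codes allow
# further symbols (prefixes, certainty factors, etc.) not modelled here.
TNM_PATTERN = re.compile(
    r"^T(?P<t>is|[0-4][a-d]?)"   # primary tumor extension
    r"N(?P<n>[0-3][a-d]?)"       # regional lymph nodes
    r"M(?P<m>[01][a-d]?)$"       # distant metastasis
)


def parse_tnm(code: str) -> dict:
    """Split a compact TNM string into its T, N and M values."""
    match = TNM_PATTERN.match(code.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognised TNM code: {code!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_tnm("T1bN0M0"))
```

A parser like this could sit in front of any staging logic to reject malformed input early; it does not by itself say anything about the clinical stage.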
Moreover, the TNM staging information is also crucial for cancer research. As the different clinical stages indicate better or worse response to certain treatments and better or worse prognosis, cancer studies usually focus on diseases of a specific tissue and a specific clinical stage. If the clinical database does not contain this information for a relevant fraction of the patients, researchers may have to resort to manually assessing the patient records to find out the sample size.

Since the rules for clinical stage coding are explicitly defined in the TNM publication, it is possible to create a computer program to evaluate them automatically. Such a program would validate existing values, or even provide this information when it is missing. However, representing all rules directly in a programming language is an exhausting and repetitive task, and may lead to code maintenance issues. In addition, it would be difficult for an oncology expert, untrained in computer programming, to validate the algorithms.

To overcome these difficulties, we propose using ontologies to model the concepts, descriptions and rules of TNM clinical stages. In summary, the term ontology means a specification of a conceptualization, and it has been applied to create standardized dictionaries in several fields [Gruber 1993].

Standardized ontologies have been developed in many areas so that domain experts can share and annotate information in their respective fields. In medicine, well-known standardized and structured vocabularies such as Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT)¹, RadLex [Langlotz 2006], Unified Medical Language System (UMLS) [Lindberg et al. 1993], Medical Subject Headings (MeSH) [Nelson et al. 2001] and others have been used for clinical and research purposes. Although new general and specialized ontologies are emerging fast, no published ontology yet addresses the TNM clinical stage coding problem. Still, some ontologies represent some of the TNM concepts.

¹ http://www.ihtsdo.org/snomed-ct

The National Cancer Institute Thesaurus (NCIt) is a reference terminology that covers clinical care, basic and translational research, public data and also the administrative domain regarding the National Cancer Institute (NCI). It was built upon the NCI Metathesaurus from the UMLS and is based on description logic with relationships between semantically rich concepts [Smith et al. 2005]. It is coded in OWL Lite, a subset of OWL-DL with enough complexity to represent the ontology data [Bechhofer et al. 2004]. It provides some of the TNM concepts for the 6th and 7th editions, and each topography has its own T, N, M and CS classes with annotations in English. When a concept has the same definition in the 6th and 7th editions, it is defined as a single class; otherwise, specific classes for each version are defined. There are no axioms for inferring the Clinical Stage from the values of T, N and M.

SNOMED CT is a vocabulary comprising more than 310,000 hierarchically organized concepts. There are concepts to represent all of TNM (including individual definitions for T, N, M and CS for each topography); however, there are no compositional rules connecting T, N, M and the topography to the CS. Moreover, its license is not open and there is no official or unofficial translation to Portuguese.

Dameron et al. propose the creation of an ontology for automatic grading of lung tumours using the OWL-DL description logic language, inspired by the controlled vocabulary for cancer, the NCIt, and also by the Foundational Model of Anatomy (FMA) for its anatomical decomposition [Dameron et al. 2006]. Marquet et al.
also developed an ontology based on the NCIt for automatic classification of glioma tumors using the WHO grading system. Their ontology contained 243 classes (234 of them corresponding to NCIt classes); it correctly classified the simulated tests and correctly graded ten of the eleven clinical reports used in the test with clinical data [Marquet and Dameron 2007]. The links mentioned in both manuscripts for downloading the ontologies were not active at the time of this writing.

The TNM ontology [Boeker et al. 2014] is a thorough representation of the TNM concepts for breast cancer using OWL-DL with SRI expressivity. The focus there was the representation of the clinical meaning of each concept, T, N and M, with links to the Foundational Model of Anatomy [Rosse and Jr. 2003]. They depict how to represent the tumor, the lymph node, distant metastasis, the organ locations specified and the tumor invasion pattern. Thorough as it is, it contains no rules for inference of clinical stage, nor the concepts related to the latter.

In this work we present an ontology that allows inference of the TNM clinical stage of tumors, based on given values of T, N and M, the ICD-O topographic code and other information. This ontology should provide annotations with the original descriptions from the reference, and links to the NCIt ontology wherever applicable.

2. Materials and Methods

The first step was to identify the most common topographies among A. C. Camargo Cancer Center patients. After interviewing an expert oncologist, we created a list of the ten most relevant topographies for research at this institution. We used the TNM 6th edition, because most of the relevant databases in the institution used this version of the coding system.

To achieve the goal of a fast-computing ontology, we kept its expressivity at the bare minimum while preserving the intended meaning of concepts. We used only subclass, intersection, equivalence and disjointness between classes, and object properties.
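To give an intuition for why such a small set of constructs can be enough for stage inference, the sketch below mimics it in plain Python: each axiom is a conjunction of class names implying a stage class, a tumor's asserted classes are closed under subclass links, and every axiom whose conjuncts are all present fires. This is not an OWL reasoner and not the authors' implementation. The class names with underscores and the `SUBCLASS_OF` table are hypothetical, and the breast cancer (C50) rules are the ones shown in Figure 5.

```python
# Subclass links from a specific class to its direct superclasses.
# Hypothetical excerpt: T1a and T1b are treated as subcategories of T1.
SUBCLASS_OF = {"T1a": {"T1"}, "T1b": {"T1"}}

# Each axiom reads "intersection of these classes is a subclass of stage".
AXIOMS = [
    (frozenset({"C50", "Tis", "N0", "M0"}), "BreastCancer_CS_0"),
    (frozenset({"C50", "T1", "N0", "M0"}), "BreastCancer_CS_I"),
    (frozenset({"C50", "T2", "N1", "M0"}), "BreastCancer_CS_IIB"),
    (frozenset({"C50", "N3", "M0"}), "BreastCancer_CS_IIIC"),
    (frozenset({"C50", "M1"}), "BreastCancer_CS_IV"),
]


def closure(classes):
    """Return the asserted classes plus all of their superclasses."""
    result = set(classes)
    frontier = list(classes)
    while frontier:
        current = frontier.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result


def infer_stages(asserted):
    """Return every stage whose axiom body is contained in the closure."""
    known = closure(asserted)
    return {stage for body, stage in AXIOMS if body <= known}


if __name__ == "__main__":
    print(infer_stages({"C50", "T1b", "N0", "M0"}))
```

Because inference here reduces to subset checks over a subclass closure, a tumor asserted with a subcategory such as T1b also satisfies every rule written for T1. A real reasoner additionally enforces disjointness, which this sketch ignores.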
As seen in Figure 1, the ontology is divided into four files: the main ontology, with the general TNM concepts and the imports of all others; the ICD-O topography, with the topographic classes referred to by the TNM; a file with the annotations; and finally a file with the clinical stage inference axioms.

Figure 1. TNM Ontology components and imports diagram.

The concepts for representing T, N, M and CS were created as a hierarchy of classes: the root concept TNM 6th edition, and its direct subclasses T, N, M and EC (the Portuguese acronym for CS). There are subclasses that describe the general classification for all tumors, according to the introduction of the TNM reference. There may be an additional level of subclasses for representing concepts such as T1b or CS IIIa (as defined in some topographies such as breast cancer). We called all of these the general staging classes. See Figure 2.

Figure 2. Class hierarchy for TNM concepts.

As the clinical stage rules depend on the tumor topography, the axioms for inference need to reference ICD-O topography concepts. We could not find any ICD-O ontology available, and it was beyond the scope of our work to create one. However, as ICD-O topographic codes were based on ICD-10 cancer codes, we reused an ICD-10 ontology available on the BioPortal². We kept only the C00-C80 range of codes, removed some undefined codes within this range (such as C43, C78 and C79) and added C42 (as described in the ICD-O introduction). We also changed the ontology namespace and changed the label annotation property to skos:label. The reference to the prior ontology was kept. Figure 3 depicts the ICD-O ontology.

To represent actual patient data, there should be an instance of class Patient, related to one or more instances of class tumor. To use the ontology to represent data, an instance representing the tumor should be created and related to subclasses of the T, N, M, CS and ICD-O Topography classes.
Following the TNM guidelines for staging, a patient with two primary tumors should be represented as one instance of a patient linked to two tumor instances; however, a patient with one tumor that metastasised should have only one tumor instance. The patient instance should be linked to the tumor instances by an object property.

A tumor should not belong to more than one topography class. First, it does not make clinical sense: a tumor should be located in a specific location or organ. It may happen to spread to neighbouring tissues, or the precise location may be uncertain (such as the gastroesophageal junction). In these cases the most probable tumor location should be selected and linked to the instance. Second, the ICD-O Topography ontology states disjointness axioms for all its classes, preventing a tumor instance from belonging to two topographic locations at once.

² http://bioportal.bioontology.org/ontologies/ICD10

Figure 3. Class hierarchy for ICD-O concepts.

As each topography has different definitions for individual values of the general staging classes T, N and M, we created a script that parses a text file and creates an RDF/XML file defining specific staging classes and inference axioms for each pair of a T, N or M value and a topography, plus annotations using the rdf:Description annotation property. We manually created the text files based on the TNM definitions. The axioms are subclass axioms relating the specific staging classes to the intersection of one general staging class and one topography class.

Whenever a corresponding NCIt concept was available, it was linked to the specific staging class by the property owl:equivalentClass (see Figure 4). Not all concepts defined in TNM were present in NCIt; one example is T4 for Breast Cancer.

C50 ⊓ M1 ⊑ C50 M1
C50 M1 ≡ NCIt:C49009

Figure 4. Relation between an annotation from the current ontology and a NCIt class.

The standard procedure at the A. C.
Camargo Cancer Center is to encode the TNM staging and the ICD-O topography during clinical attendance. As a result, structured information about the clinical stage is not readily available in its databases. For this reason, we constructed inference axioms that use the values of T, N, M and ICD-O to infer the clinical stage (CS) values.

These axioms are also generated from text files. Each file starts with a first line containing the name of the clinical stage class to be assigned. The second line contains one or more topography classes linked to that clinical stage class, separated by a space character. Each of the remaining lines gives a group of T, N and M classes that is combined by conjunction with each specified ICD-O topography. See Figure 5 for an excerpt of these axioms.

C50 ⊓ Tis ⊓ N0 ⊓ M0 ⊑ BreastCancer CS 0
C50 ⊓ T1 ⊓ N0 ⊓ M0 ⊑ BreastCancer CS I
C50 ⊓ T2 ⊓ N1 ⊓ M0 ⊑ BreastCancer CS IIB
C50 ⊓ N3 ⊓ M0 ⊑ BreastCancer CS IIIC
C50 ⊓ M1 ⊑ BreastCancer CS IV

Figure 5. Axioms for inference of clinical stage (CS) based on ICD-O topography and T, N and M classes.

For testing purposes we created another ontology containing patients and tumors asserted as members of specific classes of this ontology. For each test, the tumor instance is assigned a topographic class together with its T, N and M classes, as in the example below.

patientTest00100 : Patient ⊓ hasTumor value patientTest00100 Tumor1
patientTest00100 Tumor1 : C50 ⊓ Tis ⊓ N0 ⊓ M0

After the inference, we can check the TNM annotation classes and also the respective NCIt code class. In this way the ontology fulfils its purpose of assigning each tumor to its inferred clinical stage class.

We created a script to generate 566 tests based on the text mappings, as instances of the Patient class with exactly one Tumour instance related to each. There was one test for each possible combination of T, N, and other variables for which a clinical stage could be inferred.
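The mapping format and the test generation described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' script. It assumes that each rule line lists its T, N and M classes separated by spaces, that a file holds a single record, and that class names use underscores. The `SAMPLE` record is hypothetical: its first rule row matches Figure 5 and the second is added only to show a multi-row record. Emitting one test per rule row is also a simplification of the "one test per possible combination" approach.

```python
from typing import List, NamedTuple, Tuple


class Axiom(NamedTuple):
    conjuncts: Tuple[str, ...]  # topography class followed by T/N/M classes
    stage: str                  # clinical stage class implied by the conjunction


# Hypothetical mapping record: stage class, topographies, then rule rows.
SAMPLE = """\
BreastCancer_CS_IIB
C50
T2 N1 M0
T3 N0 M0
"""


def parse_mapping(text: str) -> List[Axiom]:
    """Turn one mapping record into one axiom per (rule row, topography)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 3:
        raise ValueError("expected a stage line, a topography line and rule lines")
    stage, topographies, rules = lines[0], lines[1].split(), lines[2:]
    return [
        Axiom((topography, *rule.split()), stage)
        for rule in rules
        for topography in topographies
    ]


def render(axiom: Axiom) -> str:
    """Write the axiom in the description-logic notation used in Figure 5."""
    return " ⊓ ".join(axiom.conjuncts) + " ⊑ " + axiom.stage


def make_tests(axioms: List[Axiom]) -> List[dict]:
    """Create one test case per axiom: asserted classes plus expected stage."""
    return [
        {
            "patient": f"patientTest{index:05d}",
            "tumor_classes": set(axiom.conjuncts),
            "expected": axiom.stage,
        }
        for index, axiom in enumerate(axioms, start=1)
    ]


if __name__ == "__main__":
    for axiom in parse_mapping(SAMPLE):
        print(render(axiom))
```

Keeping the rules in a flat text file and generating both axioms and tests from it means that a domain expert can review the rules without reading OWL, which appears to be the motivation for the approach described above.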
We then created two queries: one listing test instances without any inferred clinical stage (there should be none), and another listing the inferred and the expected clinical stage for each test.

The software we used to create the ontologies was Protégé³. The scripts for the creation of OWL files based on text files were developed in Python. The inferences were computed using Pellet⁴.

³ http://protege.stanford.edu/
⁴ https://github.com/complexible/pellet

3. Results

The resulting TNM ontology is divided into four files: main TNM concepts, ICD-O topography, annotations and clinical stage axioms. The main TNM ontology contains the general staging classes and imports the other ontologies. The ICD-O topography ontology contains the topographic codes and superclasses (such as C00-C14 - Head and Neck), with English descriptions. The annotation ontology defines the specific TNM classes (such as C50 T1 and C61 M0) and their corresponding descriptions in Portuguese and English. Finally, the clinical stage axioms ontology defines the logical axioms that allow the inference of clinical stage based on ICD-O topography and TNM values.

The consolidated ontologies have ALC (Attributive Concept Language with Complements) expressivity. They consist of 4,382 axioms, 2,954 logical axioms and 772 classes, and define 1,690 subClassOf axioms, 16 EquivalentTo axioms, 1,248 disjointClasses axioms and 643 AnnotationAssertion axioms.

The ontology, the scripts and the text files used to generate it were released under the Apache-2.0⁵ open source license and are available online at https://github.com/djogopatrao/tnm_ontology/tree/master/ontologies.

All 566 test instances were assigned a clinical stage, and only one was assigned two clinical stages. PatientTest 51 was supposed to be assigned Prostate Cancer Clinical Stage I; however, an additional concept, Clinical Stage II, was present.
This is because the definition of those clinical stages, as stated in the original reference, is ambiguous: Clinical Stage I is defined as T1a, N0, M0 and G1 (Gleason 2-4, slight anaplasia), while Clinical Stage II, among other definitions, can be T1, N0, M0 and any G. T1, for prostate cancer in the 6th edition of TNM, means "Clinically inapparent tumor neither palpable nor visible by imaging", while T1a (a subconcept of the former) is defined as "Tumor incidental histologic finding in 5% or less of tissue resected". Therefore, as every T1a is also a T1, Clinical Stage II is also applicable. The definition of Clinical Stages in the prostate section of TNM 6th edition thus contains an ambiguity, which was detected by means of the ontology.

4. Discussion

We successfully represented the desired TNM rules using an ontology with a simple expressivity profile, which keeps the classification of tumors computable.

The NCIt and SNOMED CT ontologies provide the general concepts involved in tumor staging: the values and descriptions for T, N, M and CS for each topography. However, NCIt does not contain all codes for all topographies. SNOMED CT, on the other hand, does not define which TNM edition its concepts refer to. Neither defines axioms for inferring the clinical stage.

The work by Dameron et al. focuses on the anatomical decomposition of a single topography, whereas the present work covers several topographies, focusing on the inference of clinical stage. Besides that, there is no description of the final ontology in the mentioned paper, and the links provided are not available [Dameron et al. 2006].

In the paper by Boeker et al., a very detailed description of breast cancer TNM definitions is formalized in a very expressive ontology. The main objective of their work seems to be the formal representation of clinical examination findings for each value of T, N and M, with links to the concepts of anatomical and tumoral invasion patterns.
That allowed the analysis of inconsistencies and inaccuracies in the definitions of TNM itself [Boeker et al. 2014]. However, the ontology at the time of this writing does not include the clinical stage classes, and thus does not provide axioms for their inference. Moreover, this ontology's high level of expressivity (SRI) would arguably make reasoning less efficient than with ALC for a given A-Box.

⁵ http://www.apache.org/licenses/LICENSE-2.0

The tests showed that the inference worked as expected, except in one case, in which the definition provided by the original reference is ambiguous. A related work [Boeker et al. 2014] also found similar ambiguities; this shows how ontologies can be used to prevent errors in classification definitions.

The presented ontology may be applied to validate existing databases or to classify tumors based on TNM values. Software for mapping relational databases to ontologies [Calvanese et al. 2011] [Bizer 2004] [Cullot et al. 2007] allows the present ontology and inference tools to be used on relational databases, the de facto industry standard. As it provides annotations for the meaning of individual T, N and M values for each topography, it may also serve as a reference for physicians and cancer registry workers.

As future work, the presented ontology may be completed to include all topographies and a full alignment with the NCIt ontology. Alignments with the TNM Ontology [Boeker et al. 2014] may also be of interest. Currently, there are annotations in both Portuguese and English, and other languages may be added. The ontology may be updated to represent the TNM 7th edition, possibly representing an alignment between it and the 6th edition, which may help database migration efforts. Finally, the pathological stage and other modifiers (such as stage post treatment) may also be implemented.

5. Conclusion

We showed that the presented ontology accurately represents the descriptions and inference rules for the selected topographies, fulfilling the main objective of this work. It may be useful in a number of tasks involving tumor staging. It is open source, allowing scrutiny and contributions from the scientific community. It can be linked to other TNM ontology efforts and well-established vocabularies, increasing its interoperability. Finally, it is lightweight to compute, making it a valuable tool to validate or complete TNM databases.

References

Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D. L., Patel-Schneider, P. F., and Stein, L. A. (2004). OWL Web Ontology Language Reference. Technical report, W3C, http://www.w3.org/TR/owl-ref/.

Bizer, C. (2004). D2RQ - treating non-RDF databases as virtual RDF graphs. In Proceedings of the 3rd International Semantic Web Conference (ISWC2004).

Boeker, M., Faria, R., and Schulz, S. (2014). A Proposal for an Ontology for the Tumor-Node-Metastasis Classification of Malignant Tumors: a Study on Breast Tumors. In Jansen, L., Boeker, M., Herre, H., and Loebe, F., editors, Ontologies and Data in Life Sciences, number 1, pages B1–B5, Freiburg.

Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez-Muro, M., Rosati, R., Ruzzi, M., and Savo, D. F. (2011). The Mastro system for ontology-based data access. Semantic Web Journal, 2(1):43–53. Listed among the 5 most cited papers in the first five years of the Semantic Web Journal.

Cullot, N., Ghawi, R., and Yétongnon, K. (2007). DB2OWL: A tool for automatic database-to-ontology mapping. In Ceci, M., Malerba, D., and Tanca, L., editors, SEBD, pages 491–494.

Dameron, O., Roques, E., Rubin, D., Marquet, G., and Burgun, A. (2006). Grading lung tumors using OWL-DL based reasoning. In 9th Intl. Protégé Conference, pages 1–4, Stanford, California.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–220.

Langlotz, C. P. (2006). RadLex: A new method for indexing online educational materials. RadioGraphics, 26(6):1595–1597. PMID: 17102038.

Lindberg, D. A., Humphreys, B. L., and McCray, A. T. (1993). The Unified Medical Language System. Methods of Information in Medicine, 32(4):281–291.

Marquet, G. and Dameron, O. (2007). Grading glioma tumors using OWL-DL and NCI thesaurus. AMIA Annual . . . , pages 508–512.

Nelson, S., Johnston, W. D., and Humphreys, B. (2001). Relationships in Medical Subject Headings (MeSH), volume 2 of Information Science and Knowledge Management, pages 171–184. Springer Netherlands.

Neuss, M. N., Desch, C. E., McNiff, K. K., Eisenberg, P. D., Gesme, D. H., Jacobson, J. O., Jahanzeb, M., Padberg, J. J., Rainey, J. M., Guo, J. J., and Simone, J. V. (2005). A process for measuring the quality of cancer care: The Quality Oncology Practice Initiative. Journal of Clinical Oncology, 23(25):6233–6239.

Rosse, C. and Mejino, J. L. V., Jr. (2003). A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of Biomedical Informatics, 36(6):478–500. Unified Medical Language System.

Schmoll, H.-J. (2003). F. L. Greene, D. L. Page, I. D. Fleming et al. (eds). AJCC Cancer Staging Manual, 6th edition. Annals of Oncology, 14(2):345–346.

Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A. L., and Rosse, C. (2005). Relations in biomedical ontologies. Genome Biology, 6(5):R46.

Sobin, L. and Wittekind, C. (2002). Classificação de Tumores Malignos. Wiley and Sons, New York, 6th edition.