Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA, August 7-10, 2018

Ontology-Enhanced Representations of Non-image Data in The Cancer Imaging Archive

Jonathan P. Bona, Tracy S. Nolan, Mathias Brochhausen
Department of Biomedical Informatics
University of Arkansas for Medical Sciences
Little Rock, Arkansas, United States
jpbona@uams.edu, tnolan@uams.edu, mbrochhausen@uams.edu

Abstract: The Cancer Imaging Archive (TCIA) hosts over 11 million de-identified medical images related to cancer for research reuse. These are organized around DICOM-format radiological collections that are grouped by disease type, modality, or research focus. Many collections also include diverse non-image datasets in a variety of formats without a common approach to representing the entities that the data are about. This paper describes work to make these diverse non-image data more accessible and usable by transforming them into integrated semantic representations using Open Biomedical Ontologies, highlights obstacles encountered in the data, and presents detailed representations of data found in select collections.

Keywords: cancer; imaging; ontology development; semantics

I. INTRODUCTION

Since 2011 the Cancer Imaging Archive [1] has been NCI’s primary resource for acquiring, curating, managing and distributing images and related data to support cancer research. TCIA hosts over 11 million de-identified medical images of cancer for research reuse, organized around DICOM-format radiological collections related by disease type, modality, or research focus. The PRISM (Platform for Imaging in Precision Medicine) initiative seeks to sustain and expand TCIA’s capabilities to meet the rapidly evolving requirements of cancer precision medicine research. Through discussions with investigators in the imaging and cancer research communities, and through review of TCIA helpdesk tickets, we have identified a number of near-term goals and challenges. These include enhanced support for reproducible research and data publication capabilities; expanded support for additional data types, including pathology data, and radiomics and pathomics feature sets; uniform management of non-image data; semantic query mechanisms and enhanced data exploration; and automatic curation of current and new data types.

Many TCIA collections include non-image data in a variety of formats, often as downloadable spreadsheet files without a common representation scheme. These include patient demographics, diagnoses, treatments, outcomes, TNM staging, gene assays and other test results, etc. Some collections provide data dictionaries or other documentation that aid the human reader in interpreting these data. However, these are not machine-interpretable, and hence the data are difficult to query. Complicating this is the use of different representation schemes in different collections to encode the same or similar information.

This paper describes work to make these diverse non-image data more accessible and usable. Our immediate aims are to: 1) make these data sets queryable; 2) make them computer-interpretable, and hence available for automated reasoning and more amenable to exploration and analysis; and 3) establish links between related data across collections and across data types.

To support these aims we are converting these data into common, semantically-enhanced representations using Open Biomedical Ontologies Foundry [2] resources, and integrating the results in a single repository with this shared representation, to facilitate queries such as “Which patients in the lung cancer collection have been diagnosed with metastatic colon cancer, and how was that diagnosis obtained?”, or “Which patients across multiple head and neck cancer collections have tumors specifically in their oropharynx, and have been diagnosed with human papillomavirus, and how were those diagnoses obtained?”

By using ontologies and semantic web technology we are making these data more readily available for query, automated reasoning, exploration, and analysis. TCIA users in general are not familiar enough with biomedical ontologies and semantic web technology to write SPARQL queries to access data. This semantic repository with transformed non-image data will serve as the back-end data store for user-friendly tools that support search and exploration of the data.

II. NON-IMAGE DATA IN TCIA COLLECTIONS

A. Overview

The Cancer Imaging Archive currently has 74 publicly-available collections. We reviewed and compared the de-identified non-image data provided with these collections as a first step toward crafting a semantic representation usable to represent the bulk of non-image data in TCIA collections.

A large group of 18 of the public collections is provided by The Cancer Genome Atlas [3]. These collections, whose names all start with “TCGA”, structure their data using a common, standardized representation scheme that is published as an RDF file in Turtle format [4]. TCGA linked data has also been exposed as a SPARQL endpoint [5]. Our work to integrate non-image data from all TCIA collections into a single repository will be made much easier in the case of TCGA collections because of their use of a standard scheme and semantic web technology. Throughout this paper we discuss and describe only non-TCGA public collections, because those are the collections that best illustrate the diversity of available data and data representations, and the need for improved representations.

Of the 56 non-TCGA public collections, 17 include downloadable non-image data (often labeled “clinical data”). This is in addition to, and separate from, the image metadata present in many collections. We have manually reviewed each of the files provided with these collections. This section provides a summary and discussion that illustrate the richness and diversity of the data available, and the diversity of representation schemes currently used. This diversity poses significant challenges to integration of the non-image data, but also offers a unique opportunity to vastly improve the usability of this data with semantic web technology and biomedical ontologies.

The current submission process for TCIA non-image data does not specify the use of any common data model or schema, or require adherence to any specified semantics. This leads to some of the submitted data being ambiguous or difficult to interpret. The semantically-rich representations that we are designing for this data, as presented in this paper, will become part of new TCIA submission tools that automate this curation as much as possible.

The non-image data contained in these 17 collections can be placed into 7 major categories: diagnosis, histology, genetic testing, demographics, treatment, morbidity, and neurological testing. The last of these is found in only one of the collections that currently provide non-imaging data. Most of these categories are further broken down into subcategories. For example, “treatment” is broken down into “primary: chemo”, “primary: surgery”, “primary: radiation”, and “adjuvant”. Table I below indicates for each of these 17 collections the presence or absence of data in each category and subcategory.

TABLE I. DIVERSITY OF NON-IMAGE DATA IN PUBLIC TCIA COLLECTIONS

The types of non-imaging data that exist for a collection are marked green in the table. Types of non-imaging data that do not exist in a collection are represented by blank cells.
If a collection provides data for at least one subtype of data, the major category is marked as existing. In a very few cases a major category is marked as existing even though none of its subtypes are part of the collection’s non-imaging data, because some data of that category exists. The diagonally striped cells signify data that was not described or identified sufficiently, and that is represented based on an assumption of the authors. One example of this is “laterality”. In the “BREAST_DIAGNOSIS” collection we find both laterality applied to the diagnosis (on which side the tumor is located) and laterality for MRI, which specifies the side for which an MRI was taken. Other collections did provide laterality information, but didn’t specify whether that was tumor localization information or imaging localization information. We assumed that this data represented the localization of the tumor, since that was most consistent with the context.

The following sections describe three of these collections that we have examined in more detail, highlighting specific hurdles presented by the representation schemes used. We then present ontology-based representations we have designed for use in our transformation of these data into a semantic repository.

B. LIDC-IDRI Collection

The Lung Image Database Consortium image collection (LIDC-IDRI)¹ [6, 7] contains non-image data for patients, many of whose lung cancers are the result of metastasis of other cancer types from locations other than their lungs. The data is provided as a spreadsheet labeled as “patient diagnoses”. The sheet has columns for a de-identified patient ID linking it to other data about this person (including images), a patient-level diagnosis, diagnosis method, a primary tumor site for metastatic disease, and similar diagnosis information about lung nodules. We use this sheet as a running example throughout this section, focusing on the patient-level diagnosis, including diagnosis method, and the primary metastatic tumor site.

An immediate obstacle to querying these data is the use of a terse coding system to indicate values. This system is presented as a key within column headers in the sheet itself (shown in Table II), making it available to a human reader, though not necessarily easy to interpret. This key is not computer-interpretable, making the data difficult to query even if it were extracted from the spreadsheet and used to populate a database table in this form. For example, in this representation scheme, a 3 in the patient-level diagnosis column indicates malignant metastatic disease, while a 3 in the diagnosis method column indicates that the relevant diagnosis was determined by surgical resection. Similar information is provided in separate columns for each identified lung nodule. To make matters worse, files with the same type of information in other collections use different encoding schemes, further complicating integrated querying and use of the data.

TABLE II. LIDC-IDRI PATIENT-LEVEL DIAGNOSIS KEY

Even fields in this file with more explicit entries can be unclear or ambiguous. For instance, the tumor site column in this file consists of short, free text (non-standardized) descriptions, as illustrated in the excerpt in Table III, which shows three consecutive entries (ID8 - ID10) that indicate metastatic colon cancer by using three different values in the tumor site column: “colon cancer,” “colon,” and “metastatic colon cancer.” In this case, all three contain the word “colon”, so a string-based text search for that term would locate these records. However, query and integration of these data will obviously benefit from translation to a computer-interpretable, shared representation that is explicit about which entities are involved.

TABLE III. TEN ENTRIES FOR DIAGNOSIS, METHOD, AND TUMOR SITE

Some of these tumor site entries do explicitly denote anatomical locations, containing only short words like ‘colon’ and ‘bladder.’ Others are descriptions that mention cancer types mixed with information that indicates locations (‘non small cell lung left lower lobe’, ‘uterine cancer’, ‘granular cell tumor of the trachea’). Some only name a disease type (‘lymphoma’, ‘adenocarcinoma’), or use an abbreviation that may allow a person with domain knowledge to infer the location, such as ‘HCC’ -- hepatocellular carcinoma, which occurs in the liver, or ‘NSCLC’ -- non-small cell lung cancer. As discussed more in the Methods section below, we found it necessary to manually curate an intermediate spreadsheet with location-denoting terms before this data could be converted to an OWL representation.

C. Two Head and Neck Cancer Collections

The Head-Neck-PET-CT collection² [8] contains non-image data, including diagnostic and treatment information for patients with head and neck cancer. The HNSCC (Head and Neck Squamous Cell Carcinoma) collection³ [9, 10] contains much of the same information. These collections overlap significantly in their contents, though with some notational differences. This section compares a subset of the non-image data provided with these two collections, focusing on a few key data types in these collections for which we have implemented ontology-based representations, as discussed more in the Methods section.

Both of these head and neck cancer collections contain additional types of data not discussed here, many of which we will also transform into semantic representations for our integrated repository as this project progresses. As shown in Tables IV and V, both collections include the biological sex of the patient, among other demographic data, as well as tumor staging information, HPV status, and an indication of the primary tumor location.

TABLE IV. EXCERPT FROM HEAD AND NECK SQUAMOUS CELL CARCINOMA COLLECTION

TABLE V. EXCERPT FROM HEAD-NECK-PET-CT COLLECTION

¹ http://dx.doi.org/10.7937/K9/TCIA.2015.LO9QL9SX
² http://doi.org/10.7937/K9/TCIA.2017.8oje5q00
³ http://doi.org/10.7937/K9/TCIA.2017.umz8dv6s

III. METHODS

We designed and built representations for these data using OBO Foundry ontologies, including the Human Disease Ontology [11] and the Uber Anatomy Ontology, Uberon [12]. Instances for individual entries are linked to ontology classes to explicitly represent locations, disease types, and diagnosis methods. These representations are used to transform data from spreadsheets from these three collections into OWL/RDF files that are loaded into a triple store database for reasoning and query. This section presents the details of these representations, and of the translation process.

A. Ontology-based Representation

Fig. 1 shows how we represent a patient’s positive HPV diagnosis in the head and neck collections. In this figure the labeled ovals stand for ontology classes. The smaller circles stand for anonymous instances, which are linked to their classes through rdf:type assertions. The rectangle stands for a labeled instance. HPV status is provided in these sheets without specific information about how it was determined, so we can assert only that an ‘OGMS: diagnostic process’ with some ‘OBI: assay’ has occurred (this includes physical examinations or other methods that are not strictly lab tests), and that the output was an ‘OGMS: diagnosis’ about an instance of ‘DO: papillomavirus infectious disease’ that inheres in the patient. Not shown in Fig. 1 is information about the patient’s cancer, though a link exists, both in reality and in our representation of this data, via the patient. The representation of this information for the head and neck collection data is nearly identical to the representation used for the lung collection, as shown in Fig. 2.

Fig. 1. Representing positive HPV status for a head and neck cancer patient

Fig. 2 shows our representation for a patient’s disease and diagnosis using data from the LIDC collection as an example. The patient record shown here is for a person whose colon cancer has spread to their lungs, as determined by a biopsy. This patient has two instances of ‘DO: cancer’, one that inheres in the patient’s ‘UBERON: colon’ and one that inheres in the ‘UBERON: lung’. An ‘OGMS: diagnostic process’ with some ‘OBIB: biopsy’ as part has produced as output an ‘OGMS: diagnosis’ that is about the patient. The biopsy evaluated a ‘UBERON: portion of tissue’ that was derived from an ‘EFO: neoplasm’ that was located in the patient’s lung. In this case the dataset does not contain more specific information about which type of cancer inheres in each location. Note that an OWL reasoner could infer more specific types for these instances from the assertions in Fig. 2, and from logical definitions in the Disease Ontology, concluding e.g. that the instance of cancer inhering in the patient’s lung is an instance of lung cancer.

Fig. 2. Disease and diagnosis for a lung cancer patient

B. Data Transformation and Populating Repository

As discussed above, the lung cancer collection uses some values that require manual interpretation by a human to identify which anatomical entities, if any, are specified. To facilitate the transformation of this collection into OWL, we built a spreadsheet listing all 110 unique values from the primary tumor site field, and used this to record and track the extent to which each value in that field indicates an anatomical location. Of these 110 primary tumor site entries, only 9 are short terms that precisely denote an anatomical location. 54 others explicitly mention an identifiable location, often as part of a description that also names the disease type. In total, including entries where the location can be inferred from use of a standard abbreviation, 76 out of 110 indicate a clear location for the primary tumor of the metastatic disease. For each of these, we manually located and recorded the matching Uberon class in the sheet for use in our ontology-based representation of the data. This secondary sheet was then used as input to a Python script to retrieve and record the correct anatomical classes for tumor sites even for those records where the literal value stored in the source sheet did not strictly identify a location.

To transform the two head and neck collections, manual curation of a secondary sheet was unnecessary because the tumor site entries in those two sheets contain only one of a few values: 'Larynx', 'Nasopharynx', 'Hypopharynx', 'Oropharynx', 'Glottis', 'Sinus', 'Oral cavity', 'unknown', 'CUP'. A value of 'CUP' indicates cancer of unknown primary, so it carries similar information to the value 'unknown'. The other seven values clearly denote anatomical locations found in Uberon.

These collection data sheets, including a secondary sheet for the lung collection, were processed with a Python script using the RDFLib library to build OWL individuals from the instance data contained in each sheet, asserting the prescribed relations among these individuals, and saving the results in an OWL file. As part of this process, the script reads the spreadsheets, determines which URIs are needed, and automatically generates OntoFox [13] requests for each external ontology used. OntoFox is a web-based term extraction tool that supports ontology reuse. Our script uses OntoFox to retrieve hierarchical information and select other details only for those classes and relations that are needed to represent the data. It invokes the ROBOT command line tool [14] to convert between RDF serializations, e.g. to convert OntoFox’s default RDF/XML output into Turtle format for ease of use with RDFLib. The resulting OWL files are added to a triple store, making them available for reasoning and query.

Fig. 3. Transforming non-image data

SPARQL query (prefix IRIs not shown) and its first five results:

    PREFIX inheres:
    PREFIX human:
    PREFIX rdfs:
    PREFIX identifier:
    PREFIX denotes:
    PREFIX oroph:
    PREFIX cancer:
    PREFIX has_part:
    PREFIX hpv:
    PREFIX disease:
    select ?idl {
      # the person and identifier
      ?person rdf:type human: .
      ?id denotes: ?person .
      ?id rdf:type identifier: .
      ?id rdfs:label ?idl .

      # the person has hpv
      ?hpv rdf:type hpv: .
      ?hpv inheres: ?person .

      # the person's oropharynx
      ?person has_part: ?o .
      ?o rdf:type oroph: .

      # cancer in the oropharynx
      ?d inheres: ?o .
      ?d rdf:type cancer: .
    } limit 5

    idl
    HNSCC-01-0050
    HNSCC-01-0054
    HN-HGJ-018
    HNSCC-01-0098
    HNSCC-01-0116

IV. RESULTS

The resulting triple store contains assertions linking patient identifiers to RDF instances representing patients, affected body parts, diagnoses, relations among those, etc. OBO Foundry ontologies provide the types (OWL classes) for these instances and define the relations (OWL object properties).

This database can be queried using SPARQL to identify patient records matching criteria based on fields that were previously inaccessible, and to run queries that operate across collections. For example, the query shown above gets a list of patient identifiers for patients who have been diagnosed with HPV, and who have also been diagnosed with a cancerous tumor in their oropharynx. This query is able to retrieve results from both the Head-Neck-PET-CT collection and the Head and Neck Squamous Cell Carcinoma collection because the relevant data are now represented in the same way in the triple store. This enhanced data is immediately available for simple reasoning tasks allowed by the use of ontologies, e.g. using the partonomic information built into Uberon to support queries at different levels of anatomical granularity.

V. DISCUSSION AND FUTURE WORK

The Cancer Imaging Archive contains a wealth of diverse non-image data that is currently difficult to work with because much of it, though publicly available, is locked away in spreadsheet files that must be downloaded and interpreted individually. As part of ongoing development work for TCIA and for the PRISM platform, we are examining the contents of these files, cataloging and characterizing the data therein, and designing realist ontology-based representations that explicitly represent the entities that these data are about.

The examples presented in this paper demonstrate the usefulness of ontologies and semantic web tools for knowledge representation to enable querying of otherwise opaque non-image data in these TCIA collections. We are expanding this work beyond the collections presented here to include more data from the archive. The graph-based nature of RDF stores allows us to incrementally add and link knowledge from different collections and files within them as the representation work proceeds, simplifying the task of integrating these data.

Because most users prefer not to write SPARQL queries, a next step is the development of user-friendly interfaces to help end users search, explore, and interpret these data. We also plan to provide ontology-driven submission tools that will automatically generate the same representations, allowing for seamless integration of new datasets.

VI. REFERENCES

[1] K. Clark et al., "The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository," Journal of Digital Imaging, vol. 26, no. 6, pp. 1045-1057, 2013.
[2] B. Smith et al., "The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration," Nature Biotechnology, vol. 25, no. 11, p. 1251, 2007.
[3] J. N. Weinstein et al., "The Cancer Genome Atlas Pan-Cancer Analysis Project," Nature Genetics, vol. 45, no. 10, pp. 1113-1120, 2013.
[4] The Cancer Genome Atlas. (2013). TCGA Roadmap. Available: https://old.datahub.io/dataset/tcga-roadmap
[5] H. F. Deus, D. F. Veiga, P. R. Freire, J. N. Weinstein, G. B. Mills, and J. S. Almeida, "Exposing the cancer genome atlas as a SPARQL endpoint," Journal of Biomedical Informatics, vol. 43, no. 6, pp. 998-1008, 2010.
[6] S. G. Armato et al., "The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans," Medical Physics, vol. 38, no. 2, pp. 915-931.
[7] S. Armato III et al., "Data from LIDC-IDRI," The Cancer Imaging Archive, 2015.
[8] M. Vallières et al., "Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer," Scientific Reports, vol. 7, p. 10117.
[9] M. A. Grossberg A, Elhalawani H, Bennett W, Smith K, Nolan T, Chamchod S, Kantor M, Browne T, Hutcheson K, Gunn G, Garden A, Frank S, Rosenthal D, Freymann J, Fuller C, "Data from Head and Neck Cancer CT Atlas," The Cancer Imaging Archive, 2017.
[10] M. A. Grossberg A, Elhalawani H, Bennett W, Smith K, Nolan T, Williams B, Chamchod S, Heukelom J, Kantor M, Browne T, Hutcheson K, Gunn G, Garden A, Morrison W, Frank S, Rosenthal D, Freymann J, Fuller C, "Imaging and Clinical Data Archive for Head and Neck Squamous Cell Carcinoma Patients Treated with Radiotherapy," The Cancer Imaging Archive, 2018.
[11] L. M. Schriml et al., "Disease Ontology: a backbone for disease semantic integration," Nucleic Acids Research, vol. 40, no. D1, pp. D940-D946, 2012.
[12] C. J. Mungall, C. Torniai, G. V. Gkoutos, S. E. Lewis, and M. A. Haendel, "Uberon, an integrative multi-species anatomy ontology," Genome Biology, vol. 13, no. 1, p. R5.
[13] Z. Xiang, M. Courtot, R. R. Brinkman, A. Ruttenberg, and Y. He, "OntoFox: web-based support for ontology reuse," BMC Research Notes, vol. 3, no. 1, p. 175, 2010.
[14] J. A. Overton, H. Dietze, S. Essaid, D. Osumi-Sutherland, and C. J. Mungall, "ROBOT: A command-line tool for ontology development," 2015, pp. 131-132.

This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. Under this contract the University of Arkansas is funded by Leidos Biomedical Research subcontract 16X011. Funding was also provided by U24CA215109.
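As an illustration of the transformation step described in the Methods section (free-text tumor site values resolved through a curated lookup, then turned into typed individuals linked by ontology relations), the following minimal sketch may help. It is not the authors' script: that script uses RDFLib, whereas this sketch uses plain tuples and N-Triples text so that it runs with the Python standard library alone. The `http://example.org/tcia/` namespace, the patient IDs P1 to P4, and all function names are hypothetical, the ontology IRIs are assumptions that should be checked against current ontology releases, and the label is attached directly to the person rather than to a separate identifier instance to keep the example short.

```python
# Illustrative IRIs; verify each against the current ontology releases before use.
OBO = "http://purl.obolibrary.org/obo/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/tcia/"   # hypothetical namespace for new individuals

COLON = OBO + "UBERON_0001155"      # assumed IRI for Uberon 'colon'
CANCER = OBO + "DOID_162"           # assumed IRI for DO 'cancer'
INHERES_IN = OBO + "RO_0000052"     # assumed IRI for 'inheres in'
HAS_PART = OBO + "BFO_0000051"      # assumed IRI for 'has part'

# Stand-in for the manually curated secondary sheet: free text -> anatomy class.
SITE_TO_CLASS = {
    "colon": COLON,
    "colon cancer": COLON,
    "metastatic colon cancer": COLON,
}


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def row_to_triples(row):
    """Build triples for one sheet row; return [] if the site is not curated."""
    site_class = SITE_TO_CLASS.get(row["tumor_site"].strip().lower())
    if site_class is None:
        return []  # e.g. 'lymphoma' names a disease type, not a location
    pid = row["patient_id"]
    person = iri(BASE + "person/" + pid)
    site = iri(BASE + "site/" + pid)
    disease = iri(BASE + "disease/" + pid)
    return [
        (person, iri(RDFS_LABEL), literal(pid)),
        (person, iri(HAS_PART), site),
        (site, iri(RDF_TYPE), iri(site_class)),
        (disease, iri(RDF_TYPE), iri(CANCER)),
        (disease, iri(INHERES_IN), site),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) term strings as N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical rows using the three spellings quoted in Section II.B.
ROWS = [
    {"patient_id": "P1", "tumor_site": "colon cancer"},
    {"patient_id": "P2", "tumor_site": "colon"},
    {"patient_id": "P3", "tumor_site": "metastatic colon cancer"},
    {"patient_id": "P4", "tumor_site": "lymphoma"},
]
GRAPH = [triple for row in ROWS for triple in row_to_triples(row)]
print(to_ntriples(GRAPH))
```

In this sketch the three differently worded colon entries all resolve to the same anatomy class, so a query over the class finds all of them, while the uncurated 'lymphoma' row yields no triples and is left for manual review.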