Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA                        1


     Ontology-Enhanced Representations of Non-image
           Data in The Cancer Imaging Archive

                                       Jonathan P. Bona, Tracy S. Nolan, Mathias Brochhausen
                                               Department of Biomedical Informatics
                                             University of Arkansas for Medical Sciences
                                                Little Rock, Arkansas, United States
                                     jpbona@uams.edu, tnolan@uams.edu, mbrochhausen@uams.edu


    Abstract: The Cancer Imaging Archive (TCIA) hosts over 11 million de-identified medical images
related to cancer for research reuse. These are organized around DICOM-format radiological
collections that are grouped by disease type, modality, or research focus. Many collections also
include diverse non-image datasets in a variety of formats without a common approach to
representing the entities that the data are about. This paper describes work to make these diverse
non-image data more accessible and usable by transforming them into integrated semantic
representations using Open Biomedical Ontologies, highlights obstacles encountered in the data, and
presents detailed representations of data found in select collections.

    Keywords: cancer; imaging; ontology development; semantics

I. INTRODUCTION

    Since 2011 the Cancer Imaging Archive [1] has been NCI’s primary resource for acquiring,
curating, managing and distributing images and related data to support cancer research. TCIA hosts
over 11 million de-identified medical images of cancer for research reuse, organized around
DICOM-format radiological collections related by disease type, modality, or research focus. The
PRISM (Platform for Imaging in Precision Medicine) initiative seeks to sustain and expand TCIA’s
capabilities to meet the rapidly evolving requirements of cancer precision medicine research.
Through discussions with investigators in the imaging and cancer research communities, and through
review of TCIA helpdesk tickets, we have identified a number of near-term goals and challenges.
These include enhanced support for reproducible research and data publication capabilities;
expanded support for additional data types, including pathology data, and radiomics and pathomics
feature sets; uniform management of non-image data; semantic query mechanisms and enhanced data
exploration; and automatic curation of current and new data types.

    Many TCIA collections include non-image data in a variety of formats, often as downloadable
spreadsheet files without a common representation scheme. These include patient demographics,
diagnoses, treatments, outcomes, TNM staging, gene assays and other test results, etc. Some
collections provide data dictionaries or other documentation that aid the human reader in
interpreting these data. However, these are not machine-interpretable, and hence are difficult to
query. Complicating this is the use of different representation schemes in different collections
to encode the same or similar information.

    This paper describes work to make these diverse non-image data more accessible and usable. Our
immediate aims are to: 1) make these data sets queryable; 2) make them computer-interpretable, and
hence available for automated reasoning and more amenable to exploration and analysis; and 3)
establish links between related data across collections and across data types.

    To support these aims we are converting these data into common, semantically-enhanced
representations using Open Biomedical Ontologies Foundry [2] resources, and integrating the results
in a single repository with this shared representation, to facilitate queries such as “Which
patients in a lung cancer collection have been diagnosed with metastatic colon cancer, and how was
that diagnosis obtained?”, or “Which patients across multiple head and neck cancer collections have
tumors specifically in their oropharynx, and have been diagnosed with human papillomavirus, and how
were those diagnoses obtained?”

    By using ontologies and semantic web technology we are making these data more readily available
for query, automated reasoning, exploration, and analysis. TCIA users in general are not familiar
enough with biomedical ontologies and semantic web technology to write SPARQL queries to access
data. This semantic repository with transformed non-image data will serve as the back-end data
store for user-friendly tools that support search and exploration of the data.

II. NON-IMAGE DATA IN TCIA COLLECTIONS

A. Overview
    The Cancer Imaging Archive currently has 74 publicly-available collections. We reviewed and
compared the de-identified non-image data provided with these collections as a first step toward
crafting a semantic representation that can cover the bulk of non-image data in TCIA collections.

    A large group of 18 of the public collections is provided by The Cancer Genome Atlas [3]. These
collections, whose names all start with “TCGA”, structure their data using a common, standardized
representation scheme that is published as an RDF file in Turtle format [4]. TCGA linked data has
also been exposed as a SPARQL endpoint [5]. Our work to integrate non-


       ICBO 2018                                                   August 7-10, 2018                                                     1


image data from all TCIA collections into a single repository will be made much easier in the case
of TCGA collections because of their use of a standard scheme and semantic web technology.
Throughout this paper we discuss and describe only non-TCGA public collections, because those are
the collections that best illustrate the diversity of available data and data representations, and
the need for improved representations.

    Of the 56 non-TCGA public collections, 17 include downloadable non-image data (often labeled
“clinical data”). This is in addition to, and separate from, the image metadata present in many
collections. We have manually reviewed each of the files provided with these collections. This
section provides a summary and discussion that illustrate the richness and diversity of the data
available, and the diversity of representation schemes currently used. This diversity poses
significant challenges to integration of the non-image data, but also poses a unique opportunity to
vastly improve the usability of this data with semantic web technology and biomedical ontologies.

    The current submission process for TCIA non-image data does not specify the use of any common
data model or schema, or require adherence to any specified semantics. This leads to some of the
submitted data being ambiguous, or difficult to interpret. The semantically-rich representations
that we are designing for this data, as presented in this paper, will become part of new TCIA
submission tools that automate this curation as much as possible.

    The non-image data contained in these 17 collections can be placed into 7 major categories:
diagnosis, histology, genetic testing, demographics, treatment, morbidity, and neurological
testing. The last of these is found in only one of the collections that currently provide
non-imaging data. Most of those categories are already broken down into subcategories. For example,
“treatment” is broken down into “primary: chemo”, “primary: surgery”, “primary: radiation”, and
“adjuvant”. Table I indicates for each of these 17 collections the presence or absence of data in
each category and subcategory. The types of non-imaging data that exist for a collection are


                                       TABLE I.     DIVERSITY OF NON-IMAGE DATA IN PUBLIC TCIA COLLECTIONS




marked green in the table. Types of non-imaging data that do not exist in a collection are
represented by blank cells. If a collection provides data for at least one subtype of data, the
major category is marked as existing. In a very few cases a major category is marked as existing
even though none of its subtypes are part of the collection’s non-imaging data, because some data
of that category exists. The diagonally striped cells signify data that was not described or
identified sufficiently; these are represented based on an assumption of the authors. One example
of this is “laterality”. In the “BREAST_DIAGNOSIS” collection we find both laterality applied to
the diagnosis (on which side the tumor is located) and laterality for MRI, which specifies for
which side an MRI was taken. Other collections did provide laterality information, but didn’t
specify whether that was tumor localization information or imaging localization information. We
assumed that this data represented the localization of the tumor, since that was most consistent
with the context.

    The following sections describe three of these collections that we have examined in more
detail, highlighting specific hurdles presented by the representation schemes used. We then present
ontology-based representations we have designed for use in our transformation of these data into a
semantic repository.

B. LIDC-IDRI Collection
    The Lung Image Database Consortium image collection (LIDC-IDRI,
http://dx.doi.org/10.7937/K9/TCIA.2015.LO9QL9SX) [6, 7] contains non-image data for patients, many
of whose lung cancers are the result of metastasis of other cancer types from locations other than
their lungs. The data is provided as a spreadsheet labeled as “patient diagnoses”. The sheet has
columns for a de-identified patient ID linking it to other data about this person (including
images), a patient-level diagnosis, diagnosis method, a primary tumor site for metastatic disease,
and similar diagnosis information about lung nodules. We use this sheet as a running example
throughout this section, focusing on the patient-level diagnosis, including diagnosis method, and
the primary metastatic tumor site.

    An immediate obstacle to querying these data is the use of a terse coding system to indicate
values. This system is presented as a key within column headers in the sheet itself (shown in
Table II), making it available to a human reader, though not necessarily easy to interpret. This
key is not computer-interpretable, making the data difficult to query even if it were extracted
from the spreadsheet and used to populate a database table in this form.

    TABLE II.  LIDC-IDRI PATIENT-LEVEL DIAGNOSIS KEY

    For example, in this representation scheme, a 3 in the patient-level diagnosis column indicates
malignant metastatic disease, while a 3 in the diagnosis method column indicates that the relevant
diagnosis was determined by surgical resection. Similar information is provided in separate columns
for each identified lung nodule. To make matters worse, files with the same type of information in
other collections use different encoding schemes, further complicating integrated querying and use
of the data.

    Even fields in this file with more explicit entries can be unclear or ambiguous. For instance,
the tumor site column in this file consists of short, free text (non-standardized) descriptions, as
illustrated in the excerpt in Table III, which shows three consecutive entries (ID8 - ID10) that
indicate metastatic colon cancer by using three different values in the tumor site column: “colon
cancer,” “colon,” and “metastatic colon cancer.” In this case, all three contain the word “colon”,
so a string-based text search for that term would locate these records. However, query and
integration of these data will obviously benefit from translation to a computer-interpretable,
shared representation that is explicit about which entities are involved.

    TABLE III.  TEN ENTRIES FOR DIAGNOSIS, METHOD, AND TUMOR SITE

    Some of these tumor site entries do explicitly denote anatomical locations, containing only
short words like ‘colon’ and ‘bladder.’ Others are descriptions that mention cancer types mixed
with information that indicates locations (‘non small cell lung left lower lobe’, ‘uterine cancer’,
‘granular cell tumor of the trachea’). Some only name a disease type (‘lymphoma’,
‘adenocarcinoma’), or use an abbreviation that may allow a person with domain knowledge to infer
the location, such as ‘HCC’ (hepatocellular carcinoma, which occurs in the liver) or ‘NSCLC’
(non-small cell lung cancer). As discussed more in the Methods section below, we found it necessary
to manually curate an intermediate spreadsheet with location-denoting terms before this data could
be converted to an OWL representation.

C. Two Head and Neck Cancer Collections
    The Head-Neck-PET-CT collection (http://doi.org/10.7937/K9/TCIA.2017.8oje5q00) [8] contains
non-image data, including diagnostic and treatment information for patients with head and neck
cancer. The HNSCC (Head and Neck Squamous Cell Carcinoma) collection
(http://doi.org/10.7937/K9/TCIA.2017.umz8dv6s) [9, 10] contains much of the same information. These
collections overlap significantly in their contents, though with some notational differences. This
section compares a subset of the non-image data provided with these two collections, focusing on a
few key data types in these




collections for which we have implemented ontology-based representations, as discussed more in the
Methods section.

    Both of these head and neck cancer collections contain additional types of data not discussed
here, many of which we also will transform into semantic representations for our integrated
repository as this project progresses. As shown in Tables IV and V below, both collections include
the biological sex of the patient, among other demographic data, as well as tumor staging
information, HPV status, and an indication of the primary tumor location.

    TABLE IV.  EXCERPT FROM HEAD AND NECK SQUAMOUS CELL CARCINOMA COLLECTION

    TABLE V.  EXCERPT FROM HEAD-NECK-PET-CT COLLECTION

III. METHODS

    We designed and built representations for these data using OBO Foundry ontologies, including
the Human Disease Ontology [11] and the Uber Anatomy Ontology, Uberon [12]. Instances for
individual entries are linked to ontology classes to explicitly represent locations, disease types,
and diagnosis methods. These representations are used to transform data from spreadsheets from
these three collections into OWL/RDF files that are loaded into a triple store database for
reasoning and query. This section presents the details of these representations, and of the
translation process.

    Fig. 1. Representing positive HPV status for a head and neck cancer patient

A. Ontology-based Representation
    Fig. 1 shows how we represent a patient’s positive HPV diagnosis in the head and neck
collections. In this figure the labeled ovals stand for ontology classes. The smaller circles stand
for anonymous instances, which are linked to their classes through rdf:type assertions. The
rectangle stands for a labeled instance. HPV status is provided in these sheets without specific
information about how it was determined, so we can assert only


                                                 Fig. 2. Disease and diagnosis for a lung cancer patient




that an ‘OGMS: diagnostic process’ with some ‘OBI: assay’ as part has occurred (this includes
physical examinations or other methods that are not strictly lab tests), and that the output was an
‘OGMS: diagnosis’ about an instance of ‘DO: papillomavirus infectious disease’ that inheres in the
patient. Not shown in Fig. 1 is information about the patient’s cancer, though a link exists, both
in reality and in our representation of this data, via the patient. The representation of this
information for head and neck collection data is nearly identical to the representation used for
the lung collection, as shown in Fig. 2.

    Fig. 2 shows our representation for a patient’s disease and diagnosis using data from the LIDC
collection as an example. The patient record shown here is for a person whose colon cancer has
spread to their lungs, as determined by a biopsy. This patient has two instances of ‘DO: cancer’,
one that inheres in the patient’s ‘UBERON: colon’ and one that inheres in the ‘UBERON: lung’. An
‘OGMS: diagnostic process’ with some ‘OBIB: biopsy’ as part has produced as output an ‘OGMS:
diagnosis’ that is about the patient. The biopsy evaluated an ‘UBERON: portion of tissue’ that was
derived from an ‘EFO: neoplasm’ that was located in the patient’s lung. In this case the dataset
does not contain more specific information about which type of cancer inheres in each location.
Note that an OWL reasoner could infer more specific types for these instances from the assertions
in Fig. 2, and from logical definitions in the Disease Ontology, concluding e.g. that the instance
of cancer inhering in the patient’s lung is an instance of lung cancer.

B. Data Transformation and Populating Repository
    As discussed above, the lung cancer collection uses some values that require manual
interpretation by a human to identify which anatomical entities, if any, are specified. To
facilitate the transformation of this collection into OWL, we built a spreadsheet listing all 110
unique values from the primary tumor site field, and used this to record and track the extent to
which each value in that field indicates an anatomical location. Of these 110 primary tumor site
entries, only 9 are short terms that precisely denote an anatomical location. 54 others explicitly
mention an identifiable location, often as part of a description that also names the disease type.
In total, including entries where the location can be inferred from use of a standard abbreviation,
76 out of 110 indicate a clear location for the primary tumor of the metastatic disease. For each
of these, we manually located and recorded the matching Uberon class in the sheet for use in our
ontology-based representation of the data. This secondary sheet was then used as input to a Python
script to retrieve and record the correct anatomical classes for tumor sites even for those records
where the literal value stored in the source sheet did not strictly identify a location.

    To transform the two head and neck collections, manual curation of a secondary sheet was
unnecessary because the tumor site entries in those two sheets contain only one of a few values:
‘Larynx’, ‘Nasopharynx’, ‘Hypopharynx’, ‘Oropharynx’, ‘Glottis’, ‘Sinus’, ‘Oral cavity’, ‘unknown’,
‘CUP’. A value of ‘CUP’ indicates cancer of unknown primary, so it carries similar information to
the value ‘unknown’. The other seven values clearly denote anatomical locations found in Uberon.

    These collection data sheets, including a secondary sheet for the lung collection, were
processed with a Python script using the RDFLib library to build OWL individuals from the instance
data contained in each sheet, asserting the prescribed relations among these individuals, and
saving the results in an OWL file. As part of this process, the script reads the spreadsheets,
determines which URIs are needed, and automatically generates OntoFox [13] requests for each
external ontology used. OntoFox is a web-based term extraction tool that supports ontology reuse.
Our script uses OntoFox to retrieve hierarchical information and select other details only for
those classes and relations that are needed to represent the data. It invokes the ROBOT command
line tool [14] to convert between RDF serializations, e.g. to convert OntoFox’s default RDF/XML
output into Turtle format for ease of use with RDFLib. The resulting OWL files are added to a
triple store, making them available for reasoning and query.

    Fig. 3. Transforming non-image data

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX inheres: <http://purl.obolibrary.org/obo/RO_0000052>
    PREFIX human: <http://purl.obolibrary.org/obo/NCBITaxon_9606>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX identifier: <http://purl.obolibrary.org/obo/IAO_0020000>
    PREFIX denotes: <http://purl.obolibrary.org/obo/IAO_0000219>
    PREFIX oroph: <http://purl.obolibrary.org/obo/UBERON_0001729>
    PREFIX cancer: <http://purl.obolibrary.org/obo/DOID_162>
    PREFIX has_part: <http://purl.obolibrary.org/obo/BFO_0000051>
    PREFIX hpv: <http://purl.obolibrary.org/obo/DOID_11166>
    PREFIX disease: <http://purl.obolibrary.org/obo/DOID_4>
    select ?idl {
      # the person and identifier
      ?person rdf:type human: .
      ?id denotes: ?person .
      ?id rdf:type identifier: .
      ?id rdfs:label ?idl .

      # the person has hpv
      ?hpv rdf:type hpv: .
      ?hpv inheres: ?person .

      # the person's oropharynx
      ?person has_part: ?o .
      ?o rdf:type oroph: .

      # cancer in the oropharynx
      ?d inheres: ?o .
      ?d rdf:type cancer: .
    } limit 5

    idl
    HNSCC-01-0050
    HNSCC-01-0054
    HN-HGJ-018
    HNSCC-01-0098
    HNSCC-01-0116

IV. RESULTS

    The resulting triple store contains assertions linking patient identifiers to RDF instances
representing patients, affected body parts, diagnoses, relations among those, etc. OBO Foundry
Ontologies provide the types (OWL classes) for these instances and define the relations (OWL object
properties).

    This database can be queried using SPARQL to identify patient records matching criteria based
on these fields that were previously inaccessible, as well as to run queries that operate across




collections. For example, the query shown above gets a list of patient identifiers for patients who
have been diagnosed with HPV, and who have also been diagnosed with a cancerous tumor in their
oropharynx. This query is able to retrieve results from both the Head-Neck-PET-CT collection and
the Head and Neck Squamous Cell Carcinoma collection because the relevant data are now represented
in the same way in the triple store. This enhanced data is immediately available for simple
reasoning tasks allowed by the use of ontologies, e.g. using the partonomic information built into
UBERON to support queries at different levels of anatomical granularity.

V. DISCUSSION AND FUTURE WORK

    The Cancer Imaging Archive contains a wealth of diverse non-image data that is currently
difficult to work with because much of it, though publicly available, is locked away in spreadsheet
files that must be downloaded and interpreted individually. As part of ongoing development work for
TCIA and for the PRISM platform, we are examining the contents of these files, cataloging and
characterizing the data therein, and designing realist ontology-based representations that
explicitly represent the entities that these data are about.

    The examples presented in this paper demonstrate the usefulness of ontologies and semantic web
tools for knowledge representation to enable querying of otherwise opaque non-image data in these
TCIA collections. We are expanding this work beyond the collections presented here to include more
data from the archive. The graph-based nature of RDF stores allows us to incrementally add and link
knowledge from different collections and files within them as the representation work proceeds,
simplifying the task of integrating these data.

    Because most users prefer not to write SPARQL queries, a next step is the development of
user-friendly interfaces to help end users search, explore, and interpret these data. We also plan
to provide ontology-driven submission tools that will automatically generate the same
representations, allowing for seamless integration of new datasets.

VI. REFERENCES

[1]  K. Clark et al., "The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public
     Information Repository," Journal of Digital Imaging, vol. 26, no. 6, pp. 1045-1057, 2013.
[2]  B. Smith et al., "The OBO Foundry: coordinated evolution of ontologies to support biomedical
     data integration," Nature Biotechnology, vol. 25, no. 11, p. 1251, 2007.
[3]  J. N. Weinstein et al., "The Cancer Genome Atlas Pan-Cancer Analysis Project," Nature
     Genetics, vol. 45, no. 10, pp. 1113-1120, 2013.
[4]  The Cancer Genome Atlas. (2013). TCGA Roadmap. Available:
     https://old.datahub.io/dataset/tcga-roadmap
[5]  H. F. Deus, D. F. Veiga, P. R. Freire, J. N. Weinstein, G. B. Mills, and J. S. Almeida,
     "Exposing the cancer genome atlas as a SPARQL endpoint," Journal of Biomedical Informatics,
     vol. 43, no. 6, pp. 998-1008, 2010.
[6]  S. G. Armato et al., "The Lung Image Database Consortium (LIDC) and Image Database Resource
     Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans," Medical
     Physics, vol. 38, no. 2, pp. 915-931.
[7]  S. Armato III et al., "Data from LIDC-IDRI," The Cancer Imaging Archive, 2015.
[8]  M. Vallières et al., "Radiomics strategies for risk assessment of tumour failure in
     head-and-neck cancer," Scientific Reports, vol. 7, p. 10117.
[9]  M. A. Grossberg A, Elhalawani H, Bennett W, Smith K, Nolan T, Chamchod S, Kantor M, Browne T,
     Hutcheson K, Gunn G, Garden A, Frank S, Rosenthal D, Freymann J, Fuller C, "Data from Head and
     Neck Cancer CT Atlas," The Cancer Imaging Archive, 2017.
[10] M. A. Grossberg A, Elhalawani H, Bennett W, Smith K, Nolan T, Williams B, Chamchod S,
     Heukelom J, Kantor M, Browne T, Hutcheson K, Gunn G, Garden A, Morrison W, Frank S,
     Rosenthal D, Freymann J, Fuller C, "Imaging and Clinical Data Archive for Head and Neck
     Squamous Cell Carcinoma Patients Treated with Radiotherapy," The Cancer Imaging Archive, 2018.
[11] L. M. Schriml et al., "Disease Ontology: a backbone for disease semantic integration,"
     Nucleic Acids Research, vol. 40, no. D1, pp. D940-D946, 2012.
[12] C. J. Mungall, C. Torniai, G. V. Gkoutos, S. E. Lewis, and M. A. Haendel, "Uberon, an
     integrative multi-species anatomy ontology," Genome Biology, vol. 13, no. 1, p. R5.
[13] Z. Xiang, M. Courtot, R. R. Brinkman, A. Ruttenberg, and Y. He, "OntoFox: web-based support
     for ontology reuse," BMC Research Notes, vol. 3, no. 1, p. 175, 2010.
[14] J. A. Overton, H. Dietze, S. Essaid, D. Osumi-Sutherland, and C. J. Mungall, "ROBOT: A
     command-line tool for ontology development," 2015, pp. 131-132.
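
    As an illustration of the curated secondary sheet described in Section III.B, the following
minimal sketch shows how free-text tumor site values might be resolved to Uberon classes through a
manually maintained lookup. It is not the authors' script: the column names, the sample rows, and
the function names are assumptions made for this example, and the Uberon identifiers shown (colon,
liver) should be checked against the current Uberon release before use.

```python
import csv
import io

OBO = "http://purl.obolibrary.org/obo/"

# Hypothetical excerpt of a curated secondary sheet: each free-text tumor site
# value is paired with an Uberon class ID chosen by a human curator. The ID is
# left blank when the entry names only a disease type and gives no location.
CURATED_SHEET = """site_text,uberon_id
colon,UBERON_0001155
colon cancer,UBERON_0001155
metastatic colon cancer,UBERON_0001155
HCC,UBERON_0002107
lymphoma,
"""


def load_site_map(sheet_text):
    """Build {normalized free-text value: Uberon IRI, or None if no location}."""
    site_map = {}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        key = row["site_text"].strip().lower()
        uberon_id = row["uberon_id"].strip()
        site_map[key] = OBO + uberon_id if uberon_id else None
    return site_map


def resolve_site(site_map, raw_value):
    """Return the curated Uberon IRI for a raw cell value, or None if unknown."""
    return site_map.get(raw_value.strip().lower())


site_map = load_site_map(CURATED_SHEET)
for raw in ["colon cancer", "colon", "metastatic colon cancer", "HCC", "lymphoma"]:
    print(f"{raw!r} -> {resolve_site(site_map, raw)}")
```

    Keeping the mapping in a sheet rather than in code means a curator can extend or correct it
without touching the transformation script, and values with no identifiable location stay
explicitly unmapped instead of being guessed.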


  This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health
  under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the
  Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply
  endorsement by the U.S. Government. Under this contract the University of Arkansas is funded by Leidos Biomedical Research
  subcontract 16X011. Funding was also provided by U24CA215109.
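
    To make the graph pattern behind the HPV and oropharynx query in Section III.B concrete, the
sketch below builds the corresponding assertions as plain Python tuples and matches them without a
triple store. The class and relation IRIs are the ones declared in the paper's query prefixes; the
instance namespace, the patient identifiers, the placeholder site class, and the function names are
hypothetical, and the real pipeline uses RDFLib and SPARQL rather than this hand-written matcher.

```python
OBO = "http://purl.obolibrary.org/obo/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# IRIs taken from the PREFIX declarations of the query in the paper.
INHERES_IN = OBO + "RO_0000052"
HAS_PART = OBO + "BFO_0000051"
DENOTES = OBO + "IAO_0000219"
HUMAN = OBO + "NCBITaxon_9606"
IDENTIFIER = OBO + "IAO_0020000"
OROPHARYNX = OBO + "UBERON_0001729"
CANCER = OBO + "DOID_162"
HPV = OBO + "DOID_11166"

BASE = "http://example.org/tcia/"   # hypothetical namespace for instances
OTHER_SITE = BASE + "OtherSite"     # hypothetical stand-in for another body part


def patient_triples(patient_id, site_class, hpv_positive):
    """Assert one patient, its identifier, a body part, and a cancer in that part."""
    person = BASE + patient_id + "/person"
    ident = BASE + patient_id + "/identifier"
    site = BASE + patient_id + "/site"
    cancer = BASE + patient_id + "/cancer"
    triples = {
        (person, RDF_TYPE, HUMAN),
        (ident, RDF_TYPE, IDENTIFIER),
        (ident, DENOTES, person),
        (ident, RDFS_LABEL, patient_id),  # label kept as a plain string here
        (person, HAS_PART, site),
        (site, RDF_TYPE, site_class),
        (cancer, RDF_TYPE, CANCER),
        (cancer, INHERES_IN, site),
    }
    if hpv_positive:
        hpv = BASE + patient_id + "/hpv"
        triples |= {(hpv, RDF_TYPE, HPV), (hpv, INHERES_IN, person)}
    return triples


def hpv_oropharynx_ids(graph):
    """Return identifier labels of persons with HPV and a cancer in their oropharynx."""
    def subjects(pred, obj):
        return {s for s, p, o in graph if p == pred and o == obj}

    def objects(subj, pred):
        return {o for s, p, o in graph if s == subj and p == pred}

    labels = set()
    for person in subjects(RDF_TYPE, HUMAN):
        has_hpv = any(person in objects(h, INHERES_IN) for h in subjects(RDF_TYPE, HPV))
        oroph_parts = objects(person, HAS_PART) & subjects(RDF_TYPE, OROPHARYNX)
        has_cancer = any(objects(d, INHERES_IN) & oroph_parts
                         for d in subjects(RDF_TYPE, CANCER))
        if has_hpv and has_cancer:
            for ident in subjects(DENOTES, person) & subjects(RDF_TYPE, IDENTIFIER):
                labels |= objects(ident, RDFS_LABEL)
    return sorted(labels)


graph = (patient_triples("EX-001", OROPHARYNX, True)
         | patient_triples("EX-002", OTHER_SITE, True)
         | patient_triples("EX-003", OROPHARYNX, False))
print(hpv_oropharynx_ids(graph))
```

    Because every collection is transformed into the same pattern of assertions, one matcher (or
one SPARQL query) applies to records regardless of which source spreadsheet they came from.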

