<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology for Documentation of Variable and Data Source Selection Process to Support Integrative Data Analysis in Cancer Outcomes Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hansi Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Guo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiang Bian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Florida</institution>
          ,
          <addr-line>Gainesville FL 08544</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To improve cancer prognosis and survival, it is crucial to gain a comprehensive view of potential risk factors (RFs) associated with cancer outcomes (e.g., stage at diagnosis, cancer survival). According to the National Institute on Minority Health and Health Disparities (NIMHD) Research Framework, cancer outcomes are influenced by RFs from multiple levels (e.g., individual, interpersonal) and multiple domains (e.g., biological, behavioral). Prior research on RFs of cancer survival, however, has primarily focused on RFs from the individual level (e.g., tumor characteristics) due to the lack of integrated datasets that contain multi-level, multi-domain RFs. It is important to pool RFs from heterogeneous data sources, so that we can examine as many RFs as possible in a multi-level integrative data analysis (IDA). However, RF selection and data integration are inconsistently performed and poorly documented in current cancer research, which threatens scientific reproducibility. Therefore, in this paper, we developed a preliminary reporting protocol for RF variable and data source selection based on our previous cancer survival research. Our protocol is informed by the NIMHD framework, which provides guidance and promotes structured thinking on identifying multi-level cancer RFs. Further, we propose an ontology-based approach to document RF variable and data source selection so that it is (1) explicitly modeled with a shared, controlled vocabulary, (2) understandable to humans and executable by computers, and (3) adaptive to changes when the process is being refined.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology</kwd>
        <kwd>Integrative Data Analysis</kwd>
        <kwd>Cancer Outcomes Research</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the United States (US), cancer is the second leading cause of death and is responsible for 1
in every 4 deaths [1]. The lifetime probability of being diagnosed with cancer is 39.7%
for men and 37.6% for women [2]. To improve prognosis and survival,
the first and most crucial step is to gain a comprehensive view of potential risk factors
(RFs) associated with various cancer outcomes such as the stage at diagnosis (the most
important prognostic factor [3, 4]) and survival.</p>
      <p>
        As recognized by the National Institute on Minority Health and Health Disparities
(NIMHD) Research Framework [5], individuals are embedded within the larger social
system and constrained by the physical environment they live in. Within this
framework, cancer outcomes are influenced by RFs from multiple levels (i.e., individual,
interpersonal, community, and societal) and multiple domains (i.e., biological,
behavioral, physical/built environment, sociocultural environment, and healthcare system).
Prior research on RFs of cancer outcomes, however, has primarily focused on factors
from the individual level (e.g., tumor characteristics) due to the lack of integrated
datasets that contain multi-level, multi-domain RFs. Very few studies have explored
contextual-level RFs (e.g., access to health care services), and to our knowledge no study has
comprehensively explored all possible RFs together. To do so, it is important to pool RFs
from heterogeneous data sources through data integration, so that we can examine as
many RFs as possible in a multi-level integrative data analysis (IDA).
However, RF selection and data integration are inconsistently performed and poorly
documented, threatening transparency and reproducibility. When reporting research, it
is critical to document the steps that were followed to select, integrate, and process data,
so that others can repeat the same steps and reproduce the findings. In this paper, based
on our previous experience with multi-level IDAs [6], we developed a preliminary
reporting protocol for RF variable and data source selection. Our protocol is informed
by the NIMHD framework, which provides guidance on identifying multi-level RFs.
Further, we propose an ontology-based approach so that the selection process is (1)
explicitly modeled with a shared, controlled vocabulary, (2) understandable to humans and
executable by computers, and (3) adaptive to changes when the process is being refined.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>A reporting protocol for cancer risk factor selection, data source selection, and data integration informed by a multi-level IDA case study</title>
        <p>In a previous study, we assessed the effect of data integration on the predictive ability of cancer
survival models [6] and created a semantic data integration (SDI) framework [7] to pool
multi-level RFs from heterogeneous data sources to support IDA. Table 1 lists the selected RFs and
their data sources. The case study showed that a number of variable selection and data integration
steps need to be clearly documented. For example, the rurality of the area where an individual
resides has different representations depending on whether one uses the rural-urban commuting
area definition [8] (i.e., 10 levels from rural to metropolitan) or the NCHS urban-rural
classification [9] (i.e., a hierarchical schema with 7 categories). It is important to document how rurality
was defined, as different representations of the same variable or concept have differential impacts
on the predictive ability of the survival model. Further, a number of data assumptions were made
because the datasets were collected over different time periods and on different populations. For
example, the Florida Cancer Data System data include cancer patients from 1996 to 2010, while
the US Census data we used were from the general population in 2010. Thus, we
assumed that the area-level characteristics derived from the Census data were applicable across
different time periods. Without clear documentation of such assumptions and choices, other
researchers generally would not have a clear picture of these data integration nuances.
        </p>
        <p>Through this multi-level IDA case study, we realized that ensuring the transparency
and reproducibility of our study requires documenting four key elements: multi-level RF selection
choices (e.g., individual smoking status vs county-level smoking rate), data source
selection (e.g., individual-level data from the Florida Cancer Data System [FCDS] and
contextual-level data from the US Census), integration (e.g., data integration strategies
and use cases), and processing steps
(e.g., calculating body mass index [BMI] from weight and height vs using a
calculated BMI field that came with the raw data).
Through discussions with expert biostatisticians, data analysts, and cancer outcomes
researchers, we summarized the typical IDA process and developed a prototype
reporting protocol for RF variable and data source selection. Further, we propose to inform
the multi-level and multi-domain RF selection process with the NIMHD research
framework, so that investigators can structurally and comprehensively identify relevant
RFs and data sources in their IDA studies.</p>
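        <p>As an illustration of the kind of documentation described above, the following Python sketch records three of the choices from the case study (how rurality was defined, whether BMI was derived or taken from a supplied field, and the Census time-period assumption) as structured records. It is a minimal sketch under stated assumptions, not the implementation used in the case study: the VariableDecision record, its field names, the bmi_from_weight_height function, and the "hypothetical registry" source label are all introduced here only for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class VariableDecision:
    """One documented choice about how a risk-factor variable was obtained.

    Hypothetical record layout, for illustration only.
    """
    variable: str         # analytic variable name
    level: str            # e.g. "individual" or "community"
    source: str           # data source the variable came from
    definition: str       # which representation or derivation was used
    assumption: str = ""  # any assumption the analyst made


def bmi_from_weight_height(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


decisions = [
    VariableDecision(
        variable="bmi",
        level="individual",
        source="hypothetical registry",  # placeholder, not a real dataset name
        definition="derived from weight and height, not the supplied BMI field",
    ),
    VariableDecision(
        variable="rurality",
        level="community",
        source="rural-urban commuting area codes",
        definition="rural-urban commuting area definition, 10 levels",
    ),
    VariableDecision(
        variable="area-level characteristics",
        level="community",
        source="US Census 2010",
        definition="as published",
        assumption="2010 values assumed applicable to patients from 1996 to 2010",
    ),
]

# A machine-readable companion to the narrative description of the choices.
report = json.dumps([asdict(d) for d in decisions], indent=2)
print(report)
```
]]></preformat>
        <p>Writing each choice down as a record, rather than only in free text, is one possible way to make assumptions such as the Census time-period assumption visible to other researchers.</p>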
      </sec>
      <sec id="sec-2-2">
        <title>Ontology for documentation of variable/data source selection (ODVDS)</title>
        <p>Scope. The scope of ODVDS was to standardize and document the selection,
integration and processing steps of RF variables and data sources to support IDAs guided by
the NIMHD research framework for cancer outcomes research.</p>
        <p>Approach. Using the Basic Formal Ontology (BFO) as the top-level ontology, we
first developed ODVDS with a top-down approach, starting by
identifying candidate entities (classes and relations) based on the reporting protocol for RF
variable and data source selection. Following best practices, we searched the
NCBO BioPortal [10] for existing, widely accepted ontologies and reused their classes
and relations where applicable. We also took a bottom-up approach that started with defining
the most specific classes and then grouping similar classes into
more general concepts. The bottom-up approach helped us determine what new classes
and relations are needed to fully represent the IDA process.</p>
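        <p>The bottom-up grouping of specific classes under more general concepts can be pictured with a small Python sketch. It is only an illustration of subclass-of links and how they are traversed: the class names below are hypothetical placeholders and are not the actual ODVDS classes, and a real ontology would be authored in an ontology language rather than a Python dictionary.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical subclass-of links (child -> parent); illustrative names only,
# not the classes defined in ODVDS.
SUBCLASS_OF = {
    "individual-level predictor": "predictor variable",
    "contextual-level predictor": "predictor variable",
    "predictor variable": "variable",
    "outcome variable": "variable",
    "variable": "entity",
    "data source": "entity",
}


def ancestors(cls: str) -> list[str]:
    """Walk subclass-of links from a class up to the root, nearest parent first."""
    chain = []
    seen = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        if cls in seen:
            raise ValueError("cycle in class hierarchy")
        seen.add(cls)
        chain.append(cls)
    return chain


def children(parent: str) -> list[str]:
    """Direct subclasses of a class, i.e. the specific classes grouped under it."""
    return sorted(child for child, p in SUBCLASS_OF.items() if p == parent)


print(ancestors("individual-level predictor"))
print(children("predictor variable"))
```
]]></preformat>
        <p>In this sketch, grouping two specific predictor classes under one general class corresponds to the bottom-up step, while starting from the root and refining downward corresponds to the top-down step.</p>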
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Result</title>
      <sec id="sec-3-1">
        <title>A reporting protocol for RF variable and data source selection</title>
        <p>
          Informed by the NIMHD research framework, our preliminary reporting protocol
consists of two parts, as shown in Fig 1(a): reporting (1) the objective of the study, including
the background, rationale, and hypotheses; and (2) the study design for variable and data
source selection. The selection process consists of 5 key steps: (1) set up the outcome
variables (i.e., primary and, if necessary, secondary outcomes); (2) for each outcome
variable, follow an iterative process (loop A in Fig 1(a)) to determine its data sources
according to the NIMHD framework. For example, if the outcome of
interest is an individual’s lung cancer risk, we first identify potential data sources
that contain individual-level patient data where lung cancer incidence data are
available. Then, based on the cohort criteria and other information such as sample size and
data range (e.g., time range and geographic information) of the potential data sources,
we determine all qualified data sources; (3) determine the individual-level
predictors and covariates of the study; (4) for each individual-level predictor, follow loop
B in Fig 1(a) to identify the different levels/domains of predictors according to the NIMHD
framework. Note that multiple variables often exist for the same predictor
across different data sources; thus, it is important to contrast and consolidate a new
predictor with the existing selected predictors to resolve conflicts. Nevertheless, this is
often a difficult choice, and these “duplicate” variables might all need to be tested in
models before a selection is finalized; and (5) follow loop C in Fig 1(a) to identify
additional contextual-level predictors and data sources of interest.
        </p>
        <p>
          Based on the reporting protocol above, we identified the necessary classes and
properties to represent the IDA process. We reused classes from the following existing
well-known ontologies: Ontology for Biomedical Investigations (OBI), National Cancer
Institute Thesaurus (NCIt), Data Science Education Ontology (DSEO), and Relations
Ontology (RO). We created 20 new classes in ODVDS. Overall, we identified 33 classes
and 5 properties. Fig 1(b) shows the class hierarchy of ODVDS.
        </p>
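        <p>One way to picture how the NIMHD framework structures the selection steps is to place each selected predictor in a level-by-domain cell and list the cells that remain empty. The Python sketch below does this for the four levels and five domains named in the framework. It is a hedged illustration only: the selected list, the cell assignments, the data source labels, and the uncovered_cells function are hypothetical and do not reproduce the selections of the case study.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import product

# Levels and domains of the NIMHD research framework, as listed in the text.
LEVELS = ("individual", "interpersonal", "community", "societal")
DOMAINS = (
    "biological",
    "behavioral",
    "physical/built environment",
    "sociocultural environment",
    "healthcare system",
)

# Hypothetical selections: (predictor, level, domain, data source).
# The cell assignments and source labels are illustrative assumptions.
selected = [
    ("tumor characteristics", "individual", "biological", "cancer registry"),
    ("smoking status", "individual", "behavioral", "cancer registry"),
    ("county-level smoking rate", "community", "behavioral", "county survey"),
    ("access to health care services", "community", "healthcare system", "census"),
]


def uncovered_cells(selections):
    """Return the (level, domain) cells that have no selected predictor yet."""
    covered = set()
    for name, level, domain, _source in selections:
        if level not in LEVELS or domain not in DOMAINS:
            raise ValueError(f"{name!r} is not placed in a framework cell")
        covered.add((level, domain))
    return [cell for cell in product(LEVELS, DOMAINS) if cell not in covered]


gaps = uncovered_cells(selected)
print(len(gaps), "of", len(LEVELS) * len(DOMAINS), "cells have no predictor yet")
```
]]></preformat>
        <p>Listing the empty cells is one possible prompt for the later steps of the protocol, in which additional levels, domains, and contextual-level predictors are sought.</p>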
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion, conclusion, and future work</title>
      <p>
        In this work, we developed (1) a reporting protocol for RF variable and data source
selection, and (2) an initial version of the ODVDS ontology for annotating the
documentation of the reporting protocol. However, our current work is limited: (1) the
protocol was developed based on one case study, where the coverage of RFs, cancer outcomes,
and data integration scenarios is limited; (2) we only reviewed a limited number of
existing reporting guidelines, such as the Checklist for One Health Epidemiological
Reporting of Evidence (COHERE) [11] and the Strengthening the Reporting of Observational
Studies in Epidemiology (STROBE) statement [12]. A more systematic review of
existing reporting guidelines for variable and data source selection, data integration
processes, statistical methods, and analysis plans in health research is warranted to expand the
reporting protocol. A good resource for these reporting guidelines is the Enhancing the
QUAlity and Transparency Of health Research (EQUATOR) network; and (3) the current
classes and properties in the initial ODVDS cover only RF variable and data source
selection. With the expansion of the reporting protocol (e.g., to include data integration
scenarios and processing of the data), new classes and properties are needed to fully represent the
entire IDA process. Further, tools associated with the reporting protocol and
ODVDS are needed, as our ultimate goal is to help other investigators “automatically”
reproduce the analytic steps, especially the data integration and processing steps.
Nevertheless, our ontology-based documentation approach provides a good start for
researchers to document the RF variable and data source selection process in their
multi-level IDAs. Clear documentation is necessary to help researchers communicate
their studies to other investigators, help others reproduce the analytic datasets, and
improve transparency and scientific reproducibility.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. CDC.
          <source>Statistics for Different Kinds of Cancer</source>
          .
          <year>2017</year>
          . https://www.cdc.gov/cancer/dcpc/data/types.htm. Accessed 1 Jul 2019.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. American Cancer Society.
          <source>Cancer Facts &amp; Figures 2018</source>
          . Atlanta: American Cancer Society;
          <year>2018</year>
          . https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2018/cancer-facts-and-figures-2018.pdf. Accessed 28 Jun 2019.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. American Cancer Society.
          <source>Cancer Facts &amp; Figures 2017</source>
          .
          <year>2017</year>
          . https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2017/cancer-facts-and-figures-2017.pdf. Accessed 1 Jul 2019.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Miller</surname>
            <given-names>KD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siegel</surname>
            <given-names>RL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>CC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mariotto</surname>
            <given-names>AB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kramer</surname>
            <given-names>JL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rowland</surname>
            <given-names>JH</given-names>
          </string-name>
          , et al.
          <article-title>Cancer treatment and survivorship statistics, 2016</article-title>
          .
          <source>CA Cancer J Clin</source>
          .
          <year>2016</year>
          ;
          <volume>66</volume>
          :
          <fpage>271</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. NIMHD.
          <source>NIMHD Research Framework</source>
          . https://www.nimhd.nih.gov/about/overview/research-framework.html. Accessed 28 Jun 2019.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Guo</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bian</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Modave</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>Q</given-names>
          </string-name>
          ,
          <string-name>
            <surname>George</surname>
            <given-names>TJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prosperi</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          <article-title>Assessing the effect of data integration on predictive ability of cancer survival models</article-title>
          .
          <source>Health Informatics J</source>
          .
          <year>2019</year>
          :
          <fpage>1460458218824692</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zhang</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>Q</given-names>
          </string-name>
          ,
          <string-name>
            <surname>George</surname>
            <given-names>TJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenkman</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Modave</surname>
            <given-names>F</given-names>
          </string-name>
          , et al.
          <article-title>An ontology-guided semantic data integration framework to support integrative data analysis of cancer survival</article-title>
          .
          <source>BMC Med Inform Decis Mak</source>
          .
          <year>2018</year>
          ;
          <volume>18</volume>
          . doi:10.1186/s12911-018-0636-4.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. United States Department of Agriculture - Economic Research Service.
          <source>2010 Rural-Urban Commuting Area (RUCA) Codes</source>
          .
          <year>2019</year>
          . https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/documentation/. Accessed 8 Jul 2019.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. National Center for Health Statistics - Office of Analysis and Epidemiology.
          <source>NCHS Urban-Rural Classification Scheme for Counties</source>
          .
          <year>2017</year>
          . https://www.cdc.gov/nchs/data_access/urban_rural.htm#2013_Urban-Rural_Classification_Scheme_for_Counties. Accessed 8 Jul 2019.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Whetzel</surname>
            <given-names>PL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            <given-names>NF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            <given-names>NH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alexander</surname>
            <given-names>PR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nyulas</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tudorache</surname>
            <given-names>T</given-names>
          </string-name>
          , et al.
          <article-title>BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <year>2011</year>
          ;39 Web Server issue:
          <fpage>W541</fpage>
          -
          <lpage>545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Davis</surname>
            <given-names>MF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rankin</surname>
            <given-names>SC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schurer</surname>
            <given-names>JM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cole</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conti</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabinowitz</surname>
            <given-names>P</given-names>
          </string-name>
          , et al.
          <article-title>Checklist for One Health Epidemiological Reporting of Evidence (COHERE)</article-title>
          .
          <source>One Health Amst Neth</source>
          .
          <year>2017</year>
          ;
          <volume>4</volume>
          :
          <fpage>14</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>von Elm</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altman</surname>
            <given-names>DG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Egger</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pocock</surname>
            <given-names>SJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gøtzsche</surname>
            <given-names>PC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandenbroucke</surname>
            <given-names>JP</given-names>
          </string-name>
          , et al.
          <article-title>[The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies]</article-title>
          .
          <source>Rev Esp Salud Publica</source>
          .
          <year>2008</year>
          ;
          <volume>82</volume>
          :
          <fpage>251</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>