<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automating Population Health Studies Through Semantics and Statistics</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy NY 12180</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>With the rapid development of the Semantic Web, machines are able to understand the contextual meaning of data, including in the field of automated semantics-driven statistical reasoning. This paper introduces a semantics-driven automated approach for solving population health problems with descriptive statistical models. A fusion of semantic and machine learning techniques enables our semantically-targeted analytics framework to automatically discover informative subpopulations that have subpopulation-specific risk factors significantly associated with health conditions such as hypertension and type II diabetes. Based on our health analysis ontology and knowledge graphs, the semantically-targeted analysis automated architecture allows analysts to rapidly and dynamically conduct studies for different health outcomes, risk factors, cohorts, and analysis methods; it also lets the full analysis pipeline be modularly specified in a reusable domain-specific way through the usage of knowledge graph cartridges, which are application-specific fragments of the underlying knowledge graph. We evaluate the semantically-targeted analysis framework for risk analysis using the National Health and Nutrition Examination Survey and conclude that this framework can be readily extended to solve many different learning and statistical tasks, and to exploit datasets from various domains in the future.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Machine Learning</kwd>
        <kwd>Semantic Representation</kwd>
        <kwd>Statistical Data and Metadata Publication</kwd>
        <kwd>Population Health</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Population health strives to improve the health outcomes of subject groups
through the analysis of enormous health-related datasets collected from
members of these groups [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</title>
      <p>With the great advancements in data analytics and the increasing scope of
population health datasets, accurate use of these data and statistics will be
required to monitor and improve population-wide health situations. To understand
the relationships between population health determinants and outcomes,
observational studies are performed on large patient databases.</p>
      <p>
        These databases include electronic health records and ongoing
population-wide surveys, such as the National Health and Nutrition Examination Survey
(NHANES, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) studied here. Run by the National Center for Health Statistics,
NHANES examines about 5000 subjects a year and serves as a primary data
resource for population health studies. These studies, however, often suffer from
a limited scope, and many studies may require the same repeated domain-specific
data preparation procedures. The objective of a study might be confined to a
single health condition, a small number of risk factors, and a manually-chosen
subject cohort.
      </p>
      <p>
        In this work, we present a framework, semantically targeted analytics (STA),
for automatically generating population health statistical analyses. In order to
overcome limitations on study scope, we develop a semantic representation for
knowledge from key domains: survey design and analysis, health, and data
analytics. Integration of each of these domains via a consistent standard [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is
necessary for our system to formulate and answer meaningful questions. We
represent our knowledge as a knowledge graph (KG1), containing terms defined by
domain-specific best-practice ontologies.
      </p>
      <p>
        When subject cohorts are no longer manually chosen, there is no guarantee
that a linear statistical model will be sufficient to explain associations found
in population health datasets. Thus, we utilize the supervised cadre model
(SCM, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), a machine learning technique that automatically discovers
informative subpopulations in datasets. Within these subpopulations,
associations between response variables and features are approximately
linear. The SCM has already been applied to predictive analytics and precision
population health [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; in Section 4, we integrate the SCM with STA.
      </p>
      <p>In STA, semantics encodes, captures, and isolates the domain knowledge
needed to model study definitions, statistical techniques, and data. A key
component is the knowledge graph cartridge (hereafter cartridge): an
application-specific subgraph of an underlying KG. Cartridges, further described in Section
3.2, are a way to express special-purpose, application-specific subgraphs that
augment the graph for analysis. They are implemented as RDF KGs and enable
an automated "plug and play" architecture. Further, cartridges are used either
as input, when analysts choose to load them to perform a novel risk study, or
as output, when the study findings are automatically written into them. Our
cartridges are subgraphs that contribute to a larger analysis graph. Additionally,
our output cartridges define the results in a way that is consistent with the input
cartridges and contribute to the modularity of STA.</p>
      <p>To model and represent the components of our cartridges, we built a Health
Analytics Ontology (HAO2).</p>
    </sec>
    <sec id="sec-3">
      <title>1 Here, KG refers to a graph that describes real-world entities and their interrelations,</title>
      <p>
        while enumerating the possible classes and relations of these entities [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>2 The HAO is hosted at https://github.com/TheRensselaerIDEA/hao-ontology.</title>
      <p>HAO models the domain knowledge, analytics knowledge, and other analytics
pipeline components necessary for population health analysis.</p>
      <p>The main contributions of this paper are a semantic representation of
population health analysis workflows and results as knowledge graph cartridges, the
integration of this representation with precision machine learning techniques for
the discovery of subpopulation-specific risk factors, and the demonstration of how
the STA framework enables rigorous investigation of population health problems.
Via cartridges, our STA framework can analyze, interpret, and report studies
performed on a wide variety of chronic health conditions and potential risk factors.
In Section 5, we present and examine the discoveries found by applying STA to
the task of subpopulation-specific identification of risk factors associated with
prediabetes and increased total cholesterol levels. Our framework successfully
identifies risk factors that are not picked up by standard population-level risk
analysis.</p>
      <sec id="sec-4-1">
        <title>Related work</title>
        <p>
          Our primary inspiration for the KG cartridge is the Oracle database system's
notion of a data cartridge [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]; similarities can also be found in the theory of modular
ontology design [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and cheminformatics chemical cartridges [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Just as with the
data cartridge, our cartridges are mechanisms for extending the capabilities of
some underlying system. We differ in how our underlying system is implemented:
data cartridges extend an Oracle server, but KG cartridges are implemented as
and extend knowledge graphs while integrating with data analytics models.
        </p>
        <p>
          HAO is inspired by several existing analytics-focused ontologies, including
the Data Science Ontology (DSO) associated with the semantic flow graph [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
approach and the analytics ontology associated with the ScalaTion [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
framework. In semantic flow graphs, functions in an analysis script are mapped to
abstract concepts defined in the DSO; graph visualization allows for
language-independent workflow summarization. With ScalaTion, axioms and an analytics
model taxonomy allow model selection to be performed via inference. In STA
and HAO, we focus on the problem of domain-guided subpopulation-based health
analysis in survey-weighted data [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Note that subpopulation discovery and
representation require descriptive rather than predictive modeling workflows.
        </p>
        <p>
          With a similar goal as Automated Machine Learning (AutoML, [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]),
especially the Automatic Statistician [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we aim to automate end-to-end statistical
analysis. Our work differs from existing AutoML in several key areas. First, STA
utilizes domain-dependent analysis techniques. The Automatic Statistician does
not represent domain knowledge semantically; also, much of its work has been
applied to nonparametric Bayesian models, such as Gaussian processes for time
series [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In contrast, STA utilizes a variety of parametric statistical models,
and we focus on the special case of data generated by a complex survey
design. Unlike in a nonparametric model, the parameters of our models can act
as explainable summaries of discovered associations. Finally, we note that the
strategies of the Automatic Statistician or any other machine learning method
can be readily incorporated into STA by representation in the KG.
        </p>
        <sec id="sec-4-1-1">
          <title>Risk analysis in NHANES</title>
          <p>Algorithm 1 illustrates how the STA framework uses semantic structures realized
as cartridges to drive precision health subpopulation discovery and risk factor
identification. In STAGE I, the STA framework queries its input cartridges to
infer requirements for data preparation. This might entail filtering records for
subjects that satisfy study inclusion criteria, log-transforming right-skewed
variables, or constructing new variables based on supplied definitions. In STAGE
II, an array of SCMs is trained on the prepared cohort using different
hyperparameter configurations. A final model is determined by supplied model selection
metrics, e.g., the Bayesian Information Criterion (BIC). In STAGE III,
survey-weighted generalized linear models (GLMs) are trained on each discovered
subpopulation. The regression coefficients and log-odds ratios estimated by these
GLMs quantify the association between the supplied risk factor and response
variable. After STAGE II and STAGE III, model findings and subpopulation
characteristics are written to output cartridges for future reference.</p>
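          <p>The metric-driven selection step in STAGE II can be sketched in a few lines. The following Python fragment is a minimal illustration only, not the STA implementation; the candidate-fit dictionaries and their fields are hypothetical.</p>

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: lower values indicate a better
    trade-off between fit and model complexity."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def select_model(candidates):
    """Pick the candidate fit with the lowest BIC. Each candidate is a
    dict holding its log-likelihood, parameter count, and sample size."""
    return min(
        candidates,
        key=lambda c: bic(c["log_likelihood"], c["n_params"], c["n_obs"]),
    )

# Two hypothetical SCM fits on a 1,000-subject cohort: the 3-cadre model
# fits slightly better but pays a large complexity penalty.
fits = [
    {"M": 2, "log_likelihood": -512.3, "n_params": 40, "n_obs": 1000},
    {"M": 3, "log_likelihood": -509.8, "n_params": 60, "n_obs": 1000},
]
best = select_model(fits)  # selects the M=2 fit
```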
          <p>
            In Algorithm 1, we perform precision risk analysis for a single risk factor. In
practice, we repeat this process for many different categories of risk factors,
yielding a precision environment-wide association study (EWAS, [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]). Similarly, the
same risk factor can be tested against multiple potential response variables.
Output cartridges generated by analyses are written back to the knowledge graph,
where they are linked to the input cartridges used to generate them. This
linkage grants STA explainability: all details of provenance and execution steps are
captured, enabling detailed justifications for conclusions to be generated.
Storing each piece of data and metadata in a knowledge graph also enables analysis
reproducibility, since all details are kept together.
          </p>
          <p>We present the STA framework for addressing population health problems.
By varying models, variables, and the underlying datasets, we can adapt this
workflow for other tasks. Classification and multiple regression models are used
to identify potential risk factors: covariates that are strongly associated with the
response variable in the study cohort, after controlling for known confounders.</p>
          <p>
            NHANES is constructed with a multistage complex survey design (CSD, [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ])
for each year. An NHANES subject's role in the CSD is captured by their survey
weight, stratum, and variance unit; the STA framework encodes these values in
the KG and then automatically utilizes them correctly in analyses. Since CSD
data are not i.i.d., incorporation of survey weights is necessary to attain unbiased
statistical estimates. SCMs in STAGE II of Algorithm 1 use survey weights, and
STAGE III uses the survey package in R to create survey-weighted GLMs that
incorporate the weights, strata, and variance units encoded in the KG.
          </p>
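          <p>The effect of survey weights on estimates can be illustrated with a toy computation. This Python sketch uses invented numbers; STA itself relies on the R survey package for such estimates.</p>

```python
def weighted_mean(values, weights):
    """Survey-weighted mean: each subject contributes in proportion to
    the number of population members they represent."""
    total_w = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_w

# Hypothetical cohort of two subjects: one from an oversampled group
# (weight 100) and one representing the remaining 900 population members.
tc = [180.0, 240.0]   # total cholesterol measurements (invented)
w  = [900.0, 100.0]   # survey weights (invented)

unweighted = sum(tc) / len(tc)     # 210.0, biased toward the oversample
weighted   = weighted_mean(tc, w)  # 186.0, the design-unbiased estimate
```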
        </sec>
        <sec id="sec-4-1-2">
          <title>Serialized components for automatic analysis</title>
          <p>Minimal necessary analysis components from the HAO are stored in modular
serializations called cartridges. A cartridge is a subgraph containing
application- and analysis-specific entities. Cartridges can be edited to include additional
elements, thereby enabling the flexibility to address a range of problems with
minimal modification. In Section 5, we demonstrate this versatility. In STA,
cartridges are loaded and modified as the user constructs their risk analysis study.</p>
          <p>Algorithm 1: STA SCM analysis for subpopulation-specific or precision risk
analysis of a single risk factor.
STAGE I: DATA PREPARATION
  Select all subjects satisfying cohort-cartridge's inclusion criteria and store as cohort
  Query parameters-cartridge and risk-factor-cartridge for preprocessing techniques and apply to cohort
  Query response-cartridge and risk-factor-cartridge for necessary control-variables
  Query risk-factor-cartridge for risk-factor
  Query response-cartridge for response-variable
  Query parameters-cartridge for model-selection metric
  Calculate population-level summary statistics of cohort and write to subpopulation-cartridge
STAGE II: SUBPOPULATION DISCOVERY
  for every hyperparameter configuration in parameters-cartridge do
    Train SCM on cohort using control-variables and risk-factor to predict response-variable
    Calculate SCM's metric value
  end
  Identify SCM with optimal metric value and write its optimal hyperparameters and parameters to model-cartridge
  Take optimal SCM and write to model-cartridge, serialized as a pickle
STAGE III: RISK MODELING
  for every subpopulation discovered by SCM do
    Select members of cohort belonging to subpopulation
    Calculate summary statistics of subpopulation and write to subpopulation-cartridge
    Train survey-weighted GLM on subpopulation using control-variables and risk-factor to predict response-variable
    Extract risk-factor's p-value, regression coefficient, and regression coefficient standard error from GLM and write to results-cartridge
  end
Implemented variants include examining many potential risk factors in succession
via an EWAS, as well as the addition of STAGE IV: REPORT GENERATION, in
which the output cartridges are used to automatically create a report describing
the findings. Reports use text, tables in the style of Table 3, and figures in the
style of Fig. 2.</p>
          <p>In the Health Analysis Ontology (HAO), we support modeling of processes,
components, models, variables, and factors involved in a health analysis pipeline
such as the one described in Algorithm 1. The HAO reuses classes and properties
from existing ontologies, listed in Table 2, but we also found it necessary to
introduce new terminology.</p>
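          <p>The control flow of Algorithm 1 can be sketched as a small driver function. This is a hedged illustration in Python: cartridges are plain dictionaries and the SCM and GLM training steps are injected stand-ins, not the framework's actual interfaces.</p>

```python
def run_sta(cohort_cart, response_cart, risk_cart, params_cart,
            train_scm, fit_glm):
    """Minimal sketch of Algorithm 1's three stages. Cartridge keys and
    the callables train_scm / fit_glm are hypothetical stand-ins."""
    # STAGE I: data preparation driven by the input cartridges
    cohort = [s for s in cohort_cart["subjects"] if cohort_cart["include"](s)]
    output = {"subpopulation": {"n": len(cohort)}, "model": {}, "results": []}

    # STAGE II: subpopulation discovery; keep the fit with the best metric
    fits = [train_scm(cohort, hp) for hp in params_cart["hyperparams"]]
    best = min(fits, key=params_cart["metric"])
    output["model"]["hyperparams"] = best["hyperparams"]

    # STAGE III: one survey-weighted GLM per discovered subpopulation
    for cadre in best["cadres"]:
        members = [s for s in cohort if s["id"] in cadre]
        output["results"].append(
            fit_glm(members, risk_cart["risk_factor"],
                    response_cart["response_variable"])
        )
    return output
```

<p>A study is then one call to run_sta with a full set of input cartridges; swapping a single cartridge dictionary re-runs the whole pipeline for a new question.</p>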
          <p>We represent the ontology using OWL and introduce property associations
between classes using owl:Restrictions. Overall, HAO provides a vocabulary
necessary to model the reusable components of an analysis (sio:Analysis)
implemented by an analysis workflow (hao:AnalysisWorkflow) that we store in
cartridges (hao:Cartridge). Cartridges serve as containers that encode
information about specific portions of a workflow. For example, a response cartridge
(hao:ResponseCartridge) contributes to a high-level overview of a model with
entities (modeled via sio:hasAttribute) such as the analysis question, response
variable, type of model, etc. The HAO schema allows for the representation of
cartridges as named knowledge graphs in the TriG format3.</p>
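          <p>As a rough illustration of the named-graph idea, the snippet below assembles one named graph in simplified TriG syntax. The prefix ex: and the property names are invented for the example and are not actual HAO terms.</p>

```python
def to_trig(graph_name: str, triples) -> str:
    """Serialize one named graph in (simplified) TriG syntax: the graph
    name, then each subject/predicate/object triple inside braces."""
    lines = [f"{graph_name} {{"]
    for s, p, o in triples:
        lines.append(f"    {s} {p} {o} .")
    lines.append("}")
    return "\n".join(lines)

# A toy results cartridge recording one discovered association.
cartridge = to_trig("ex:resultsCartridge42", [
    ("ex:assoc1", "ex:riskFactor", '"blood lead"'),
    ("ex:assoc1", "ex:pValue", '"0.004"'),
])
```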
          <p>The HAO ontology only imports the SemanticScience Integrated Ontology
(SIO), as we reuse several classes from SIO and utilize their object properties
to define associations between classes.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3 Learn more at https://www.w3.org/TR/trig/</title>
      <p>
        For other terms that we reuse from large
ontologies such as the National Cancer Institute Thesaurus (NCIT), the
Statistical Methods Ontology (STATO), and the Ontology of Biological and Clinical
Statistics (OBCS), we apply the Minimum Information to Reference an
External Ontology Term (MIREOT, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) technique to include terms. HAO
combines terminology from statistical, scientific, and biomedical ontologies to model
a reusable and modular health analysis pipeline. Additionally, to provide
information on the intended usage of classes, we maintain metadata such as
definitions (skos:definition) and descriptions (rdfs:description) on our ontology classes.
We have tested the logical correctness of HAO by reasoning with the HermiT
reasoner [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] in Protege [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The HAO ontology can be explored via online
documentation4 generated using the Widoco [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] tool.
      </p>
      <sec id="sec-5-1">
        <title>Cartridges</title>
        <p>Cartridges can be grouped into two categories, input and output, with further
subdivisions given in Table 1. Fig. 1 gives a high-level summary of the cartridge
framework. In practice, cartridges are implemented as named graph collections
(in TriG format) encapsulating instances of ontology classes that, when grouped
together, represent different modules of an analysis workflow. Further, cartridges
are constructed using terms from the ontologies listed in Table 2. Domain-specific
choices (e.g., choice of confounders or cohort inclusion criteria) about cartridge
contents are adapted from published studies and linked with provenance. In the
case that outdated or inaccurate knowledge is retired, this provenance shows
which cartridges need to be updated.</p>
        <p>Currently, input cartridges must be manually defined by domain specialists,
but output cartridges are generated automatically after analysis. Minimal
modification is needed to allow an input cartridge to be applied to a different analytics
question. Cartridges can be edited to allow for the flexible tailoring of a health
analysis pipeline to discover new subpopulations (stato:0000203 - cohort) and to
identify new outcomes or test different response variables (hao:TargetVariable). For
example, creating a new analysis of hypertension based on a type 2 diabetes
analysis requires only a simple edit of the response cartridge; the other input
cartridges remain the same. We maintain analysis-related concepts in HAO,
and for cartridges such as the subpopulation cartridge that require domain-specific
terminology, we directly reference terms from ontologies in the field within the
cartridge. Additionally, as shown in Fig. 1, cartridges contain links to the other
cartridges that were used to generate them, to allow for easy traversal of all the
components of a workflow.</p>
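        <p>The hypertension-from-diabetes example above amounts to a cartridge swap. In this hypothetical Python fragment, cartridges are plain dictionaries; the keys and variable contents are illustrative only.</p>

```python
from copy import deepcopy

def adapt_study(input_cartridges: dict, new_response: dict) -> dict:
    """Reuse all input cartridges from an existing study, swapping only
    the response cartridge (keys are illustrative, not HAO terms)."""
    study = deepcopy(input_cartridges)
    study["response"] = new_response
    return study

# Hypothetical published study of type 2 diabetes risk.
diabetes_study = {
    "response": {"variable": "glycohemoglobin", "controls": ["age", "BMI"]},
    "cohort": {"inclusion": "all NHANES subjects"},
    "risk_factor": {"category": "heavy metals"},
}

# A hypertension study reuses the cohort and risk-factor cartridges unchanged.
hypertension_study = adapt_study(
    diabetes_study,
    {"variable": "systolic blood pressure", "controls": ["age", "BMI"]},
)
```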
        <sec id="sec-5-1-1">
          <title>Precision risk with supervised cadres</title>
          <p>
            Our method for precision risk is the supervised cadre model [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], which
simultaneously discovers subpopulations and learns their risk models.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4 https://therensselaeridea.github.io/hao-ontology/WidocoDocumentation/doc/index-en.html</title>
      <p>Table 1. Types of cartridges used in the STA framework.
Input cartridges:
Response - Health condition concepts and background domain axioms necessary to model a given analysis
Cohort - Inclusion criteria used to determine if a given subject may be included in the user's study, which can be chosen or adapted from existing studies
Risk factor - How categories of semantically-similar risk factors should be modeled
Parameters - Rules to complete the chosen analysis workflow and potential hyperparameter configurations for the chosen model
Output cartridges:
Model - The hyperparameters used to train a model, the parameter estimates learned during training, and the rules by which it is applied to new observations
Subpopulation - Summary statistics characterizing discovered subpopulations, including within-subpopulation variable means and rates
Results - Quantification of subpopulation-specific discovered associations between the risk factor and the response variable using regression coefficients, standard errors, and p-values</p>
      <p>Table 2. Ontologies currently used in STA. The usage of these ontologies is described in Sections 3.1 and 3.2.
Health Analysis Ontology (hao) - Inform analysis design, summarize analysis results for comparison, and generate reports
Study Cohort Ontology (sco) - Represent cohort variables and control/intervention groups in Cohort Summary Tables of observational case studies and clinical trials
Children's Health Exposure Analysis Resource (chear) - Represent the inclusion of environmental exposures in health research
The Statistical Methods Ontology (stato) - Represent concepts and properties related to statistical methods and analysis
Semanticscience Integrated Ontology (sio) - Provide an upper-level ontology (types, relations) for consistent knowledge representation across physical, process, and information entities
National Cancer Institute Thesaurus (ncit) - NCIT is an authoritative reference terminology in the cancer domain, but in our case we leverage its broad coverage and use it to refer to terminology in model-related parameters
Ontology for Biomedical Investigations (obi) - Annotate biomedical investigations, including the study design, protocols used, the data generated, and the types of analysis performed on the data
The PROV Ontology (prov) - Model provenance information for different applications and domains
Ontology of Biological and Clinical Statistics (obcs) - Represent additional biostatistics terms not in OBI
DC Terms (dct) - Specify all metadata terms maintained by the Dublin Core Metadata Initiative
Simple Knowledge Organization System (skos) - Define the new terms in the HAO</p>
    </sec>
    <sec id="sec-10">
      <p>We use the terms subpopulation-specific and precision interchangeably. The SCM is applied
during STAGE II of Algorithm 1. Subpopulations, which we call cadres, are subsets
of the population defined with respect to a cadre-assignment rule learned by
the SCM. Subjects in the same cadre have the same association with a given
risk factor. In STA, the chosen parameter and response cartridges set up the
appropriate SCM and describe how to tune its hyperparameters. Optimal model
parameters and hyperparameters are written to a model cartridge, which can be
applied to novel subject records to determine their cadre.</p>
      <p>We outline the SCM for multivariate regression and binary classification. When
trained on a set of subject records {x_n} ⊂ R^P and response values {y_n}, the
SCM divides the observations into a set of M cadres. Each cadre m is
characterized by a center c_m ∈ R^P and a linear regression function e_m parameterized by
weights w_m ∈ R^P and a bias w_m0 ∈ R. New observations x have (for multivariate
regression) an aggregate regression score (e.g., a subject's expected total
cholesterol level) or (for binary classification) an aggregate risk score (e.g., the logit
of their probability of having prediabetes) given by
f(x) = Σ_{m=1}^{M} g_m(x) e_m(x),
where g_m(x) is the probability x belongs to cadre m, and e_m(x) is the regression
or risk score for x were it known to belong to cadre m. These have the form
g_m(x) = exp(−γ ||x − c_m||_d^2) / Σ_{m'} exp(−γ ||x − c_{m'}||_d^2) and e_m(x) = (w_m)^T x + w_m0.</p>
      <p>
        Here, ||z||_d = (Σ_p |d_p| (z_p)^2)^{1/2} is a seminorm parameterized by d ∈ R^P, and
γ &gt; 0 is a hyperparameter. SCM parameters are obtained by applying
stochastic gradient descent to a survey-weighted loss function based on mean squared
error or logistic loss, along with elastic net [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] regularization to improve
interpretability. The hyperparameters are chosen via a grid-search procedure and
recorded in the chosen parameters cartridge. Compared to other nonlinear
machine learning techniques, SCMs are more interpretable because of their
within-subpopulation linearity. Examining the properties of each subpopulation and
linear prediction model can yield significant insights. We have prototyped a
system using the shiny R package that interacts with the user to design and conduct
a study and then automatically generates interactive reports with text and
figures explaining the results, driven by the results cartridges and other external
domain-specific linked data. Sample results are presented in the next section.
      </p>
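      <p>The aggregate score f(x) can be computed directly from the cadre centers, weights, and seminorm parameters. The sketch below follows the reconstructed equations, writing the sharpness hyperparameter γ as gamma; it is an illustration, not the authors' implementation.</p>

```python
import math

def seminorm_sq(z, d):
    """||z||_d^2 = sum_p |d_p| * (z_p)^2, the squared seminorm."""
    return sum(abs(dp) * zp * zp for dp, zp in zip(d, z))

def scm_score(x, centers, weights, biases, d, gamma):
    """f(x) = sum_m g_m(x) * e_m(x): softmax cadre memberships g_m over
    seminorm distances to each center, times per-cadre linear scores e_m."""
    dists = [seminorm_sq([xi - ci for xi, ci in zip(x, c)], d) for c in centers]
    exps = [math.exp(-gamma * dist) for dist in dists]
    total = sum(exps)
    g = [e / total for e in exps]  # cadre-membership probabilities
    e_scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
                for w, b in zip(weights, biases)]
    return sum(gm * em for gm, em in zip(g, e_scores))
```

<p>With a single cadre, g_1(x) = 1 and the score reduces to the ordinary linear model (w_1)^T x + w_10; with several cadres, the softmax memberships blend the per-cadre linear scores.</p>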
      <sec id="sec-10-1">
        <title>Results</title>
        <p>
          We present two risk analyses to identify subpopulation-specific environmental
exposure factors associated with total cholesterol (TC) and prediabetic-or-worse
glycohemoglobin levels (prediabetes). Elevated levels of serum lipids such as
TC are recognized as risk factors for cardiovascular disease, and associations
between TC and environmental exposure levels were identified previously [
          <xref ref-type="bibr" rid="ref2 ref23">2, 23</xref>
          ].
Other work also discovered associations between diabetes and environmental
risk factors [
          <xref ref-type="bibr" rid="ref10 ref16">16, 10</xref>
          ]. Thus, it is worthwhile to identify subpopulation-specific risk
factors associated with TC and prediabetes to improve health situations.
        </p>
        <p>
          We chose a set of input cartridges, shown in Table 3, for TC using control
variables from prior studies [
          <xref ref-type="bibr" rid="ref16 ref23">16, 23</xref>
          ]. We extract 201 environmental exposure
potential risk factors from NHANES 1999 to 2014, grouped into 17 classes such as
phthalates (PHT) or polyaromatic hydrocarbons (PAH). Each class of potential
risk factors has its own cartridge that describes its usage in analytics models.
However, on the GitHub repository5 we only host an example of the heavy metals
risk factor cartridge used in this analysis. The number of survey subjects that
have measurements for a given risk factor ranges from 1,406 to 15,218.
        </p>
        <p>With our input cartridges, we run Algorithm 1 for every potential risk factor.
Each risk factor is included in a single SCM that discovers subpopulations in
the data. In STAGE II of Algorithm 1, each discovered subpopulation has its
summary statistics written to a subpopulation cartridge to be stored in the
KG. Characteristics of subpopulations with significant positive associations are
visualized in Fig. 2A. In STAGE III of Algorithm 1, each subpopulation has a
survey-weighted GLM trained on it, and the risk factor's regression coefficient
and p-value are extracted. Due to the large number of hypothesis tests, false
discovery correction is applied to these p-values before assessing significance
at a threshold specified in the study's parameters cartridge (here, α = 0.02).</p>
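        <p>The paper does not name the specific false discovery correction used; assuming the common Benjamini-Hochberg procedure, the correction step can be sketched as follows.</p>

```python
def benjamini_hochberg(p_values, alpha=0.02):
    """Return the indices of hypotheses rejected at FDR level alpha using
    the Benjamini-Hochberg step-up rule (an assumed choice of correction)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # reject while the rank-th smallest p-value is at most rank*alpha/m
        if p_values[i] * m / rank > alpha:
            continue
        k_max = rank
    return sorted(order[:k_max])

# Toy example: four risk-factor tests, three survive correction at alpha=0.02.
significant = benjamini_hochberg([0.001, 0.30, 0.004, 0.015], alpha=0.02)
```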
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5 Visit: https://github.com/TheRensselaerIDEA/hao-ontology</title>
      <p>Table 3. Chosen input cartridges for the TC risk study. The prediabetes risk study uses the same cohort, risk factor, and parameters cartridges, with a different response cartridge.
Response - TC is a continuous response variable; subjects' age, Body Mass Index (BMI), Poverty Income Ratio (PIR), smoking habits, drinking habits, gender, marital status, and education level should be controlled for
Cohort - All available NHANES subjects
Risk factor - 201 environmental exposure risk factors divided into 17 categories
Parameters - Standardize risk factor measurements; train models with M = 1, 2, and 3 cadres and choose the best one using BIC for model selection; significance threshold of α = 0.02 for GLM hypothesis tests</p>
      <p>We have presented a semantically-targeted analytics framework via which risk
factors specific to a subpopulation may be discovered in datasets. With the
supervised cadre machine learning method, we simultaneously discover
subpopulations and identify their significant risk factors. To support this we built a novel
Health Analysis Ontology that captures analytics and health domain knowledge.</p>
      <p>HAO and other ontologies provide structure for defining the cartridges that are
used for modular analysis pipelines. We leverage this semantic modeling to
dynamically construct and execute a risk model and interpret results. Using STA,
the system provides explainable insights for future population health studies in
a scientifically rigorous and reproducible way.</p>
      <p>In STA, statistical findings and parameters are encoded in results cartridges
and written back to the KG, enabling retrieval for further study. Cartridges
provide semantic extensions that enable a KG system to apply inference to solve
domain-specific analytics problems. By publishing the results cartridges, studies
become reproducible and explainable with provenance. Researchers with new
analysis methods can readily compare results with prior studies, using the same
workflow on the same problems. They can adapt existing peer-reviewed studies
to new diseases by editing cartridges in published workflows.</p>
      <p>We report here on semantically-targeted analytics applied to population
health studies that rapidly enables new findings from the ongoing NHANES
database. Using new cartridges, STA can readily be adapted to other types of
statistical analysis on other data sources such as electronic health care records.</p>
      <sec id="sec-11-1">
        <title>Acknowledgements</title>
        <p>Created with support by IBM Research AI through the AI Horizons Network.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Al-Baltah</surname>
            ,
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghani</surname>
            ,
            <given-names>A.A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>W.N.W.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A classification of semantic conflicts in heterogeneous web services at message level</article-title>
          .
          <source>Turkish Journal of Electrical Engineering &amp; Computer Sciences</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aminov</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavuk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpenter</surname>
            ,
            <given-names>D.O.</given-names>
          </string-name>
          :
          <article-title>Analysis of the effects of exposure to polychlorinated biphenyls and chlorinated pesticides on serum lipid levels in residents of Anniston, Alabama</article-title>
          .
          <source>Environ Health</source>
          <volume>12</volume>
          ,
          <issue>108</issue>
          (Dec
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          Centers for Disease Control and Prevention (CDC):
          <article-title>National Health and Nutrition Examination Survey</article-title>
          (
          <year>2017</year>
          ), http://www.cdc.gov/nchs/nhanes/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Courtot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lister</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malone</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schober</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brinkman</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruttenberg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>MIREOT: The minimum information to reference an external ontology term</article-title>
          .
          <source>Applied Ontology</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>23</fpage>
          –
          <lpage>33</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Why "population health"?</article-title>
          .
          <source>Canadian Journal of Public Health</source>
          <volume>86</volume>
          (
          <issue>3</issue>
          ),
          <volume>162</volume>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>WIDOCO: a wizard for documenting ontologies</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <fpage>94</fpage>
          –
          <lpage>102</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gietz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>What is a data cartridge?</article-title>
          In:
          <source>Data Cartridge Developer's Guide</source>
          , chap. 1, pp.
          <fpage>1</fpage>
          –
          <lpage>17</lpage>
          .
          Oracle Corporation
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Heeringa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berglund</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Applied Survey Data Analysis</article-title>
          . Chapman and Hall/CRC, 2nd edn. (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Knublauch</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergerson</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>The Protégé OWL plugin: An open development environment for Semantic Web applications</article-title>
          . In: International Semantic Web Conference. pp.
          <fpage>229</fpage>
          –
          <lpage>243</lpage>
          . Springer (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Association of urinary cadmium with risk of diabetes: a meta-analysis</article-title>
          .
          <source>Environmental Science and Pollution Research</source>
          <volume>24</volume>
          (
          <issue>11</issue>
          ),
          <fpage>10083</fpage>
          –
          <lpage>10090</lpage>
          (Apr
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lloyd</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duvenaud</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosse</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Automatic construction and natural-language description of nonparametric regression models</article-title>
          .
          <source>In: Proc. of the Twenty-Eighth AAAI Conf</source>
          . pp.
          <fpage>1242</fpage>
          –
          <lpage>1250</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duret</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gualandi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peitsch</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pospisil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Building an R&amp;D chemical registration system</article-title>
          .
          <source>J Cheminform</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <volume>11</volume>
          (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>New</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          :
          <article-title>A precision environment-wide association study of hypertension via supervised cadre models</article-title>
          .
          <source>IEEE Journal of Biomedical and Health Informatics</source>
          (
          <year>2019</year>
          ), to appear
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>New</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breneman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          :
          <article-title>Cadre modeling: Simultaneously discovering subpopulations and predictive models</article-title>
          .
          <source>In: 2018 Intl. Joint Conf. on Neural Networks (IJCNN)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          (
          <year>July 2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Nural</surname>
            ,
            <given-names>M.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Automated Predictive Big Data Analytics Using Ontology Based Semantics</article-title>
          .
          <source>Int J Big Data</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>43</fpage>
          –
          <lpage>56</lpage>
          (Oct
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharya</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butte</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>An environment-wide association study (EWAS) on type 2 diabetes mellitus</article-title>
          .
          <source>PLOS ONE</source>
          <volume>5</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          (May
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Survey of modular ontology techniques and their applications in the biomedical domain</article-title>
          .
          <source>Integr Comput Aided Eng</source>
          <volume>16</volume>
          (
          <issue>3</issue>
          ),
          <fpage>225</fpage>
          –
          <lpage>242</lpage>
          (Aug
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Patterson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldini</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mojsilovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varshney</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          :
          <article-title>Teaching machines to understand data science code by semantic enrichment of dataflow graphs</article-title>
          . CoRR abs/1807.05691 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          ,
          <fpage>489</fpage>
          –
          <lpage>508</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Shearer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>HermiT: A highly-efficient OWL reasoner</article-title>
          .
          <source>In: Owled</source>
          . vol.
          <volume>432</volume>
          , p.
          <volume>91</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Steinrucken</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lloyd</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>The automatic statistician</article-title>
          .
          <source>In: Automatic Machine Learning: Methods, Systems</source>
          , Challenges. pp.
          <fpage>175</fpage>
          –
          <lpage>188</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          , et al.:
          <article-title>Taking human out of learning applications: A survey on automated machine learning</article-title>
          (
          <year>2018</year>
          ), https://arxiv.org/abs/1810.13306
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          , et al.:
          <article-title>Phthalate exposure and high blood pressure in adults: a cross-sectional study in China</article-title>
          .
          <source>Env. Sci. and Pollution Research</source>
          <volume>25</volume>
          (
          <issue>16</issue>
          ),
          <fpage>15934</fpage>
          –
          <lpage>15942</lpage>
          (Jun
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Regularization and variable selection via the elastic net</article-title>
          .
          <source>Journal of the Royal Stat. Society: Series B</source>
          <volume>67</volume>
          (
          <issue>2</issue>
          ),
          <fpage>301</fpage>
          –
          <lpage>320</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>