<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automating Population Health Studies Through Semantics and Statistics</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy NY 12180</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>With the rapid development of the Semantic Web, machines are able to understand the contextual meaning of data, including in the field of automated semantics-driven statistical reasoning. This paper introduces a semantics-driven automated approach for solving population health problems with descriptive statistical models. A fusion of semantic and machine learning techniques enables our semantically-targeted analytics framework to automatically discover informative subpopulations that have subpopulation-specific risk factors significantly associated with health conditions such as hypertension and type II diabetes. Based on our health analysis ontology and knowledge graphs, the semantically-targeted analysis automated architecture allows analysts to rapidly and dynamically conduct studies for different health outcomes, risk factors, cohorts, and analysis methods; it also lets the full analysis pipeline be modularly specified in a reusable domain-specific way through the usage of knowledge graph cartridges, which are application-specific fragments of the underlying knowledge graph. We evaluate the semantically-targeted analysis framework for risk analysis using the National Health and Nutrition Examination Survey and conclude that this framework can be readily extended to solve many different learning and statistical tasks, and to exploit datasets from various domains in the future.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Machine Learning</kwd>
        <kwd>Semantic Representation</kwd>
        <kwd>Statistical Data and Metadata Publication</kwd>
        <kwd>Population Health</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Population health strives to improve the health outcomes of subject groups
through the analysis of enormous health-related datasets collected from
members of these groups [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</title>
      <p>With the great advancements in data analytics and the increasing scope of
population health datasets, accurate use of these data and statistics will be
required to monitor and improve population-wide health situations. To understand
the relationships between population health determinants and outcomes,
observational studies are performed on large patient databases.</p>
      <p>
        These databases include electronic health records and ongoing
population-wide surveys, such as the National Health and Nutrition Examination Survey
(NHANES, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) studied here. Run by the National Center for Health Statistics,
NHANES examines about 5000 subjects a year and serves as a primary data
resource for population health studies. These studies, however, often suffer from
a limited scope, and many studies may require the same repeated domain-specific
data preparation procedures. The objective of a study might be confined to a
single health condition, a small number of risk factors, and a manually-chosen
subject cohort.
      </p>
      <p>
        In this work, we present a framework, semantically targeted analytics (STA),
for automatically generating population health statistical analyses. In order to
overcome limitations on study scope, we develop a semantic representation for
knowledge from key domains: survey design and analysis, health, and data
analytics. Integration of each of these domains via a consistent standard [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is
necessary for our system to formulate and answer meaningful questions. We
represent our knowledge as a knowledge graph (KG1), containing terms defined by
domain-specific best-practice ontologies.
      </p>
      <p>
        When subject cohorts are no longer manually chosen, there is no guarantee
that a linear statistical model will be sufficient to explain associations found
in population health datasets. Thus, we utilize the supervised cadre model
(SCM, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), a machine learning technique that automatically discovers
informative subpopulations in datasets. Within these subpopulations,
associations between response variables and features are approximately
linear. The SCM has already been applied to predictive analytics and precision
population health [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; in Section 4, we integrate the SCM with STA.
      </p>
      <p>In STA, semantics encodes, captures, and isolates the domain knowledge
needed to model study definitions, statistical techniques, and data. A key
component is the knowledge graph cartridge (hereafter cartridge): an
application-specific subgraph of an underlying KG. Cartridges, further described in Section
3.2, are a way to express special-purpose, application-specific subgraphs that
augment the graph for analysis. They are implemented as RDF KGs and enable
an automated "plug and play" architecture. Further, cartridges are used either
as input, when analysts choose to load them to perform a novel risk study, or
as output, when the study findings are automatically written into them. Our
cartridges are subgraphs that contribute to a larger analysis graph. Additionally,
our output cartridges define the results in a way that is consistent with the input
cartridges and contribute to the modularity of STA.</p>
      <p>To model and represent the components of our cartridges, we built a Health
Analytics Ontology (HAO2).</p>
    </sec>
    <sec id="sec-3">
      <title>1 Here, KG refers to a graph that describes real-world entities and their interrelations,</title>
      <p>
        while enumerating the possible classes and relations of these entities [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>2 The HAO is hosted at https://github.com/TheRensselaerIDEA/hao-ontology.</title>
      <p>HAO models the domain knowledge, analytics knowledge, and other analytics
pipeline components necessary for population health analysis.</p>
      <p>The main contributions of this paper are a semantic representation of
population health analysis workflows and results as knowledge graph cartridges, the
integration of this representation with precision machine learning techniques for
the discovery of subpopulation-specific risk factors, and the demonstration of how
the STA framework enables rigorous investigation of population health problems.
Via cartridges, our STA framework can analyze, interpret, and report studies
performed on a wide variety of chronic health conditions and potential risk factors.
In Section 5, we present and examine the discoveries found by applying STA to
the task of subpopulation-specific identification of risk factors associated with
prediabetes and increased total cholesterol levels. Our framework successfully
identifies risk factors that are not picked up by standard population-level risk
analysis.</p>
      <sec id="sec-4-1">
        <title>Related work</title>
        <p>
          Our primary inspiration for the KG cartridge is the Oracle database system's
notion of a data cartridge [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]; similarities can also be found in the theory of modular
ontology design [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and cheminformatics chemical cartridges [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Just as with the
data cartridge, our cartridges are mechanisms for extending the capabilities of
some underlying system. We differ in how our underlying system is implemented:
data cartridges extend an Oracle server, but KG cartridges are implemented as
and extend knowledge graphs while integrating with data analytics models.
        </p>
        <p>
          HAO is inspired by several existing analytics-focused ontologies, including
the Data Science Ontology (DSO) associated with the semantic flow graph [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
approach and the analytics ontology associated with the ScalaTion [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
framework. In semantic flow graphs, functions in an analysis script are mapped to
abstract concepts defined in the DSO; graph visualization allows for
language-independent workflow summarization. With ScalaTion, axioms and an analytics
model taxonomy allow model selection to be performed via inference. In STA
and HAO, we focus on the problem of domain-guided subpopulation-based health
analysis in survey-weighted data [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Note that subpopulation discovery and
representation require descriptive rather than predictive modeling workflows.
        </p>
        <p>
          With a similar goal as Automated Machine Learning (AutoML, [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]),
especially the Automatic Statistician [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we aim to automate end-to-end statistical
analysis. Our work differs from existing AutoML in several key areas. First, STA
utilizes domain-dependent analysis techniques. The Automatic Statistician does
not represent domain knowledge semantically; also, much of its work has been
applied to nonparametric Bayesian models, such as Gaussian processes for time
series [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In contrast, STA utilizes a variety of parametric statistical models,
and we focus on the special case of data generated by a complex survey
design. Unlike in a nonparametric model, the parameters of our models can act
as explainable summaries of discovered associations. Finally, we note that the
strategies of the Automatic Statistician or any other machine learning method
can be readily incorporated into STA by representation in the KG.
        </p>
        <sec id="sec-4-1-1">
          <title>Risk analysis in NHANES</title>
          <p>Algorithm 1 illustrates how the STA framework uses semantic structures realized
as cartridges to drive precision health subpopulation discovery and risk factor
identification. In STAGE I, the STA framework queries its input cartridges to
infer requirements for data preparation. This might entail filtering records for
subjects that satisfy study inclusion criteria, log-transforming right-skewed
variables, or constructing new variables based on supplied definitions. In STAGE
II, an array of SCMs is trained on the prepared cohort using different
hyperparameter configurations. A final model is determined by supplied model selection
metrics, e.g., the Bayesian Information Criterion (BIC). In STAGE III,
survey-weighted generalized linear models (GLMs) are trained on each discovered
subpopulation. The regression coefficients and log-odds ratios estimated by these
GLMs quantify the association between the supplied risk factor and response
variable. After STAGE II and STAGE III, model findings and subpopulation
characteristics are written to output cartridges for future reference.</p>
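          <p>The metric-driven selection step in STAGE II can be sketched in a few lines. The following Python fragment is a minimal illustration only, not the STA implementation; the candidate-fit dictionaries and their fields are hypothetical.</p>

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: lower values indicate a better
    trade-off between fit and model complexity."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def select_model(candidates):
    """Pick the candidate fit with the lowest BIC. Each candidate is a
    dict holding its log-likelihood, parameter count, and sample size."""
    return min(
        candidates,
        key=lambda c: bic(c["log_likelihood"], c["n_params"], c["n_obs"]),
    )

# Two hypothetical SCM fits on a 1,000-subject cohort: the 3-cadre model
# fits slightly better but pays a large complexity penalty.
fits = [
    {"M": 2, "log_likelihood": -512.3, "n_params": 40, "n_obs": 1000},
    {"M": 3, "log_likelihood": -509.8, "n_params": 60, "n_obs": 1000},
]
best = select_model(fits)  # selects the M=2 fit
```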
          <p>
            In Algorithm 1, we perform precision risk analysis for a single risk factor. In
practice, we repeat this process for many different categories of risk factors,
yielding a precision environment-wide association study (EWAS, [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]). Similarly, the
same risk factor can be tested against multiple potential response variables.
Output cartridges generated by analyses are written back to the knowledge graph,
where they are linked to the input cartridges used to generate them. This
linkage grants STA explainability: all details of provenance and execution steps are
captured, enabling detailed justifications for conclusions to be generated.
Storing each piece of data and metadata in a knowledge graph also enables analysis
reproducibility, since all details are kept together.
          </p>
          <p>We present the STA framework for addressing population health problems.
By varying models, variables, and the underlying datasets, we can adapt this
workflow for other tasks. Classification and multiple regression models are used
to identify potential risk factors: covariates that are strongly associated with the
response variable in the study cohort, after controlling for known confounders.</p>
          <p>
            NHANES is constructed with a multistage complex survey design (CSD, [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ])
for each year. An NHANES subject's role in the CSD is captured by their survey
weight, stratum, and variance unit; the STA framework encodes these values in
the KG and then automatically utilizes them correctly in analyses. Since CSD
data are not i.i.d., incorporation of survey weights is necessary to attain unbiased
statistical estimates. SCMs in STAGE II of Algorithm 1 use survey weights, and
STAGE III uses the survey package in R to create survey-weighted GLMs that
incorporate the weights, strata, and variance units encoded in the KG.
          </p>
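          <p>The effect of survey weights on estimates can be illustrated with a toy computation. This Python sketch uses invented numbers; STA itself relies on the R survey package for such estimates.</p>

```python
def weighted_mean(values, weights):
    """Survey-weighted mean: each subject contributes in proportion to
    the number of population members they represent."""
    total_w = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_w

# Hypothetical cohort of two subjects: one from an oversampled group
# (weight 100) and one representing the remaining 900 population members.
tc = [180.0, 240.0]   # total cholesterol measurements (invented)
w  = [900.0, 100.0]   # survey weights (invented)

unweighted = sum(tc) / len(tc)     # 210.0, biased toward the oversample
weighted   = weighted_mean(tc, w)  # 186.0, the design-unbiased estimate
```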
        </sec>
        <sec id="sec-4-1-2">
          <title>Serialized components for automatic analysis</title>
          <p>Minimal necessary analysis components from the HAO are stored in modular
serializations called cartridges. A cartridge is a subgraph containing
application- and analysis-specific entities. Cartridges can be edited to include additional
elements, thereby enabling the flexibility to address a range of problems with
minimal modification. In Section 5, we demonstrate this versatility. In STA,
cartridges are loaded and modified as the user constructs their risk analysis study.</p>
          <p>Algorithm 1: STA SCM analysis for subpopulation-specific or precision risk
analysis of a single risk factor.
STAGE I: DATA PREPARATION
  Select all subjects satisfying cohort-cartridge's inclusion criteria and store as cohort
  Query parameters-cartridge and risk-factor-cartridge for preprocessing techniques and apply to cohort
  Query response-cartridge and risk-factor-cartridge for necessary control-variables
  Query risk-factor-cartridge for risk-factor
  Query response-cartridge for response-variable
  Query parameters-cartridge for model-selection metric
  Calculate population-level summary statistics of cohort and write to subpopulation-cartridge
STAGE II: SUBPOPULATION DISCOVERY
  for every hyperparameter configuration in parameters-cartridge do
    Train SCM on cohort using control-variables and risk-factor to predict response-variable
    Calculate SCM's metric value
  end
  Identify SCM with optimal metric value and write its optimal hyperparameters and parameters to model-cartridge
  Take optimal SCM and write to model-cartridge, serialized as a pickle
STAGE III: RISK MODELING
  for every subpopulation discovered by SCM do
    Select members of cohort belonging to subpopulation
    Calculate summary statistics of subpopulation and write to subpopulation-cartridge
    Train survey-weighted GLM on subpopulation using control-variables and risk-factor to predict response-variable
    Extract risk-factor's p-value, regression coefficient, and regression coefficient standard error from GLM and write to results-cartridge
  end
Implemented variants include examining many potential risk factors in succession
via an EWAS, as well as the addition of STAGE IV: REPORT GENERATION, in
which the output cartridges are used to automatically create a report describing
the findings. Reports use text, tables in the style of Table 3, and figures in the
style of Fig. 2.</p>
          <p>In the Health Analysis Ontology (HAO), we support modeling of processes,
components, models, variables, and factors involved in a health analysis pipeline
such as the one described in Algorithm 1. The HAO reuses classes and properties
from existing ontologies, listed in Table 2, but we also found it necessary to
introduce new terminology.</p>
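          <p>The control flow of Algorithm 1 can be sketched as a small driver function. This is a hedged illustration in Python: cartridges are plain dictionaries and the SCM and GLM training steps are injected stand-ins, not the framework's actual interfaces.</p>

```python
def run_sta(cohort_cart, response_cart, risk_cart, params_cart,
            train_scm, fit_glm):
    """Minimal sketch of Algorithm 1's three stages. Cartridge keys and
    the callables train_scm / fit_glm are hypothetical stand-ins."""
    # STAGE I: data preparation driven by the input cartridges
    cohort = [s for s in cohort_cart["subjects"] if cohort_cart["include"](s)]
    output = {"subpopulation": {"n": len(cohort)}, "model": {}, "results": []}

    # STAGE II: subpopulation discovery; keep the fit with the best metric
    fits = [train_scm(cohort, hp) for hp in params_cart["hyperparams"]]
    best = min(fits, key=params_cart["metric"])
    output["model"]["hyperparams"] = best["hyperparams"]

    # STAGE III: one survey-weighted GLM per discovered subpopulation
    for cadre in best["cadres"]:
        members = [s for s in cohort if s["id"] in cadre]
        output["results"].append(
            fit_glm(members, risk_cart["risk_factor"],
                    response_cart["response_variable"])
        )
    return output
```

<p>A study is then one call to run_sta with a full set of input cartridges; swapping a single cartridge dictionary re-runs the whole pipeline for a new question.</p>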
          <p>We represent the ontology using OWL and introduce property associations
between classes using owl:Restrictions. Overall, HAO provides a vocabulary
necessary to model the reusable components of an analysis (sio:Analysis)
implemented by an analysis workflow (hao:AnalysisWorkflow) that we store in
cartridges (hao:Cartridge). Cartridges serve as containers that encode
information about specific portions of a workflow. For example, a response cartridge
(hao:ResponseCartridge) contributes to a high-level overview of a model with
entities (modeled via sio:hasAttribute) such as the analysis question, response
variable, type of model, etc. The HAO schema allows for the representation of
cartridges as named knowledge graphs in the TriG format3.</p>
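          <p>As a rough illustration of the named-graph idea, the snippet below assembles one named graph in simplified TriG syntax. The prefix ex: and the property names are invented for the example and are not actual HAO terms.</p>

```python
def to_trig(graph_name: str, triples) -> str:
    """Serialize one named graph in (simplified) TriG syntax: the graph
    name, then each subject/predicate/object triple inside braces."""
    lines = [f"{graph_name} {{"]
    for s, p, o in triples:
        lines.append(f"    {s} {p} {o} .")
    lines.append("}")
    return "\n".join(lines)

# A toy results cartridge recording one discovered association.
cartridge = to_trig("ex:resultsCartridge42", [
    ("ex:assoc1", "ex:riskFactor", '"blood lead"'),
    ("ex:assoc1", "ex:pValue", '"0.004"'),
])
```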
          <p>The HAO ontology only imports the SemanticScience Integrated Ontology
(SIO), as we reuse several classes from SIO and utilize their object properties
to define associations between classes.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3 Learn more at https://www.w3.org/TR/trig/</title>
      <p>
        For other terms that we reuse from large
ontologies such as the National Cancer Institute Thesaurus (NCIT), the
Statistical Methods Ontology (STATO), and the Ontology of Biological and Clinical
Statistics (OBCS), we apply the Minimum Information to Reference an
External Ontology Term (MIREOT, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) technique to include terms. HAO
combines terminology from statistical, scientific, and biomedical ontologies to model
a reusable and modular health analysis pipeline. Additionally, to provide
information on the intended usage of classes, we maintain metadata such as
definitions (skos:definition) and descriptions (rdfs:description) on our ontology classes.
We have tested the logical correctness of HAO by reasoning with the HermiT
reasoner [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] in Protege [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The HAO ontology can be explored via online
documentation4 generated using the Widoco [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] tool.
      </p>
      <sec id="sec-5-1">
        <title>Cartridges</title>
        <p>Cartridges can be grouped into two categories, input and output, with further
subdivisions given in Table 1. Fig. 1 gives a high-level summary of the cartridge
framework. In practice, cartridges are implemented as named graph collections
(in TriG format) encapsulating instances of ontology classes that, when grouped
together, represent different modules of an analysis workflow. Further, cartridges
are constructed using terms from the ontologies listed in Table 2. Domain-specific
choices (e.g., choice of confounders or cohort inclusion criteria) about cartridge
contents are adapted from published studies and linked with provenance. In the
case that outdated or inaccurate knowledge is retired, this provenance shows
which cartridges need to be updated.</p>
        <p>Currently, input cartridges must be manually defined by domain specialists,
but output cartridges are generated automatically after analysis. Minimal
modification is needed to allow an input cartridge to be applied to a different analytics
question. Cartridges can be edited to allow for the flexible tailoring of a health
analysis pipeline to discover new subpopulations (stato:0000203 - cohort) and to
identify new outcomes or test different response variables (hao:TargetVariable). For
example, creating a new analysis of hypertension based on a type 2 diabetes
analysis requires only a simple edit of the response cartridge; the other input
cartridges remain the same. We maintain analysis-related concepts in HAO,
and for cartridges such as the subpopulation cartridge that require domain-specific
terminology, we directly reference terms from ontologies in the field within the
cartridge. Additionally, as shown in Fig. 1, cartridges contain links to the other
cartridges that were used to generate them, to allow for easy traversal of all the
components of a workflow.</p>
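        <p>The hypertension-from-diabetes example above amounts to a cartridge swap. In this hypothetical Python fragment, cartridges are plain dictionaries; the keys and variable contents are illustrative only.</p>

```python
from copy import deepcopy

def adapt_study(input_cartridges: dict, new_response: dict) -> dict:
    """Reuse all input cartridges from an existing study, swapping only
    the response cartridge (keys are illustrative, not HAO terms)."""
    study = deepcopy(input_cartridges)
    study["response"] = new_response
    return study

# Hypothetical published study of type 2 diabetes risk.
diabetes_study = {
    "response": {"variable": "glycohemoglobin", "controls": ["age", "BMI"]},
    "cohort": {"inclusion": "all NHANES subjects"},
    "risk_factor": {"category": "heavy metals"},
}

# A hypertension study reuses the cohort and risk-factor cartridges unchanged.
hypertension_study = adapt_study(
    diabetes_study,
    {"variable": "systolic blood pressure", "controls": ["age", "BMI"]},
)
```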
        <sec id="sec-5-1-1">
          <title>Precision risk with supervised cadres</title>
          <p>
            Our method for precision risk is the supervised cadre model [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], which
simultaneously discovers subpopulations and learns their risk models.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4 https://therensselaeridea.github.io/hao-ontology/WidocoDocumentation/doc/index-en.html</title>
      <p>Table 1. Types of cartridges used in the STA framework.
Input cartridges:
Response - Health condition concepts and background domain axioms necessary to model a given analysis
Cohort - Inclusion criteria used to determine if a given subject may be included in the user's study, which can be chosen or adapted from existing studies
Risk factor - How categories of semantically-similar risk factors should be modeled
Parameters - Rules to complete the chosen analysis workflow and potential hyperparameter configurations for the chosen model
Output cartridges:
Model - The hyperparameters used to train a model, the parameter estimates learned during training, and the rules by which it is applied to new observations
Subpopulation - Summary statistics characterizing discovered subpopulations, including within-subpopulation variable means and rates
Results - Quantification of subpopulation-specific discovered associations between the risk factor and the response variable using regression coefficients, standard errors, and p-values</p>
      <p>Table 2. Ontologies currently used in STA. The usage of these ontologies is described in Sections 3.1 and 3.2.
Health Analysis Ontology (hao) - Inform analysis design, summarize analysis results for comparison, and generate reports
Study Cohort Ontology (sco) - Represent cohort variables and control/intervention groups in Cohort Summary Tables of observational case studies and clinical trials
Children's Health Exposure Analysis Resource (chear) - Represent the inclusion of environmental exposures in health research
The Statistical Methods Ontology (stato) - Represent concepts and properties related to statistical methods and analysis
Semanticscience Integrated Ontology (sio) - Provide an upper-level ontology (types, relations) for consistent knowledge representation across physical, process, and information entities
National Cancer Institute Thesaurus (ncit) - NCIT is an authoritative reference terminology in the cancer domain, but in our case we leverage its broad coverage and use it to refer to terminology in model-related parameters
Ontology for Biomedical Investigations (obi) - Annotate biomedical investigations, including the study design, protocols used, the data generated, and the types of analysis performed on the data
The PROV Ontology (prov) - Model provenance information for different applications and domains
Ontology of Biological and Clinical Statistics (obcs) - Represent additional biostatistics terms not in OBI
DC Terms (dct) - Specify all metadata terms maintained by the Dublin Core Metadata Initiative
Simple Knowledge Organization System (skos) - Define the new terms in the HAO</p>
    </sec>
    <sec id="sec-10">
      <p>We use the terms subpopulation-specific and precision interchangeably. The SCM is applied
during STAGE II of Algorithm 1. Subpopulations, which we call cadres, are subsets
of the population defined with respect to a cadre-assignment rule learned by
the SCM. Subjects in the same cadre have the same association with a given
risk factor. In STA, the chosen parameter and response cartridges set up the
appropriate SCM and describe how to tune its hyperparameters. Optimal model
parameters and hyperparameters are written to a model cartridge, which can be
applied to novel subject records to determine their cadre.</p>
      <p>We outline the SCM for multivariate regression and binary classification. When
trained on a set of subject records {x_n} ⊂ R^P and response values {y_n}, the
SCM divides the observations into a set of M cadres. Each cadre m is
characterized by a center c_m ∈ R^P and a linear regression function e_m parameterized by
weights w_m ∈ R^P and a bias w_m0 ∈ R. New observations x have (for multivariate
regression) an aggregate regression score (e.g., a subject's expected total
cholesterol level) or (for binary classification) an aggregate risk score (e.g., the logit
of their probability of having prediabetes) given by
f(x) = Σ_{m=1}^{M} g_m(x) e_m(x),
where g_m(x) is the probability x belongs to cadre m, and e_m(x) is the regression
or risk score for x were it known to belong to cadre m. These have the form
g_m(x) = exp(−γ ||x − c_m||_d^2) / Σ_{m'} exp(−γ ||x − c_{m'}||_d^2) and e_m(x) = (w_m)^T x + w_m0.</p>
      <p>
        Here, ||z||_d = (Σ_p |d_p| (z_p)^2)^{1/2} is a seminorm parameterized by d ∈ R^P, and
γ &gt; 0 is a hyperparameter. SCM parameters are obtained by applying
stochastic gradient descent to a survey-weighted loss function based on mean squared
error or logistic loss, along with elastic net [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] regularization to improve
interpretability. The hyperparameters are chosen via a grid-search procedure and
recorded in the chosen parameters cartridge. Compared to other nonlinear
machine learning techniques, SCMs are more interpretable because of their
within-subpopulation linearity. Examining the properties of each subpopulation and
linear prediction model can yield significant insights. We have prototyped a
system using the shiny R package that interacts with the user to design and conduct
a study and then automatically generates interactive reports with text and
figures explaining the results, driven by the results cartridges and other external
domain-specific linked data. Sample results are presented in the next section.
      </p>
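      <p>The aggregate score f(x) can be computed directly from the cadre centers, weights, and seminorm parameters. The sketch below follows the reconstructed equations, writing the sharpness hyperparameter γ as gamma; it is an illustration, not the authors' implementation.</p>

```python
import math

def seminorm_sq(z, d):
    """||z||_d^2 = sum_p |d_p| * (z_p)^2, the squared seminorm."""
    return sum(abs(dp) * zp * zp for dp, zp in zip(d, z))

def scm_score(x, centers, weights, biases, d, gamma):
    """f(x) = sum_m g_m(x) * e_m(x): softmax cadre memberships g_m over
    seminorm distances to each center, times per-cadre linear scores e_m."""
    dists = [seminorm_sq([xi - ci for xi, ci in zip(x, c)], d) for c in centers]
    exps = [math.exp(-gamma * dist) for dist in dists]
    total = sum(exps)
    g = [e / total for e in exps]  # cadre-membership probabilities
    e_scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
                for w, b in zip(weights, biases)]
    return sum(gm * em for gm, em in zip(g, e_scores))
```

<p>With a single cadre, g_1(x) = 1 and the score reduces to the ordinary linear model (w_1)^T x + w_10; with several cadres, the softmax memberships blend the per-cadre linear scores.</p>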
      <sec id="sec-10-1">
        <title>Results</title>
        <p>
          We present two risk analyses to identify subpopulation-specific environmental
exposure factors associated with total cholesterol (TC) and prediabetic-or-worse
glycohemoglobin levels (prediabetes). Elevated levels of serum lipids such as
TC are recognized as risk factors for cardiovascular disease, and associations
between TC and environmental exposure levels were identified previously [
          <xref ref-type="bibr" rid="ref2 ref23">2, 23</xref>
          ].
Other work also discovered associations between diabetes and environmental
risk factors [
          <xref ref-type="bibr" rid="ref10 ref16">16, 10</xref>
          ]. Thus, it is worthwhile to identify subpopulation-specific risk
factors associated with TC and prediabetes to improve health situations.
        </p>
        <p>
          We chose a set of input cartridges, shown in Table 3, for TC using control
variables from prior studies [
          <xref ref-type="bibr" rid="ref16 ref23">16, 23</xref>
          ]. We extract 201 environmental exposure
potential risk factors from NHANES 1999 to 2014, grouped into 17 classes such as
phthalates (PHT) or polyaromatic hydrocarbons (PAH). Each class of potential
risk factors has its own cartridge that describes its usage in analytics models.
However, on the GitHub repository5 we only host an example of the heavy metals
risk factor cartridge used in this analysis. The number of survey subjects that
have measurements for a given risk factor ranges from 1,406 to 15,218.
        </p>
        <p>With our input cartridges, we run Algorithm 1 for every potential risk factor.
Each risk factor is included in a single SCM that discovers subpopulations in
the data. In STAGE II of Algorithm 1, each discovered subpopulation has its
summary statistics written to a subpopulation cartridge to be stored in the
KG. Characteristics of subpopulations with significant positive associations are
visualized in Fig. 2A. In STAGE III of Algorithm 1, each subpopulation has a
survey-weighted GLM trained on it, and the risk factor's regression coefficient
and p-value are extracted. Due to the large number of hypothesis tests, false
discovery correction is applied to these p-values before assessing significance
at a threshold specified in the study's parameters cartridge (here, α = 0.02).</p>
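        <p>The paper does not name the specific false discovery correction used; assuming the common Benjamini-Hochberg procedure, the correction step can be sketched as follows.</p>

```python
def benjamini_hochberg(p_values, alpha=0.02):
    """Return the indices of hypotheses rejected at FDR level alpha using
    the Benjamini-Hochberg step-up rule (an assumed choice of correction)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # reject while the rank-th smallest p-value is at most rank*alpha/m
        if p_values[i] * m / rank > alpha:
            continue
        k_max = rank
    return sorted(order[:k_max])

# Toy example: four risk-factor tests, three survive correction at alpha=0.02.
significant = benjamini_hochberg([0.001, 0.30, 0.004, 0.015], alpha=0.02)
```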
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5 Visit: https://github.com/TheRensselaerIDEA/hao-ontology</title>
      <p>Table 3. Chosen input cartridges for the TC risk study. The prediabetes risk study uses the same cohort, risk factor, and parameters cartridges, with a different response cartridge.
Response - TC is a continuous response variable; subjects' age, Body Mass Index (BMI), Poverty Income Ratio (PIR), smoking habits, drinking habits, gender, marital status, and education level should be controlled for
Cohort - All available NHANES subjects
Risk factor - 201 environmental exposure risk factors divided into 17 categories
Parameters - Standardize risk factor measurements; train models with M = 1, 2, and 3 cadres and choose the best one using BIC for model selection; significance threshold of α = 0.02 for GLM hypothesis tests</p>
      <p>We have presented a semantically-targeted analytics framework via which risk
factors specific to a subpopulation may be discovered in datasets. With the
supervised cadre machine learning method, we simultaneously discover
subpopulations and identify their significant risk factors. To support this we built a novel
Health Analysis Ontology that captures analytics and health domain knowledge.</p>
      <p>HAO and other ontologies provide structure for defining the cartridges that are
used for modular analysis pipelines. We leverage this semantic modeling to
dynamically construct and execute a risk model and interpret results. Using STA,
the system provides explainable insights for future population health studies in
a scientifically rigorous and reproducible way.</p>
      <p>In STA, statistical findings and parameters are encoded in results cartridges
and written back to the KG, enabling retrieval for further study. Cartridges
provide semantic extensions that enable a KG system to apply inference to solve
domain-specific analytics problems. By publishing the results cartridges, studies
become reproducible and explainable with provenance. Researchers with new
analysis methods can readily compare results with prior studies, using the same
workflow on the same problems. They can adapt existing peer-reviewed studies
to new diseases by editing cartridges in published workflows.</p>
      <p>We report here on semantically-targeted analytics applied to population
health studies that rapidly enables new findings from the ongoing NHANES
database. Using new cartridges, STA can readily be adapted to other types of
statistical analysis on other data sources such as electronic health care records.</p>
      <sec id="sec-11-1">
        <title>Acknowledgements</title>
        <p>Created with support by IBM Research AI through the AI Horizons Network.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Al-Baltah</surname>
            ,
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghani</surname>
            ,
            <given-names>A.A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>W.N.W.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A classification of semantic conflicts in heterogeneous web services at message level</article-title>
          .
          <source>Turkish Journal of Electrical Engineering &amp; Computer Sciences</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aminov</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavuk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpenter</surname>
            ,
            <given-names>D.O.</given-names>
          </string-name>
          :
          <article-title>Analysis of the effects of exposure to polychlorinated biphenyls and chlorinated pesticides on serum lipid levels in residents of Anniston, Alabama</article-title>
          .
          <source>Environ Health</source>
          <volume>12</volume>
          ,
          <issue>108</issue>
          (Dec
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          Centers for Disease Control and Prevention (CDC):
          <article-title>National Health and Nutrition Examination Survey</article-title>
          (
          <year>2017</year>
          ), http://www.cdc.gov/nchs/nhanes/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Courtot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lister</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malone</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schober</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brinkman</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruttenberg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>MIREOT: The minimum information to reference an external ontology term</article-title>
          .
          <source>Applied Ontology</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>23</fpage>
          –
          <lpage>33</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Why "population health"?</article-title>
          .
          <source>Canadian Journal of Public Health</source>
          <volume>86</volume>
          (
          <issue>3</issue>
          ),
          <volume>162</volume>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>WIDOCO: a wizard for documenting ontologies</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <fpage>94</fpage>
          –
          <lpage>102</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gietz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>What is a data cartridge?</article-title>
          In:
          <source>Data Cartridge Developer's Guide</source>
          , chap. 1, pp.
          <fpage>1</fpage>
          –
          <lpage>17</lpage>
          .
          Oracle Corporation
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Heeringa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berglund</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Applied Survey Data Analysis</article-title>
          . Chapman and Hall/CRC, 2nd edn. (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Knublauch</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergerson</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>The Protégé OWL plugin: An open development environment for Semantic Web applications</article-title>
          . In: International Semantic Web Conference. pp.
          <fpage>229</fpage>
          –
          <lpage>243</lpage>
          . Springer (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Association of urinary cadmium with risk of diabetes: a meta-analysis</article-title>
          .
          <source>Environmental Science and Pollution Research</source>
          <volume>24</volume>
          (
          <issue>11</issue>
          ),
          <fpage>10083</fpage>
          –
          <lpage>10090</lpage>
          (Apr
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lloyd</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duvenaud</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosse</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Automatic construction and natural-language description of nonparametric regression models</article-title>
          .
          <source>In: Proc. of the Twenty-Eighth AAAI Conf</source>
          . pp.
          <fpage>1242</fpage>
          –
          <lpage>1250</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duret</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gualandi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peitsch</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pospisil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Building an R&amp;D chemical registration system</article-title>
          .
          <source>J Cheminform</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <volume>11</volume>
          (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>New</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          :
          <article-title>A precision environment-wide association study of hypertension via supervised cadre models</article-title>
          .
          <source>IEEE Journal of Biomedical and Health Informatics</source>
          (
          <year>2019</year>
          ), to appear
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>New</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breneman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          :
          <article-title>Cadre modeling: Simultaneously discovering subpopulations and predictive models</article-title>
          .
          <source>In: 2018 Intl. Joint Conf. on Neural Networks (IJCNN)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          (
          <year>July 2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Nural</surname>
            ,
            <given-names>M.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Automated Predictive Big Data Analytics Using Ontology Based Semantics</article-title>
          .
          <source>Int J Big Data</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>43</fpage>
          –
          <lpage>56</lpage>
          (Oct
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharya</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butte</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>An environment-wide association study (EWAS) on type 2 diabetes mellitus</article-title>
          .
          <source>PLOS ONE</source>
          <volume>5</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          (May
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Survey of modular ontology techniques and their applications in the biomedical domain</article-title>
          .
          <source>Integr Comput Aided Eng</source>
          <volume>16</volume>
          (
          <issue>3</issue>
          ),
          <fpage>225</fpage>
          –
          <lpage>242</lpage>
          (Aug
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Patterson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldini</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mojsilovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varshney</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          :
          <article-title>Teaching machines to understand data science code by semantic enrichment of dataflow graphs</article-title>
          . CoRR abs/1807.05691 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          ,
          <fpage>489</fpage>
          –
          <lpage>508</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Shearer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>HermiT: A highly-efficient OWL reasoner</article-title>
          .
          <source>In: Owled</source>
          . vol.
          <volume>432</volume>
          , p.
          <volume>91</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Steinrucken</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lloyd</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>The automatic statistician</article-title>
          .
          <source>In: Automatic Machine Learning: Methods, Systems</source>
          , Challenges. pp.
          <fpage>175</fpage>
          –
          <lpage>188</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          , et al.:
          <article-title>Taking human out of learning applications: A survey on automated machine learning</article-title>
          (
          <year>2018</year>
          ), https://arxiv.org/abs/1810.13306
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          , et al.:
          <article-title>Phthalate exposure and high blood pressure in adults: a cross-sectional study in China</article-title>
          .
          <source>Env. Sci. and Pollution Research</source>
          <volume>25</volume>
          (
          <issue>16</issue>
          ),
          <fpage>15934</fpage>
          –
          <lpage>15942</lpage>
          (Jun
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Regularization and variable selection via the elastic net</article-title>
          .
          <source>Journal of the Royal Stat. Society: Series B</source>
          <volume>67</volume>
          (
          <issue>2</issue>
          ),
          <fpage>301</fpage>
          –
          <lpage>320</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>