Ontology-enabled Analysis of Study Populations

Shruthi Chari^1 [0000-0003-2946-7870], Miao Qi^1 [0000-0002-2917-0965], Nkechinyere N. Agu^1 [0000-0003-1386-8602], Oshani Seneviratne^1 [0000-0001-8518-917X], James P. McCusker^1 [0000-0003-1085-6059], Kristin P. Bennett^1 [0000-0002-8782-105X], Amar K. Das^2 [0000-0003-3556-0844], and Deborah L. McGuinness^1 [0000-0001-7037-4567]

^1 Rensselaer Polytechnic Institute, Troy, NY 12180, USA
^2 IBM Research, Cambridge, MA

Abstract. We address the problem of modeling study populations in research studies in a declarative manner. Research studies often vary greatly in how they report population descriptions. To make study populations easily accessible for decision making related to study applicability, we demonstrate the use of our ontology-enabled prototype system in different applications. Our system leverages our Study Cohort Ontology and the related cohort Knowledge Graph (as described in our accepted resource track paper). We address three retrospective population analysis scenarios, designed to determine study match, expose study limitations, and evaluate study quality. We also provide visualizations that compare a patient (or patient population) to a treatment arm. In addition, for each guideline recommendation that depends upon a study, we provide a summary of the relevant study's cohort description. We describe some of our applications and their potential impacts.

Resource Website: https://tetherless-world.github.io/study-cohort-ontology/

Keywords: Ontology Development · Analytics supported by Knowledge Graphs · Determination of Study Applicability

1 Introduction^3

Treatment recommendations in Clinical Practice Guidelines (CPG) are often supported by evidence from clinical trials and observational case studies (collectively referred to as research studies).
When medical practitioners are determining whether a study applies to their patient, they may consider the similarity of the study population to their patient. Characteristics of study populations are reported in tabular form, often in the first table of a research study, popularly called Table 1. These Table 1s contain summarized descriptive statistics of characteristics (e.g., demographic and anthropometric properties) for a set of study subjects belonging to treatment arms in the study. We developed the Study Cohort Ontology (SCO), reusing terms from existing biomedical ontologies to build a conceptualization of the study subject collections, the associated characteristics, and the descriptive statistics on them. Through RDF knowledge graphs (KGs) modeled on SCO, we represented the Table 1s of 20 research studies that are cited in the Pharmacologic Interventions (Chapter 8) and the Cardiovascular Complications (Chapter 9) chapters of the American Diabetes Association (ADA) Standards of Medical Care 2018 CPG.^4 Our ontology development and KG modeling approach are described in greater detail in our resource track paper, and in this poster we elaborate on the applications enabled by our semantic modeling.

^3 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Guided by the suggestions and validation of the medical practitioner on our team, we designed three clinically relevant scenarios that our ontology-enabled system addresses: (1) study match: determine if a study population is similar to a given patient, (2) study limitation: expose population underrepresentations, and (3) study quality evaluation: analyze Table 1s to check for conformance to required best practices.
Additionally, through KGs, we support cohort similarity visualizations, in which we overlay patient records against the treatment arms and which serve as quick comparisons for a study match. SCO also contributes to the larger goal of the Health Empowerment by Analytics, Learning, and Semantics (HEALS) project.^5 In HEALS, we are also developing a guideline provenance (G-Prov) ontology to model the provenance behind guideline recommendations. We depict a use case that integrates the two ontologies to summarize the population descriptions of research studies backing a guideline recommendation.

^4 View the ADA 2018 CPG at: https://diabetesed.net/wp-content/uploads/2017/12/2018-ADA-Standards-of-Care.pdf
^5 HEALS: https://idea.tw.rpi.edu/projects/heals

2 Related Work

Existing ontologies (e.g., [2,3]) for scientific literature (medical in particular) have largely focused on study design methods and do not specifically address cohort modeling scenarios. ProvCaRe, an "Ontology for provenance + healthcare research" [3], has some level of support for study data, which is limited to the study inclusion and exclusion criteria. However, its vocabulary does not handle the granularity and associations necessary to model the characteristics and descriptive statistics recorded on study populations. Through SCO, we support the modeling of aggregations on study populations at a disease-agnostic level. Additionally, our semantic modeling of the baseline characteristics of populations enables retrospective population analyses and other patient matching capabilities, as described in Section 3.

3 Applications

3.1 Population Analysis Scenarios

Each of our population analysis scenarios (introduced in Section 1) is implemented by SPARQL queries against our cohort KGs. Our resource website provides an example SPARQL query for a competency question from each of the scenarios.
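
As an illustration of how a study match competency question could be framed, the following Python sketch assembles a SPARQL query that selects treatment arms whose mean ± standard deviation range for a characteristic covers a given patient value. It is not one of the queries from the resource website: the `ex:` namespace, every property name, and the `build_study_match_query` helper are hypothetical placeholders, not actual SCO terms.

```python
import math
from string import Template

# Placeholder vocabulary: the ex: namespace and all properties below are
# hypothetical stand-ins and do NOT correspond to actual SCO terms.
STUDY_MATCH_TEMPLATE = Template("""\
PREFIX ex: <http://example.org/cohort#>
SELECT ?study ?arm ?mean ?sd
WHERE {
  ?arm  ex:partOfStudy        ?study ;
        ex:hasCharacteristic  ?c .
  ?c    ex:characteristicName "$name" ;
        ex:mean               ?mean ;
        ex:standardDeviation  ?sd .
  FILTER ( $value >= ?mean - ?sd && $value <= ?mean + ?sd )
}
""")


def build_study_match_query(name: str, value: float) -> str:
    """Return a SPARQL query for arms whose mean +/- SD range covers `value`."""
    # Reject characters that would break out of the quoted SPARQL literal.
    if any(ch in name for ch in '"\\\n\r'):
        raise ValueError("characteristic name contains an unsafe character")
    # NaN and infinity have no plain numeric literal form in SPARQL.
    if not math.isfinite(value):
        raise ValueError("patient value must be a finite number")
    return STUDY_MATCH_TEMPLATE.substitute(name=name, value=repr(float(value)))


query = build_study_match_query("age", 65)
```

The resulting string could then be sent to any SPARQL endpoint. Swapping the `FILTER` clause for a threshold on population size or on an age range would give analogous sketches for the study limitation scenario.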
During our bottom-up approach to constructing SCO and the cohort KGs, we found that the content of Table 1s can broadly be modeled as collections of study subjects (i.e., study arms or categorical variables such as race), their characteristics, and the descriptive statistics associated with the characteristics recorded on these collections. These templates, which are elaborated on in our resource track paper, provide the flexibility to easily frame SPARQL queries that identify study bias, ascertain study match, and determine study quality. We were able to find a general underrepresentation of older adults above 70 (≈ 50%) and a lack of large clinical trial studies (with population size ≥ 1000). Our current set of competency questions serves as a proof of concept of our ability to draw interesting and medically relevant conclusions about study populations that can assist medical practitioners. We are working on adding more competency questions to each class of population analysis scenario.

3.2 Cohort Similarity Visualizations

Another use case is our visualizations: star plots that serve as quick determiners of patient fit to a study and are generated on a per-patient and per-study-arm basis. These plots are generated by a Python script that triggers a SPARQL query to a Blazegraph endpoint to retrieve continuous variables of treatment arms from the cohort KG. The script also retrieves patient data for the same set of variables. In the absence of actual EHR data, we evaluated using representative diabetic patients from the National Health and Nutrition Examination Survey (NHANES) dataset.^6 We are exploring other visualization strategies for categorical variables. As seen from the star plot in Fig. 1, we map the distributional spread of the variables from the treatment arms (i.e., mean ± standard deviation, median, and interquartile range) against patient values for each of these variables.
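
The comparison behind a single star-plot axis can be sketched as a range check of the patient value against the arm's mean ± standard deviation. This is a minimal illustration and not the plotting script itself: the `ArmStatistic` class and the `within_spread` function are hypothetical names, and the age figures are those reported for the Metformin arm in Fig. 1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArmStatistic:
    """Summary of one continuous variable for one treatment arm (hypothetical)."""
    variable: str
    mean: float
    sd: float

    def interval(self) -> Tuple[float, float]:
        """Return the (mean - sd, mean + sd) range drawn on the star-plot axis."""
        return (self.mean - self.sd, self.mean + self.sd)


def within_spread(stat: ArmStatistic, patient_value: float) -> bool:
    """True if the patient value lies inside the arm's mean +/- SD range."""
    low, high = stat.interval()
    return low <= patient_value <= high


# Age of the Metformin arm (53 +/- 14) and an approximate patient age of 65.
metformin_age = ArmStatistic("age", mean=53.0, sd=14.0)
patient_fits = within_spread(metformin_age, 65.0)
```

Repeating the check for every continuous variable retrieved from the cohort KG gives a per-axis summary of patient fit that mirrors what the star plot conveys visually.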
For example, on the age axis, we see that participants in the Metformin arm had an average age of 53 ± 14 and that the patient's age (≈ 65) fell within this spread.

3.3 Visualizing the application of G-Prov and SCO

In general, population health managers (PHMs) aim to improve the overall health outcomes of a patient population by monitoring the features that affect its health. To assist PHMs in monitoring the health of their patient population, we integrate SCO with the G-Prov ontology. G-Prov is used to capture the provenance of CPG recommendations and to link that provenance to the cited research studies that back the recommendations. We represented the Table 1s of these studies in cohort KGs and displayed the information using graphs, charts, and tables. This view (as seen in Fig. 1) provides a PHM with a quick way of visualizing the population descriptions within research studies backing CPG recommendations.

^6 Dataset Information Page: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015

Fig. 1. A) Screenshot of the PHM view visualizing the statistical spread of two categorical variables, i.e., race and gender. B) Star plot overlaying a patient record against that of the Metformin study arm from the same "10-Year Follow-up of Intensive Glucose Control in Type 2 Diabetes" [1] study.

4 Conclusion

We have introduced some applications of our ontology-enabled prototype system, which supports analyses via standardized representations of cohort descriptions in KGs. Our applications address the use case of making evidence-based medicine resources accessible. In our case, study populations are made more accessible for medical practitioners dealing with the treatment of complex patients. We have released SCO as an open-source resource along with our documented use case applications.
Additionally, we are expanding our ontology-enabled system to make it more scalable, including techniques for automatic extraction of cohort descriptions from studies and support for a larger array of applications.

Acknowledgements

This work is partially supported by IBM Research AI through the AI Horizons Network. We thank our colleagues from IBM Research, Dan Gruen, Morgan Foreman, and Ching-Hua Chen, and from RPI, John Erickson, Alexander New, and Rebecca Cowan, who greatly assisted the research.

References

1. Holman, R.R., Paul, S.K., Bethel, M.A., Matthews, D.R., Neil, H.A.W.: 10-year follow-up of intensive glucose control in type 2 diabetes. New England J. Medicine 359(15), 1577–1589 (2008)
2. Sim, I., Tu, S.W., Carini, S., Lehmann, H.P., Pollock, B.H., Peleg, M., Wittkowski, K.M.: The Ontology of Clinical Research (OCRe): an informatics foundation for the science of clinical research. J. Biomed. Informatics 52, 78–91 (2014)
3. Valdez, J., Kim, M., Rueschman, M., Socrates, V., Redline, S., Sahoo, S.S.: ProvCaRe semantic provenance knowledgebase: evaluating scientific reproducibility of research studies. In: AMIA Annu. Symp. Proc. vol. 2017, p. 1705. Amer. Med. Inform. Assoc., Washington D.C., USA (2017)