=Paper=
{{Paper
|id=Vol-2164/paper4
|storemode=property
|title=Knowledge Engineering Framework to Quantify Dependencies between Epidemiological and Biomolecular Factors in Breast Cancer
|pdfUrl=https://ceur-ws.org/Vol-2164/paper4.pdf
|volume=Vol-2164
|authors=Iuliia Innokenteva,Richard Hammer,Dmitriy Shin
|dblpUrl=https://dblp.org/rec/conf/semweb/InnokentevaHS18
}}
==Knowledge Engineering Framework to Quantify Dependencies between Epidemiological and Biomolecular Factors in Breast Cancer==
<pdf width="1500px">https://ceur-ws.org/Vol-2164/paper4.pdf</pdf>
<pre>
   Knowledge Engineering Framework to Quantify
Dependencies between Epidemiological and Biomolecular
               Factors in Breast Cancer

            Iuliia Innokenteva^1, Richard Hammer^2,1 and Dmitriy Shin^2,1
          ^1 MU Informatics Institute, ^2 Department of Pathology and Anatomical Sciences
                   University of Missouri, 1 Hospital Dr. Pathology, Med Sci Bldg,
                                     Columbia, MO, 65203, USA
                                *Email: shindm@health.missouri.edu

   Abstract. Understanding the relationship between social determinants of health (SDoH)
and chronic disease risk is crucial for disease prevention. Such associations are relatively
easy to uncover for simple diseases such as obesity or heart disease. For complex diagnoses
like cancer, however, a large number of factors contribute to the onset of the disease. For
instance, there is increasing evidence that biomolecular factors of cancer can be influenced
by behavioral and environmental patterns: several subtypes of breast cancer that respond to
different hormonal therapies can arise from different lifestyle, social, and physiological
risk factors. Cancer registries and electronic health records (EHRs) are widely used as
sources of health data in epidemiological research. Because EHR data are collected by health
professionals, they reduce research costs and cover the whole population. However, these
records are not collected primarily for research, so data adjustment issues can arise, and
the structure of the records is often unsuitable for building an epidemiological model. To
fit data from the EHR and Cancer Registry to an epidemiological model, we propose a knowledge
engineering method that constructs the structure of a Bayesian Network (BN) using controlled
vocabularies. Specifically, we selected fields from the records and used the National Cancer
Institute (NCI) Thesaurus to determine the nodes of the BN structure. We demonstrate the
utility of this approach on a cohort of University of Missouri Hospital (UMH) patients who
were diagnosed with breast cancer.

        Keywords: Breast Cancer, Epidemiology, Controlled Vocabulary, Ontology,
        Bayesian Network, Knowledge Engineering, Biomolecular factors, Hormone
        Receptors.


1       Introduction

  Recently, EHRs have been widely used as a source of health data for epidemiological
research. Readily available data collected in accordance with health facilities’ standards
help reduce research costs and save time. A systematic review by Casey et al.
shows that extract, transform, load (ETL) tools are mainly used to make health data
suitable for researchers (Casey, et al., 2016). Common data models (CDMs) such
as the Observational Medical Outcomes Partnership (OMOP), the FDA Sentinel Initiative, and
the Patient Centered Outcome Research Network (PCORNet) are based on the ETL
approach (Califf, 2014; Carnahan, et al., 2014; Kahn, et al., 2012). These CDMs
aim to integrate and adjust health data from diverse sources such as health care
providers, pharmacies, and laboratories. The adjustment step of the CDM technique brings
the data into a consistent format using controlled vocabularies (e.g., the same variable
names and attributes) (Resnic, et al., 2015). However, this approach is not well suited to
the selection of pertinent variables for designing specific research studies, especially in
the domain of complex diseases such as cancer.
   According to the World Health Organization (WHO), one-third of all cancer cases
can be prevented by changing diet, quitting smoking, getting hepatitis
vaccinations, and exercising regularly. Breast cancer is the most commonly diagnosed
cancer worldwide and particularly in the USA (WHO, 2016). Still, it is one of the
cancer types that can be partially prevented by lifestyle modification (CDC,
2018). There are specific subtypes of breast cancer, which are characterized by different
hormonal patterns. The receptors most commonly used in breast cancer diagnosis and
treatment are the estrogen receptor (ER), the progesterone receptor (PR),
and human epidermal growth factor receptor 2 (HER2). Breast cancer is divided into
subtypes according to the presence or absence of these receptors in the tumor. For
instance, luminal cancer tends to be ER positive, whereas basal-like breast cancer is
usually triple negative and is the most challenging type of the disease.
   An association between lifestyle factors and these two breast cancer subtypes is shown
in the study by Butler et al., in which smoking is positively associated with luminal
cancer and has almost no effect on basal-like cancer (Butler et al., 2016). A similar study
considered obesity as a risk factor and found a significant association between triple
negative breast cancer and obesity (Turkoz, et al., 2013). An analysis of smoking and
ER-positive cancer showed that current smokers are more susceptible to ER-positive
breast cancer (Odds Ratio [OR]=1.6) than ever smokers (OR=1.4), but there was no
difference between the two groups in triple negative cancer risk (Kawai, Malone, Tang,
& Li, 2014). Statistical evidence of an association between obesity and ER-positive
cancer was reported by Nechuta et al.; the same research showed a strong correlation
between alcohol consumption and ER-positive breast cancer (Nechuta, et al.,
2016). However, some studies reported that a higher body mass index increases breast
cancer risk independently of menopausal status and estrogen receptor (ER) expression
(Schairer, et al., 2013; Wada et al., 2014). Another interesting finding is that urban
women have a higher incidence of ER-positive breast cancer than rural women
(incidence rate ratio [IRR]=3.36) (Dey, et al., 2009). After reviewing the literature
described above, we determined potential risk factors for all subtypes of breast cancer.
Smoking, alcohol consumption, and obesity were chosen as initial variables for our
research. Additionally, we considered the most common comorbidities, such as
hypertension and diabetes, as risk factors.
   Combining molecular biology approaches with epidemiological studies can help to
determine the causes of a certain subtype of breast cancer. Bayesian Networks can be
instrumental in modeling such processes. A BN is a graphical model that represents
relationships between factors and their probabilities. The model is usually used to predict
disease risk depending on certain factors (Rosa, et al., 2015). Each variable is
represented as a node of the BN and has several mutually exclusive states. By changing the
states of the independent variables and setting a dependent variable as the target, we can
predict an outcome.
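   As a minimal sketch of this prediction step, assume a toy network in which two
hypothetical risk-factor nodes (Smoking, Obesity) point to a binary target node. The
node names, states, and all probabilities below are illustrative placeholders, not
values from this study; the posterior of the target is obtained by enumeration.

```python
from itertools import product

# Hypothetical toy network: Smoking -> Pattern <- Obesity.
# All probabilities are illustrative placeholders, not study results.
p_smoking = {"yes": 0.3, "no": 0.7}
p_obesity = {"yes": 0.4, "no": 0.6}
# P(Pattern | Smoking, Obesity); each row sums to 1.
p_pattern = {
    ("yes", "yes"): {"ER+": 0.8, "ER-": 0.2},
    ("yes", "no"): {"ER+": 0.7, "ER-": 0.3},
    ("no", "yes"): {"ER+": 0.6, "ER-": 0.4},
    ("no", "no"): {"ER+": 0.5, "ER-": 0.5},
}


def joint(smoking, obesity, pattern):
    """Joint probability from the BN factorization (product of local tables)."""
    return p_smoking[smoking] * p_obesity[obesity] * p_pattern[(smoking, obesity)][pattern]


def posterior(target_states, evidence):
    """P(Pattern | evidence), summing over parent states consistent with the evidence."""
    scores = {}
    for t in target_states:
        total = 0.0
        for s, o in product(p_smoking, p_obesity):
            assignment = {"Smoking": s, "Obesity": o}
            if all(assignment[k] == v for k, v in evidence.items()):
                total += joint(s, o, t)
        scores[t] = total
    z = sum(scores.values())  # normalizing constant P(evidence)
    return {t: v / z for t, v in scores.items()}


result = posterior(["ER+", "ER-"], {"Smoking": "yes"})
```

Setting evidence on one node and leaving the other unobserved mirrors the way
independent variables are fixed while the target node is queried.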
   Still, it is not a trivial process to select appropriate entities from an EHR to determine
nodes for a BN model. Specifically, there has to be a protocol to determine the appropriate
level of granularity for those entities. For instance, several fields in an EHR system might
have to be aggregated to represent a node in the BN model.
   To address this problem, we aim to create a knowledge engineering framework that
utilizes controlled vocabularies such as ontologies and thesauri. The BN nodes determined
through this process are then connected in a structure to compute conditional
probabilities. The BN model can then be used to quantify and predict factors that
influence the hormonal patterns of breast cancer, which can lead to better patient care.


2       Methods

    The pipeline of the knowledge engineering process is shown in Figure 1.


     Fig. 1. Knowledge engineering process for prediction of breast cancer hormonal patterns.

    Data are selected from the EHR and Cancer Registry based on epidemiological
knowledge about breast cancer. A number of factors can contribute to the
onset of breast cancer, including demographic, socio-economic, physiological, and
mental factors. For this research we used the risk factors that were available in UMH
EHRs and Cancer Registry records. An ontology is used to determine which EHR fields
can be aggregated. We used the National Cancer Institute (NCI) Thesaurus to select
potential cancer risk factors that could later be retrieved from the EHR. For example,
according to the NCI Thesaurus, the concepts ‘Type 1 Diabetes Mellitus’ and ‘Type 2
Diabetes Mellitus’ are child concepts of the ‘Diabetes Mellitus’ concept. Thus, depending
on the epidemiological context, those variables can be aggregated into one BN node
‘Diabetes Mellitus’ with possible values ‘Type 1’, ‘Type 2’, ‘Undefined Diabetes’, and
‘No History of Diabetes’.
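   The aggregation in the Diabetes Mellitus example can be sketched as follows. The
concept names and node values come from the example in the text; the dictionary
layout, the function name, and the rule that conflicting or unspecific codes map to
‘Undefined Diabetes’ are assumptions made for illustration only.

```python
# Hypothetical excerpt of a parent-child hierarchy (child concept -> parent concept).
PARENT = {
    "Type 1 Diabetes Mellitus": "Diabetes Mellitus",
    "Type 2 Diabetes Mellitus": "Diabetes Mellitus",
}
# Node state assigned to each child concept.
CHILD_STATE = {
    "Type 1 Diabetes Mellitus": "Type 1",
    "Type 2 Diabetes Mellitus": "Type 2",
}
NODE_CONCEPT = "Diabetes Mellitus"


def diabetes_node_state(ehr_concepts):
    """Collapse a patient's coded EHR concepts into one state of the BN node."""
    relevant = [
        c for c in ehr_concepts
        if c == NODE_CONCEPT or PARENT.get(c) == NODE_CONCEPT
    ]
    if not relevant:
        return "No History of Diabetes"
    specific = {CHILD_STATE[c] for c in relevant if c in CHILD_STATE}
    if len(specific) == 1:
        return specific.pop()
    # Only the parent concept was recorded, or both types were recorded (assumed rule).
    return "Undefined Diabetes"
```

Because the four returned values are mutually exclusive and cover every patient, they
can serve directly as the states of a single BN node.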
   To generate the BN structure, we used epidemiological knowledge and the literature
review presented in the introduction. In addition to history of obesity, tobacco use, and
alcohol consumption, we included comorbidities such as diabetes and hypertension. We
also added a race variable to make the causality pattern more representative. The
generated BN structure and its nodes are shown in Figure 2.


                             Fig. 2. Expert-based BN structure

   For the generated structure, we learned the parameters from a dataset of 980 UMH
patients diagnosed with breast cancer. Then, by setting the ‘Hormonal_Pattern’ node as
the target and setting different values for the other nodes, we could simulate cases and
predict the hormonal pattern of breast cancer depending on behavioral, health, and social
factors (Figure 3).
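   The text does not state which learning algorithm or software was used. One common
option, shown here only as a sketch, is to estimate each conditional probability table
from record counts with additive (Laplace) smoothing. The function name, the
‘Tobacco_History’ column name, and the three toy records are hypothetical.

```python
from collections import Counter, defaultdict


def learn_cpt(records, child, parents, child_states, alpha=1.0):
    """Estimate P(child | parents) from records using counts with additive smoothing."""
    counts = defaultdict(Counter)
    for record in records:
        parent_values = tuple(record[p] for p in parents)
        counts[parent_values][record[child]] += 1
    cpt = {}
    for parent_values, child_counts in counts.items():
        total = sum(child_counts.values()) + alpha * len(child_states)
        cpt[parent_values] = {
            state: (child_counts[state] + alpha) / total for state in child_states
        }
    return cpt


# Toy records for illustration only; they are not patient data.
toy_records = [
    {"Tobacco_History": "Current", "Hormonal_Pattern": 1},
    {"Tobacco_History": "Current", "Hormonal_Pattern": 1},
    {"Tobacco_History": "Current", "Hormonal_Pattern": 2},
]
toy_cpt = learn_cpt(toy_records, "Hormonal_Pattern", ["Tobacco_History"], [1, 2])
```

Smoothing keeps unseen combinations from receiving probability zero, which matters
when a parent configuration is rare in a cohort of this size.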


                                  Fig. 3. Example of simulation


3      Results and Discussion

   Using UMH Cancer Registry data, we identified a cohort of 1070 patients who were
diagnosed with breast cancer after 2013. Information about race, history of tobacco and
alcohol use, estrogen receptor (ER), progesterone receptor (PR), and human epidermal
growth factor receptor 2 (HER2) status was obtained from Cancer Registry data. Information
about history of obesity, diabetes, and hypertension was added from UMH EHRs.
Hormonal patterns were defined as the eight combinations of positive or negative ER, PR,
and HER2 values (Table 1). During the data cleaning process, 90 cases were removed
because of missing values.

                         Table 1. Combinations of hormonal patterns
         Hormonal Pattern   ER value (+/-)      PR value (+/-)     HER2 value (+/-)
            1                  +                   +                  +
            2                  +                   -                  -
            3                  +                   +                  -
            4                  +                   -                  +
            5                  -                   -                  -
            6                  -                   +                  +
            7                  -                   -                  +
            8                  -                   +                  -
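   The encoding in Table 1 can be written as a direct lookup from the three receptor
values to the pattern number. The mapping itself is taken from Table 1; the names
PATTERNS and hormonal_pattern are hypothetical.

```python
# (ER, PR, HER2) -> hormonal pattern number, transcribed from Table 1.
PATTERNS = {
    ("+", "+", "+"): 1,
    ("+", "-", "-"): 2,
    ("+", "+", "-"): 3,
    ("+", "-", "+"): 4,
    ("-", "-", "-"): 5,
    ("-", "+", "+"): 6,
    ("-", "-", "+"): 7,
    ("-", "+", "-"): 8,
}


def hormonal_pattern(er, pr, her2):
    """Return the Table 1 pattern number for one patient's receptor values."""
    return PATTERNS[(er, pr, her2)]
```

Under this encoding, the triple negative case (ER-, PR-, HER2-) is pattern 5.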

   Table 2 contains five randomly selected cases with different node values. The
simulation results for these five cases are presented in Table 3 as probabilities
of the different combinations of ER, PR, and HER2 values.

                                   Table 2. Cases for simulation process
  Case#  Race   History_of_Obesity  History_of_Diabetes  Hypertension  Alcohol History  Tobacco History
  1      Black  Yes                 Type 2               Yes           Current          Current
  2      White  No                  No_History           No            Never            Never
  3      Black  No                  No_History           No            Never            Never
  4      Black  No                  Type 1               Yes           Never            Previous
  5      White  Yes                 Type 2               Yes           Current          Current


            Table 3. Probabilities (%) of each hormonal pattern for the simulation cases
  Case#  Pattern 1  Pattern 2  Pattern 3  Pattern 4  Pattern 5  Pattern 6  Pattern 7  Pattern 8
  1      50         50         0          0          0          0          0          0
  2      7          10         77         0          3          0          3          0
  3      0          31         56         0          8          0          5          0
  4      13         11         0          0          76         0          0          0
  5      50         50         0          0          0          0          0          0


   The results of simulating different cases with certain values show that some variables
influence the ‘hormonal pattern’ outcome more than others. By changing values one at a
time, we can see which nodes have the largest effect on the hormone receptor pattern.
This approach can be used to predict the risk of a certain subtype of breast cancer
depending on a variety of factors. In a best-case scenario, we could predict the risk of
triple negative breast cancer, which is the most challenging subtype of the disease in
terms of response to therapy.
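   The one-at-a-time comparison described above can be sketched as follows, assuming
a function that returns the predicted pattern distribution for a case. The effect
measure (total variation distance from the baseline prediction) and the toy predictor
are assumptions chosen for illustration; they are not the procedure or the model used
in this study.

```python
def one_at_a_time(predict, baseline, node_states):
    """For each node, report the largest shift in the predicted distribution
    obtained by changing only that node while the others keep baseline values."""
    base = predict(baseline)
    effects = {}
    for node, states in node_states.items():
        largest = 0.0
        for state in states:
            case = dict(baseline)
            case[node] = state
            dist = predict(case)
            # Total variation distance between the two distributions.
            shift = 0.5 * sum(abs(dist[k] - base[k]) for k in base)
            largest = max(largest, shift)
        effects[node] = largest
    return effects


def toy_predict(case):
    """Hypothetical stand-in for BN inference; the numbers are placeholders."""
    if case["Tobacco"] == "Current":
        return {"Pattern1": 0.5, "Pattern2": 0.5}
    return {"Pattern1": 0.2, "Pattern2": 0.8}


effects = one_at_a_time(
    toy_predict,
    {"Tobacco": "Never", "Race": "White"},
    {"Tobacco": ["Never", "Previous", "Current"], "Race": ["White", "Black"]},
)
```

Ranking the nodes by this score gives a simple ordering of which factors move the
predicted hormonal pattern the most for a given baseline case.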
   Using the knowledge engineering pipeline presented in this study, one can add
variables from different sources and aggregate them using an ontology. For instance, EHRs
contain patients’ addresses, which can be a useful source of information in an
epidemiological sense. The thesaurus has a class called ‘Group’, which is divided into
‘rural/underserved population’ and ‘urban population’. To extract this information,
the nominal ‘address’ variable from the EHR needs to be converted to a rural/urban
categorical variable. It can then be included in the epidemiological model to predict
breast cancer subtype depending on patients’ place of residence, which serves as a proxy
for access to health care.
   In the epidemiological model of breast cancer hormonal patterns, we did not include
all possible predictors of the disease, such as age, marital status, age at menarche, age
at menopause, and number of pregnancies. The purpose of the model is to show that the
pipeline can be utilized for population health research.
   Future research can validate the results of this study. Using statistical data
analysis tools such as STATA, one can analyze associations between nodes and test them
for statistical significance.


4      Conclusion

   To address the problem of determining the granularity of data entities from
different sources, we created a knowledge engineering pipeline. By utilizing the pipeline,
we could modify types of information from EHR and Cancer Registry records using a
controlled vocabulary such as the NCI Thesaurus. We converted those variables into a form
that is useful for epidemiological models. The pipeline is not limited to cancer
epidemiology. It can be used for other population health research
aimed at studying health care access, behavioral patterns, treatment or public health
program effectiveness, and many other aspects.


    References
Butler, E. N., Tse, C.-K., Bell, M. E., Conway, K., Olshan, A. F., & Troester, M. A.
      (2016). Active smoking and risk of Luminal and Basal-like breast cancer
      subtypes in the Carolina Breast Cancer Study. Cancer Causes & Control : CCC.
      https://doi.org/10.1007/s10552-016-0754-1
Califf, R. M. (2014). The Patient-Centered Outcomes Research Network. North Carolina
      Medical Journal, 75(3), 204-210. doi:10.18043/ncm.75.3.204
Carnahan, R. M., Bell, C. J., & Platt, R. (2014). Active Surveillance: The United States
      Food and Drug Administration's Sentinel Initiative. Mann's Pharmacovigilance,
      429-437. doi:10.1002/9781118820186.ch2
Casey, J. A., Schwartz, B. S., Stewart, W. F., & Adler, N. E. (2016). Using Electronic
      Health Records for Population Health Research: A Review of Methods and
      Applications. Annual Review of Public Health, 36, 61-81.
Centers for Disease Control and Prevention. Breast Cancer. (2018, May 22). Retrieved
      from https://www.cdc.gov/cancer/breast/index.htm
Dey, S., Soliman, A. S., Hablas, A., Seifeldin, I. A., Ismail, K., Ramadan, M., . . .
      Merajver, D. (2009, June 23). Urban–rural differences in breast cancer incidence by
      hormone receptor status across 6 years in Egypt. Retrieved from
      https://link.springer.com/article/10.1007/s10549-009-0427-9,
      https://doi.org/10.1007/s10549-009-0427-9
Kahn, M. G., Batson, D., & Schilling, L. M. (2012). Data Model Considerations for
      Clinical Effectiveness Researchers. Medical Care, 50.
      doi:10.1097/mlr.0b013e318259bff4
Kawai, M., Malone, K. E., Tang, M.-T. C., & Li, C. I. (2014). Active smoking and the
      risk of estrogen receptor-positive and triple-negative breast cancer among women
      ages 20 to 44 years. Cancer, 120(7), 1026–1034.
      https://doi.org/10.1002/cncr.28402
Nechuta, S., Chen, W. Y., Cai, H., Poole, E. M., Kwan, M. L., Flatt, S. W., … Ou Shu,
      X. (2016). A pooled analysis of post-diagnosis lifestyle factors in association with
      late estrogen-receptor-positive breast cancer prognosis. International Journal of
      Cancer. https://doi.org/10.1002/ijc.29940
Resnic, F., Robbins, S., Denton, J., Nookala, L., Meeker, D., Ohno-Machado, L., . . .
      Fitzhenry, F. (2015). Creating a Common Data Model for Comparative
      Effectiveness with the Observational Medical Outcomes Partnership. Applied
      Clinical Informatics, 06(03), 536-547. doi:10.4338/aci-2014-12-cr-0121
Rosa, C. M. I., Simões, P. W., Doneda, G., Silva, D., Moretti, G. P., Simon, C. S., …
      Rosa, M. I. (2015). Meta analysis of the use of Bayesian networks in breast cancer
      diagnosis. Cad. Saúde Pública, 31(311), 26–3826. https://doi.org/10.1590/0102-
      311X00205213
Schairer, C., Li, Y., Frawley, P., Graubard, B. I., Wellman, R. D., Buist, D. S. M., …
      Miglioretti, D. L. (2013). Risk factors for inflammatory breast cancer and other
      invasive breast cancers. Journal of the National Cancer Institute.
      https://doi.org/10.1093/jnci/djt206
Turkoz, F. P., Solak, M., Petekkaya, I., Keskin, O., Kertmen, N., Sarici, F., …
      Altundag, K. (2013). Association between common risk factors and molecular
      subtypes in breast cancer patients. The Breast, 22(3), 344–350.
      https://doi.org/10.1016/J.BREAST.2012.08.005
Wada, K., Nagata, C., Tamakoshi, A., Matsuo, K., Oze, I., Wakai, K., … Research
      Group for the Development and Evaluation of Cancer Prevention Strategies in
      Japan. (2014). Body mass index and breast cancer risk in Japan: a pooled analysis
      of eight population-based cohort studies. Annals of Oncology : Official Journal
      of the European Society for Medical Oncology / ESMO.
      https://doi.org/10.1093/annonc/mdt542
World Health Organization. Breast cancer: Prevention and control. (2016, January 21).
       Retrieved from http://www.who.int/cancer/detection/breastcancer/en/

</pre>