=Paper=
{{Paper
|id=Vol-2429/paper2
|storemode=property
|title=Analysing the Heterogeneity of Rule-Based EHR Phenotyping Algorithms in CALIBER and the UK Biobank
|pdfUrl=https://ceur-ws.org/Vol-2429/paper2.pdf
|volume=Vol-2429
|authors=Spiros Denaxas,Helen Parkinson,Natalie Fitzpatrick,Cathie Sudlow,Harry Hemingway
|dblpUrl=https://dblp.org/rec/conf/ijcai/DenaxasPFSH19
}}
==Analysing the Heterogeneity of Rule-Based EHR Phenotyping Algorithms in CALIBER and the UK Biobank==
<pdf width="1500px">https://ceur-ws.org/Vol-2429/paper2.pdf</pdf>
<pre>
    Analyzing the heterogeneity of rule-based EHR phenotyping algorithms in
                        CALIBER and the UK Biobank

    Spiros Denaxas (1,2,3), Helen Parkinson (2,4), Natalie Fitzpatrick (1,2,3), Cathie Sudlow (2,5),
    Harry Hemingway (1,2,3)

    1 Institute of Health Informatics, University College London, UK
    2 Health Data Research UK London/Cambridge/Scotland, UK
    3 UCL Hospitals Biomedical Research Center, London, UK
    4 European Bioinformatics Institute, Cambridge, UK
    5 Centre for Medical Informatics, Usher Institute of Population Health Science and Informatics,
      University of Edinburgh, Edinburgh, UK

    s.denaxas@ucl.ac.uk, parkinso@ebi.ac.uk, n.fitzpatrick@ucl.ac.uk, Cathie.Sudlow@ed.ac.uk,
    h.hemingway@ucl.ac.uk


Copyright © 2019 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Electronic Health Records (EHR) are data generated during routine interactions across healthcare
settings and contain rich, longitudinal information on diagnoses, symptoms, medications,
investigations and tests. A primary use-case for EHR is the creation of phenotyping algorithms
used to identify disease status, onset and progression or extraction of information on risk
factors or biomarkers. Phenotyping however is challenging since EHR are collected for different
purposes, have variable data quality and often require significant harmonization. While
considerable effort goes into the phenotyping process, no consistent methodology for representing
algorithms exists in the UK. Creating a national repository of curated algorithms can potentially
enable algorithm dissemination and reuse by the wider community. A critical first step is the
creation of a robust minimum information standard for phenotyping algorithm components (metadata,
implementation logic, validation evidence) which involves identifying and reviewing the
complexity and heterogeneity of current UK EHR algorithms. In this study, we analyzed all
available EHR phenotyping algorithms (n=70) from two large-scale contemporary EHR resources in
the UK (CALIBER and UK Biobank). We documented EHR sources, controlled clinical terminologies,
evidence of algorithm validation, representation and implementation logic patterns.
Understanding the heterogeneity of UK EHR algorithms and identifying common implementation
patterns will facilitate the design of a minimum information standard for representing and
curating algorithms nationally and internationally.

1. Introduction

In the United Kingdom (UK), structured electronic health records (EHR) spanning primary care,
hospital care, disease/procedure registries and death registries are used to create longitudinal
disease phenotypes for observational research studies [Hemingway et al., 2018]. Through a process
called phenotyping, researchers create algorithms which utilize multiple EHR sources to
accurately extract information on diseases (e.g. status, onset and progression), lifestyle risk
factors and biomarkers [Banda et al., 2018]. Phenotyping however is challenging due to the fact
that EHR are fragmented, curated using different controlled clinical terminologies and collected
for purposes other than research (e.g. reimbursement, audit) [Morley et al., 2014].

Phenotyping requires a significant amount of resources and mix of expertise, yet no common
standard approach for defining, validating and ultimately sharing EHR phenotyping algorithms
currently exists. In the UK, structured primary care EHR have been used in >1,800 peer-reviewed
studies to date but only 5% of studies published sufficiently reproducible phenotypes [Springate
et al., 2014]. Defining a standardized format to represent EHR phenotypes will enable
portability across data sources (and healthcare systems) and facilitate the systematic sharing of
algorithms across the community [Mo et al. 2015].

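To make the notion of a rule-based phenotyping algorithm described above concrete, the following
is a minimal Python sketch; it is not code from CALIBER or the UK Biobank. It assumes a
hypothetical Event record and a hypothetical CODELISTS mapping. The Read entry is a placeholder
rather than a real code, while I48 and J45 are the ICD-10 categories for atrial fibrillation and
flutter and for asthma. The rule is a simple Boolean OR across primary care, hospital and death
registry records, with onset taken as the earliest matching date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical codelists, one per EHR source. "I48" is the ICD-10 category for
# atrial fibrillation and flutter; the Read entry is a placeholder, not a real code.
CODELISTS = {
    "primary_care": {"READ_AF_PLACEHOLDER"},
    "hospital": {"I48"},
    "death_registry": {"I48"},
}


@dataclass(frozen=True)
class Event:
    source: str  # "primary_care", "hospital" or "death_registry"
    code: str    # term from a controlled clinical terminology
    when: date   # date the term was recorded


def phenotype(events: List[Event]) -> Tuple[bool, Optional[date]]:
    """Return (has_disease, onset_date) using a Boolean OR across all sources."""
    matches = [e.when for e in events if e.code in CODELISTS.get(e.source, set())]
    return (True, min(matches)) if matches else (False, None)


record = [
    Event("primary_care", "READ_AF_PLACEHOLDER", date(2010, 5, 1)),
    Event("hospital", "I48", date(2012, 3, 9)),
    Event("hospital", "J45", date(2009, 1, 1)),  # asthma code, not in the codelist
]
status, onset = phenotype(record)
```

For the three example events, status is true and onset is the primary care date, because the
asthma record does not match any codelist. The algorithms reviewed in this paper layer nested
Boolean, temporal and negation rules on top of this basic pattern.
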
Compared to the United States (US), the UK EHR research landscape differs in two important ways:
1) researchers can utilize multiple national EHR sources to create longitudinal ‘cradle to grave’
phenotypes [Kuan et al., 2019], and 2) UK primary care EHR contain both healthy and unhealthy
individuals, which allows researchers to capture information on disease severity and progression
over time. A recent systematic review identified 66 different definitions used to capture asthma
status and exacerbations in research using UK EHR [Al Sallakh et al., 2017], demonstrating
significant existing heterogeneity. While analyses have been undertaken in the US to characterize
the heterogeneity of phenotyping algorithms [Conway et al., 2011], no such analysis has been
carried out in the UK.

One of the aims of the newly-established national institute for health data science, Health Data
Research UK (HDR UK, www.hdruk.ac.uk), is the creation of a national Phenomics Resource: an
open-access online resource where EHR phenotypes can be deposited and curated. A critical first
step in this process is to establish a minimum information standard for representing EHR
phenotyping algorithms. This involves exploring and documenting the complexity, heterogeneity,
design and implementation patterns of contemporary phenotyping algorithms in the UK. The concept
of a minimum information standard has been used successfully in other biomedical disciplines,
e.g. Minimum Information About a Microarray Experiment (MIAME) defines standards for reporting
microarray experiments [Brazma et al., 2001]. Establishing a standardized method for representing
phenotypes in the UK can potentially address these challenges and ensure compatibility with other
international initiatives such as eMERGE and PCORNet [Fleurence et al. 2014; Gottesman et al.
2013].

2. Aims
Despite the widespread use of UK EHR data sources for research, contemporary research resources
utilize different approaches for algorithm creation, curation and validation. The aims of this
study were to: a) identify and characterize the structural components, implementation logic and
heterogeneity of rule-based algorithms defining diseases, lifestyle risk factors and biomarkers
in structured national EHR in the UK utilized by contemporary research resources, and b) propose
a minimum information standard to represent UK EHR phenotyping algorithms.

3. Methods
We identified, downloaded and reviewed published phenotyping algorithms for diseases, biomarkers
and lifestyle risk factors from two large-scale contemporary UK research resources: UK Biobank
(http://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=42) and CALIBER
(https://www.caliberresearch.org/portal/phenotypes).

The UK Biobank [Sudlow et al., 2015] is a prospective cohort study of 500,000 adults (aged 40-69
at recruitment) recruited in England, Scotland and Wales from 2006-2010. For each participant,
deep phenotypic and genotypic information is available including biomarkers in blood and urine,
imaging (brain, heart, abdomen, bone, carotid artery), lifestyle indicators, pathophysiological
measurements and genome-wide genotype data. Follow-up for health outcomes is enabled by hospital
EHR (Hospital Episode Statistics (HES) in England, Patient Episode Data Warehouse in Wales and
Scottish Morbidity Registry in Scotland) and linkages to primary care EHR are underway. CALIBER
[Denaxas et al., 2012; Denaxas et al., 2019] is a research resource consisting of algorithms,
tools and methods for structured EHR linked across primary care (Clinical Practice Research
Datalink, CPRD), hospital care (HES) and mortality data (Office for National Statistics, ONS) in
the UK.

In the UK, national EHR are recorded using controlled clinical terminologies where terms are
assigned at variable timepoints, i.e. in UK primary care the physician records terms in real time
during the consultation with the patient whereas in hospital care terms are retrospectively
entered into databases by trained coders and data selected for billing purposes. We identified
and counted the number of ontology terms each algorithm utilizes from five controlled clinical
terminologies which are widely used in the UK: a) Read (primary care, subset of SNOMED-CT), b)
International Classification of Diseases 9th and 10th Revision (ICD-9, ICD-10, secondary care
diagnoses and cause of mortality), c) OPCS Classification of Interventions and Procedures
(OPCS-4, hospital surgical procedures, analogous to the Current Procedural Terminology ontology
used in the United States), and d) the Dictionary of Medicines and Devices (DM+D) which is used
to record primary care prescriptions. Terms were automatically extracted from documents and
counted using regular expressions in Python 3.6 (https://www.python.org/). We manually extracted
and counted terms across five randomly chosen algorithms to verify the automatically-generated
counts.

EHR phenotype validation is a critical process guiding the subsequent use of algorithms and we
were interested in what types, if any, of evidence were available to external researchers. We
classified the available material into six non-overlapping categories which encapsulate all potential
approaches for obtaining validity evidence (adapted from [Denaxas et al., 2019] and recorded as
used/not used):
• Aetiological: Are the prospective associations with risk factors consistent with previous
    published evidence from both EHR and non-EHR studies?
• Prognostic: Are the risks of subsequent events plausible and consistent with existing domain
    knowledge?
• Case-note review: What is the positive predictive value (PPV) and the negative predictive
    value (NPV) when comparing the algorithm with clinician-led review of case notes,
    self-reported information or a suitable “gold standard” source?
• Cross-EHR-source concordance: To what extent is the phenotype concordant across EHR sources?
• Genetic: Are the observed genetic associations plausible and consistent in terms of magnitude
    and direction of association with associations reported from non-EHR studies?
• External populations: Has the algorithm been evaluated in different countries or external
    sources?

For each algorithm, we documented the EHR sources the phenotype is derived from (i.e. primary
care, hospital care, mortality register). We extracted information on the representation
components of phenotypes e.g. the presence of tabular data and the use of a flowchart (or other
graphical presentation). We extracted and categorized information on the different types of
implementation logic, temporality and algorithm implementation patterns (Table 1), partially
based on previous research in the US [Conway et al., 2011].

Concept              Definition                      Example
Simple Boolean       Simple Boolean statements,      PVD diagnosis during a primary care consultation OR
                     e.g. “AND”, “OR”                diagnosis of leg or aortic embolism or thrombosis
                                                     during a hospitalization
Complex Boolean      Nested statements with          IF patient = diabetic: HT threshold: SBP ≥140 mmHg
                     multiple layers?                OR DBP ≥90 mmHg ELSE: threshold SBP ≥150 mmHg OR
                                                     DBP ≥90 mmHg
Negation             Are negation statements used?   No AF diagnosis term is present, but the patient
                                                     record includes a warfarin prescription in the
                                                     absence of prior DVT or PE, or a digoxin
                                                     prescription but no HF
Temporal (simple)    Temporal proximity, future      Iron deficiency anaemia record in primary care OR
                     or past                         hospital AND endoscopy in 30 days
Temporal (complex)   Complex temporal rules,         ≥3 high SBP/DBP readings within 1-year OR ≥2 high
                     multiple logic layers           SBP/DBP readings in a 6-month period
Biomarker            Evidence from continuous        Presence of a positive rheumatoid factor test or
                     measurement                     anti-cyclic citrullinated peptide antibody test
                                                     after a rheumatoid arthritis diagnosis
Complex calculation  Calculation, e.g. unit          Calculate average BMI in consultation, exclude
                     conversion                      measurements <10 kg/m2 or >80 kg/m2

Table 1: Characteristics of implementation logic, temporality and algorithmic implementation
features extracted and analyzed from phenotyping algorithms in the UK Biobank and the CALIBER
resources. AF Atrial Fibrillation; BMI Body Mass Index; BP Blood Pressure; DBP Diastolic Blood
Pressure; DVT Deep Vein Thrombosis; PVD Peripheral Vascular Disease; PE Pulmonary Embolism; HT
Hypertension; HF Heart Failure; mmHg millimeter of mercury; SBP Systolic Blood Pressure.

4. Results
We identified and reviewed 70 EHR phenotyping algorithms (Table 2) available from the UK Biobank
(n=19) and the CALIBER resource (n=51). The majority of phenotyping algorithms were created to
ascertain disease status (n=54) (e.g. heart failure [Gho et al. 2018; Uijl et al. 2019],
depression [Daskalopoulou et al. 2016]), ten algorithms were created to extract information on
biomarkers (e.g. heart rate [Archangelidi et al. 2018], blood pressure [Rapsomaniki et al. 2014])
and six algorithms were used to identify lifestyle risk factors (e.g. alcohol [Bell et al. 2017],
smoking [Pujades-Rodriguez et al. 2015]).

All but one CALIBER phenotyping algorithm (n=50) used information from primary care EHR with the
exception of socioeconomic status which was defined using the Index of Multiple Deprivation (IMD)
provided by the ONS. Algorithms defining biomarker measurements (e.g. white blood cells, heart
rate) were based on primary care EHR entirely while approximately half of the algorithms
ascertaining disease status (n=19 of 35) combined information across all three EHR sources. All
currently available UK Biobank algorithms (n=19) combined information recorded during the
baseline assessment (data not shown), diagnoses and/or surgical procedures recorded during
hospitalization and information based on the underlying (or secondary) cause of death which is
recorded in the national mortality register. Primary care linkages in UK Biobank are still
underway and as a result none of the
currently available algorithms utilized information from primary care EHR. However, primary care
information for just under half of the cohort (n=230,000) will be made available for UK Biobank
researchers in June 2019. Algorithms incorporating primary care data for the conditions already
covered have been or are being developed [Wilkinson et al., 2019]. Along with a range of
additional algorithms expanding the range of health outcomes available, they will be available
from UK Biobank later in 2019. Overall, based on current publicly available information from
CALIBER and UK Biobank, 75% (n=66) of algorithms used data from secondary care EHR and 45% (n=49)
used information available in the death registry.

The most widely-used clinical terminology was Read with 4,729 (non-unique) terms used across all
algorithms while the second highest number of terms was derived from the DM+D with 2,273
(non-unique) terms used to record prescriptions in primary care EHR. Four algorithms (body mass
index, socioeconomic deprivation, sex, heart rate) did not use any terms across any terminology
systems and were based on information which is derived from a structured field of the EHR or
externally linked such as in the case of IMD. The atrial fibrillation algorithm used the highest
number of clinical terms (n=987) while across all algorithms the pregnancy phenotype used the
highest number of Read codes (n=1,948). ICD-9 was the terminology least used: in the UK Biobank
it is used for recording diagnoses in older Scottish hospital records and in CALIBER it is used
to record the cause of death prior to 1997. Algorithms defining biomarkers contained the lowest
number of terminology terms as they relied on structured data fields combined with a small number
of diagnosis terms to denote the type of test (e.g. Read code “42K..00 Eosinophil count”).

With regards to algorithm implementation logic, 66 (93%) of algorithms used Boolean statements,
usually to identify the presence of one or more diagnosis codes in a patient’s EHR. Where Boolean
statements were deployed, in nearly half of the cases these were complex and involved either a
series of nested statements or joined information across multiple sources, for example in the UK
Biobank where information is derived from self-reported, hospital and mortality sources and
events are further stratified as ‘prevalent’ (first reported prior to recruitment) or ‘incident’
(first reported after recruitment). A similar pattern of logic was observed with regards to
temporality where 66 algorithms utilized temporal rules and almost always this included more
complex statements and restrictions. Finally, approximately half (n=43) of the algorithms used
negation. Only ten algorithms (16%) included more complex calculations, usually to calculate the
mean of multiple measurements on the same day or to harmonize units for laboratory measurements
to a common format.

Prognostic 86% (n=66) and cross-source concordance 54% (n=43) validation approaches were the most
widely-used algorithm evaluation approaches. The least widely-used validation approach was expert
case note review, although this type of validation has been completed for a few UK Biobank
algorithms, including dementia and its subtypes [Wilkinson et al., 2019], and is underway for
several others. Most (93% [n=66]) of the algorithms used data stored in tabular format since
tables are predominantly used to store lists of controlled clinical terminology terms. Only 25%
(n=15) of algorithms included a graphical representation of the algorithm using a flowchart and
all algorithms included a textual description of the algorithm components.

5. Discussion
In this study we downloaded and reviewed 70 EHR phenotyping algorithms from two large-scale,
national research resources in the UK. We reviewed algorithms in terms of EHR data sources,
controlled clinical terminologies used, available evidence of algorithm validation, algorithm
representation formats and implementation logic patterns.

Similar to findings from US studies, we discovered that UK EHR algorithms make extensive use of
Boolean statements and temporal logic. When these are used, they are often complex, i.e.
combining multiple nested Boolean layers of logic and defining temporal proximity rules within
them. This is expected given that algorithms utilize multiple sources of information and include
evidence from primary care and hospital care (or self-reported information in the case of the UK
Biobank). Algorithms defining disease status were the most frequent and complex algorithms
reviewed and utilized the greatest number of terms from controlled clinical terminologies.
Negation was another major component of algorithms and is often used to exclude concomitant
diagnoses or procedures when trying to ascertain diseases based on secondary information (e.g.
ascertaining AF cases based on a prescription of digoxin but excluding patients who are diagnosed
with HF).

The Read clinical terminology was the most popular terminology used with the highest number of
terms per phenotype. These findings are expected as Read contains a significant amount of
duplication internally due to synonym terms which can be potentially utilized. Additionally, the
clinical concepts contained within Read subsume the concepts across all other terminologies, i.e.
Read contains terms for diagnoses, symptoms, laboratory tests, prescriptions and procedures. UK
primary care clinical coding is currently transitioning to SNOMED-CT which should provide a more
streamlined set of terms to be used.

In terms of validation, we observed a significant level of heterogeneity with approaches seeking
to evaluate and replicate previously reported aetiological and prognostic
estimates from non-EHR studies being the most popular. The presentation of the evidence however
does not follow a common standard and sometimes only included references to published research
rather than a more structured abstract of the main findings of the analyses. In contrast with the
US, expert review of case records was the least frequently used approach for evaluation due to
the fact that large-scale corpora of medical text do not exist in the UK owing to information
governance restrictions and the technical challenges of integrating such data since they are held
in a wide range of formats by multiple different NHS organisations. For similar reasons, none of
the algorithms reviewed utilize medical text and natural language processing approaches to
extract information from medical notes which is prevalent in some clinical specialties such as
mental health [Wu et al. 2018].

Significant heterogeneity was also observed in terms of representation. UK Biobank algorithms
were curated in individual PDF files (http://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=42) and
included extended information on the goal of the algorithm and useful background knowledge and
references. In contrast, CALIBER phenotypes were stored in an online, openly-available Portal
(https://www.caliberresearch.org/portal), spanned multiple pages and did not include much
background information. Flowcharts or similar graphical representations were not widely used and
while they are not machine-readable, they can potentially minimize errors during translation of
the algorithm to machine code.

Our study has potential limitations. We reviewed algorithms from only two UK sources. While other
UK initiatives exist, they tend to focus on curating lists of controlled clinical terminology
terms (referred to as codelists) rather than self-contained phenotypes, i.e. terms,
implementation, validation evidence. We only focused on rule-based approaches and did not cover
machine learning approaches. While rule-based methods are the most widely used in the UK,
data-driven high-throughput approaches including natural language processing methods are emerging
[Zhou et al., 2016, Pikoula et al., 2019]. These approaches pose different challenges and their
requirements would need to be documented and analysed in order to ensure their integration
[Hripcsak & Albers 2013]. Finally, reproducible research approaches [Denaxas et al., 2017,
Goodman et al., 2016] which are covered elsewhere would also need to be carefully taken into
consideration in order to ensure algorithm portability.

6. Steps towards a minimum information standard
Based on our findings, we propose that an EHR phenotyping algorithm representation combines
metadata, implementation logic, validation evidence and use-cases. We suggest the following
components towards establishing a minimum information standard with regards to rule-based
phenotyping algorithms for UK EHR:

Part 1 – Algorithm metadata: Succinct information about the goal of the algorithm, the intended
use-case, the data sources and controlled clinical terminologies used, applicable age groups and
genders, list of authors and their contact details and a set of SNOMED-CT terms to classify the
algorithm. A unique identifier, such as a Digital Object Identifier (DOI), should be minted to
enable usage tracking in subsequent research.

Part 2 – Implementation: Details on the implementation logic of the algorithm with pseudocode to
facilitate the translation to machine code and documentation on decisions made and reasoning.
Where possible analytical scripts should be attached using markdown or a similar approach. The
standard should support defining complex Boolean and temporal logic across multiple EHR sources
and clinical terminologies. In the future, a computable phenotype format should encapsulate this
information as a stand-alone file.

Part 3 – Validation evidence: Description of the steps taken to support phenotype validity across
six categories (aetiological, prognostic, genetic, expert review, cross-source and external
population). For each implementation, the number of cases, controls, NPV and PPV values should be
reported and the format should support the embedding of graphical files (e.g. forest plots).

Part 4 – Use-cases: Links to published research utilizing the phenotype algorithms,
cross-referenced with DOIs.

7. Conclusion
Our analyses identified a certain level of underlying homogeneity in terms of how phenotyping
algorithms are defined and evaluated. We suggest four components towards a minimum information
standard that should be used to represent phenotyping algorithms. These findings provide a
crucial first step towards curating and disseminating phenotyping algorithms utilizing UK EHR.
Further work is required towards establishing a computable format for phenotyping algorithms and
ensuring interoperability with other resources (e.g. PheKB).

Acknowledgments
This work was supported by Health Data Research UK, which receives its funding from HDR UK Ltd
(LOND1) funded by the UK Medical Research Council, Engineering and Physical Sciences Research
Council, Economic and Social Research Council, Department of Health and Social Care (England),
Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health
and Social Care Research and Development Division (Welsh Government), Public Health
Agency (Northern Ireland), British Heart Foundation and the Wellcome Trust. The
BigData@Heart Consortium is funded by the Innovative Medicines Initiative-2 Joint
Undertaking under grant agreement No. 116074. This study was supported by the Farr
Institute of Health Informatics Research at UCL Partners (MR/K006584/1). This paper
represents independent research part funded by the National Institute for Health
Research Biomedical Research Centre at UCLH. HH is a NIHR Senior Investigator. SD is an
Alan Turing Fellow.

References
[Al Sallakh et al., 2017] Al Sallakh, M. A., et al. Defining asthma and assessing asthma
   outcomes using electronic health record data: a systematic scoping review. Eur.
   Respiratory J., 49(6), 2017.
[Archangelidi et al., 2018] Archangelidi, O., et al. Clinically Recorded Heart Rate and
   Incidence of 12 Coronary, Cardiac, Cerebrovascular and Peripheral Arterial Diseases in
   233,970 Men and Women: A Linked Electronic Health Record Study. Eur. J. of Preventive
   Cardiology 25 (14): 1485–95, 2018.
[Banda et al., 2018] Banda, J. M., et al. Advances in Electronic Phenotyping: From
   Rule-Based Definitions to Machine Learning Models. Annual Review of Biomedical Data
   Science, 2018.
[Bell et al., 2017] Bell, S., et al. Association between Clinically Recorded Alcohol
   Consumption and Initial Presentation of 12 Cardiovascular Diseases: Population Based
   Cohort Study Using Linked Health Records. BMJ 356: j909, 2017.
[Brazma et al., 2001] Brazma, A., et al. Minimum information about a microarray
   experiment (MIAME)-toward standards for microarray data. Nature Genetics, 29(4),
   365–371, 2001.
[Conway et al., 2011] Conway, M., et al. Analyzing the heterogeneity and complexity of
   Electronic Health Record oriented phenotyping algorithms. Proc. Am Med Infor Assoc.,
   274–283, 2011.
[Daskalopoulou et al., 2016] Daskalopoulou, M., et al. Depression as a Risk Factor for
   the Initial Presentation of Twelve Cardiac, Cerebrovascular, and Peripheral Arterial
   Diseases: Data Linkage Study of 1.9 Million Women and Men. PLOS ONE 11 (4): e0153838,
   2016.
[Denaxas et al., 2012] Denaxas, S., et al. Data resource profile: cardiovascular disease
   research using linked bespoke studies and electronic health records (CALIBER). Int. J.
   Epidemiology, 41(6), 1625–1638, 2012.
[Denaxas et al., 2019] Denaxas, S., et al. UK phenomics platform for developing and
   validating EHR phenotypes: CALIBER. J Am Med Inf, 10.1093/jamia/ocz105, 2019.
[Denaxas et al., 2017] Denaxas, S., et al. Methods for enhancing the reproducibility of
   biomedical research findings using electronic health records. BioData Mining, 10 (31),
   2017.
[Fleurence et al., 2014] Fleurence, R., et al. Launching PCORnet, a National
   Patient-Centered Clinical Research Network. JAMIA 21 (4): 578–82, 2014.
[Gho et al., 2018] Gho, J., et al. An Electronic Health Records Cohort Study on Heart
   Failure Following Myocardial Infarction in England: Incidence and Predictors. BMJ Open
   8 (3): e018331, 2018.
[Goodman et al., 2016] Goodman, S.N., et al. What does research reproducibility mean?
   Science Translational Medicine, 8(341), 341ps12, 2016.
[Gottesman et al., 2013] Gottesman, O., et al. The Electronic Medical Records and
   Genomics (eMERGE) Network: Past, Present, and Future. Genetics in Medicine 15 (10):
   761–71, 2013.
[Hemingway et al., 2018] Hemingway, H., et al. Big data from electronic health records
   for early and late translational cardiovascular research: challenges and potential.
   European Heart J., 39(16), 1481–1495, 2018.
[Hripcsak & Albers, 2013] Hripcsak, G. & Albers, D.J. Next-generation phenotyping of
   electronic health records. JAMIA, 20(1), 117–121, 2013.
[Kuan et al., 2019] Kuan, V., et al. A chronological map of 308 physical and mental
   health conditions from 4 million individuals in the English National Health Service.
   The Lancet Digital Health 1(2), e63-e67, 2019.
[Mo et al., 2015] Mo, H., et al. Desiderata for Computable Representations of Electronic
   Health Records-Driven Phenotype Algorithms. JAMIA 22 (6): 1220–30, 2015.
[Morley et al., 2014] Morley, K., et al. Defining disease phenotypes using national
   linked electronic health records: a case study of atrial fibrillation. PLOS ONE, 9(11),
   e110900, 2014.
[Pikoula et al., 2019] Pikoula, M., et al. Identifying clinically important COPD
   sub-types using data-driven approaches in primary care population based electronic
   health records. BMC Medical Informatics and Decision Making, 19(1), p.86, 2019.
[Pujades-Rodriguez et al., 2015] Pujades-Rodriguez, M., et al. Heterogeneous Associations
   between Smoking and a Wide Range of Initial Presentations of Cardiovascular Disease in
   1937360 People in England: Lifetime Risks and Implications for Risk Prediction. Int. J.
   of Epidemiology 44 (1): 129–41, 2015.


[Rapsomaniki et al., 2014] Rapsomaniki, E., et al. Blood Pressure and Incidence of Twelve
   Cardiovascular Diseases: Lifetime Risks, Healthy Life-Years Lost, and Age-Specific
   Associations in 1·25 Million People. The Lancet 383 (9932): 1899–1911, 2014.
[Springate et al., 2014] Springate, D., et al. ClinicalCodes: an online clinical codes
   repository to improve the validity and reproducibility of research using electronic
   medical records. PLOS ONE, 9(6), e99825, 2014.
[Sudlow et al., 2015] Sudlow, C., et al. UK Biobank: an open access resource for
   identifying the causes of a wide range of complex diseases of middle and old age. PLOS
   Medicine, 12(3), e1001779, 2015.
[Uijl et al., 2019] Uijl, A., et al. Risk Factors for Incident Heart Failure in Age- and
   Sex-Specific Strata: A Population-Based Cohort Using Linked Electronic Health Records.
   Eur. J. Heart Failure, 10.1002/ejhf.1350, 2019.
[Wilkinson et al., 2019] Wilkinson, T., et al. Identifying dementia outcomes in UK
   Biobank: a validation study of primary care, hospital admissions and mortality data.
   Eur. J. Epidemiology, 10.1007/s10654-019-00499-1, 2019.
[Wu et al., 2018] Wu, H., et al. SemEHR: A general-purpose semantic search system to
   surface semantic data from clinical notes for tailored care, trial recruitment, and
   clinical research. JAMIA, 25(5), 530–537, 2018.
[Zhou et al., 2016] Zhou, S.-M., et al. Defining Disease Phenotypes in Primary Care
   Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying
   Rheumatoid Arthritis. PLOS ONE, 11(5), e0154515, 2016.


Columns, left to right, by group:
  Source:         PC, SC, MR
  Terminology:    Read, ICD10, ICD9, OPCS, DM+D
  Validation:     Prognosis, Aetiology, Source, Genetic, Case note, External
  Format:         Tabular, Flowchart
  Implementation: Boolean, Boolean, Negation, Temporal, Complex temporal, Biomarker,
                  Complex calculation

CALIBER
AAA              +     +      +   32      6       6     46      0       +           +           +                                          +                     +                                +           +
AD               +     +      +   36     17       7      0      0       +                       +                                          +                     +         +                      +           +
AF               +     +      +   523     5       0     396    63       +           +           +         +                                +          +          +         +           +          +           +
Alcohol          +                141     0       0      0      0       +                                                                  +          +          +         +           +          +           +
AMI              +     +      +   43     18      14      2      0       +           +           +         +                      +         +          +          +         +           +          +           +
AU               +     +          38      6       0      0     15       +                       +                                          +          +          +         +           +          +           +
Bleeding         +     +      +   131    14       0     17      0       +           +           +                    +                     +                     +         +                      +           +          +
BMI              +                 0      0       0      0      0       +                                                                                                                         +                      +             +
BP               +                67      0       0      0      0       +                                                                  +                                                      +                      +             +
CHD              +     +      +   30      8       9      0      0       +           +           +                                          +                     +         +                      +
Dementia NS      +     +      +   36     17       7      0      0       +                       +                                          +                     +         +                      +           +
Depression       +     +          152    15       0      0      0       +                       +                                          +                     +                                +           +
Deprivation                        0      0       0      0      0       +
Diabetes         +     +          141     4       0      0      0       +           +                                                      +          +          +         +           +          +           +
Eosinophils      +                 4      0       0      0      0       +                                                                  +                     +         +           +          +           +          +
Ethnicity        +     +          104     0       0      0      0       +                                                                  +          +          +         +           +
GCA              +     +           7      1       0      0     18       +                       +                                          +                     +         +           +          +           +
Gender           +                 0      0       0      0      0       +
HCM              +     +          81      2       0     41     557      +                       +                                          +          +          +         +           +          +           +
HDL              +                 4      0       0      0      0       +                                                                  +                     +         +                      +           +          +             +
HF               +     +      +   93      6       9      0      0       +           +           +                                          +          +          +         +           +          +           +
HIV              +     +      +   35     25       0      0              +                       +                                          +                     +                                +           +
HR               +                 0      0       0      0      0       +                                                                                        +                                +           +          +             +


HT                  +    +        84     5     0    2     0    +     +    +                     +          +                 +   +        +   +
ICH                 +    +   +    17     1     1    0     0    +     +    +                     +          +                 +   +
Influenza           +             62     0     0    0     0    +                                +          +
Isch. Stroke        +    +   +    10     1     2    0     0    +     +    +                     +          +                 +   +
LDL                 +              5     0     0    0     0    +                                +          +                 +   +        +   +
Lymphocytes         +             10     0     0    0     0    +                                +          +                 +   +        +   +
MS                  +    +        10     1     0    0    15    +          +                     +     +    +     +       +   +   +        +
Neutrophils         +              6     0     0    0     0    +                                +          +                 +   +        +   +
Obesity             +    +   +    105    1     0    50    0    +          +                     +          +                 +
PAD                 +    +   +    201    6     5    71    0    +     +    +                     +          +     +           +   +
BuP                 +    +        25     6     0    0    286   +          +                     +     +    +     +       +   +   +
PBC                 +    +         4     1     0    0    21    +          +                     +     +    +     +       +   +   +
PMR                 +    +         3     2     0    0    90    +          +                     +          +     +       +   +   +
Pregnancy           +    +       1948    0     0    0     0    +                                +          +
Psoriasis           +    +        82     0     0    0    453   +          +                     +     +    +     +       +   +   +
RA                  +    +        75    18     0    0    72    +          +                     +     +    +     +       +   +   +        +
SA                  +    +        181    3     0    67   674   +     +    +                     +     +    +     +           +   +
SAH                 +    +   +    11     1     1    0     0    +     +    +                     +          +             +   +
SCD                 +    +   +    32     6     2    16    0    +     +    +                     +          +     +       +   +   +
Scleroderma         +    +         5     5     0    0     9    +          +                     +     +    +     +       +   +   +        +
Smoking             +             21     3     0    0     0    +                                +          +     +       +   +
Stroke NS           +    +   +    17     6     3    1     0    +     +    +                     +          +             +   +
TIA                 +    +   +    15     2          0     0    +     +    +                     +          +             +   +
Triglycerides       +              6     0     0    0     0    +                                +          +                 +   +        +   +
UA                  +    +        12     4     0    0     0    +     +    +                     +          +     +       +   +
UCD                 +    +   +    32     6     2    16    0    +     +                          +          +             +   +
VD                  +    +   +    36    17     7    0     0    +          +                     +          +     +           +   +
WBC                 +             16     0     0    0     0    +                                +          +                 +   +        +   +
UK Biobank6
AD                 n/a +     +    n/a   32     9    0    n/a   +                     +          +          +     +       +   +   +
AMI                n/a +     +    n/a   23    17    0    n/a   +                                +          +             +   +   +
Asthma             n/a +     +    n/a    6     6    0    n/a   +                                +          +             +   +   +
COPD               n/a +     +    n/a   11     4    0    n/a   +                                +          +             +   +   +
Dementia NS        n/a +     +    n/a   32     9    0    n/a   +                     +          +          +     +       +   +   +
ESRD               n/a +     +    n/a   18     0    37   n/a   +                     +          +          +     +       +   +   +
FTD                n/a +     +    n/a   32     9    0    n/a   +                     +          +          +     +       +   +   +
ICH                n/a +     +    n/a   32     7    0    n/a   +                                +          +             +   +   +
Isch. Stroke       n/a +     +    n/a   32     7    0    n/a   +                                +          +             +   +   +
MND                n/a +     +    n/a    1     1    0    n/a   +                                +          +             +   +   +
MSA                n/a +     +    n/a   19     3    0    n/a   +                     +          +          +     +       +   +   +
NSTEMI             n/a +     +    n/a   23    17    0    n/a   +                                +          +             +   +   +

6 Primary care EHR available for participants in 2019; case-note review validation underway for multiple phenotypes.


Parkinsonism      n/a +     +    n/a    19    3     0    n/a    +                      +           +          +     +     +     +     +
PD                n/a +     +    n/a    19    3     0    n/a    +                      +           +          +     +     +     +     +
PSP               n/a +     +    n/a    19    3     0    n/a    +                      +           +          +     +     +     +     +
SAH               n/a +     +    n/a    32    7     0    n/a    +                                  +          +           +     +     +
STEMI             n/a +     +    n/a    23    17    0    n/a    +                                  +          +           +     +     +
Stroke NS         n/a +     +    n/a    32    7     0    n/a    +                                  +          +           +     +     +
VD                n/a +     +    n/a    32    9     0    n/a    +                      +           +          +     +     +     +     +
Table 2. Information on EHR data sources, controlled clinical terminologies, available evidence of algorithm
  validation, algorithm representation format and implementation logic patterns from UK Biobank and CALIBER
  EHR phenotype algorithms
AAA Abdominal Aortic Aneurysm; AD Alzheimer's Disease; AF Atrial Fibrillation; AMI Acute Myocardial Infarction; AU Autoimmune Uveitis; BMI
Body Mass Index; BP Blood Pressure; BuP Bullous Pemphigoid; CHD Coronary Heart Disease; FTD Frontotemporal dementia; GCA Giant Cell Arteritis;
HCM Hypertrophic Cardiomyopathy; HDL High Density Lipoprotein cholesterol; HF Heart Failure; HIV Human Immunodeficiency Virus; HR Heart Rate;
HT Hypertension; ICH Intracerebral Haemorrhage; LDL Low Density Lipoprotein cholesterol; MS Multiple Sclerosis; NS Not Specified; PAD Peripheral
Arterial Disease; PBC Primary Biliary Cirrhosis; PMR Polymyalgia Rheumatica; RA Rheumatoid Arthritis; SA Stable Angina; SAH Subarachnoid
Haemorrhage; SCD Sudden Cardiac Death; TIA Transient Ischaemic Attack; UA Unstable Angina; UCD Unheralded Coronary Death; VD Vascular
Dementia; WBC White Blood Cell Count; COPD Chronic Obstructive Pulmonary Disease; ESRD End Stage Renal Disease; MND Motor Neuron Disease;
PD Parkinson's Disease and Parkinsonism; MSA Multiple System Atrophy; PSP Progressive Supranuclear Palsy; STEMI ST-Elevation AMI; NSTEMI Non-
ST Elevation AMI
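
As a rough illustration of how rows of Table 2 might be held in a computable form, the sketch below encodes the Source and Terminology groups for two CALIBER phenotypes (AAA and Alcohol). The class name, field names and helper methods are hypothetical and are not part of CALIBER or the UK Biobank; the values are copied from the corresponding table rows, and the numbers are treated simply as the per-terminology figures listed there.

```python
from dataclasses import dataclass

# Terminology columns in the order they appear in Table 2.
TERMINOLOGIES = ("Read", "ICD10", "ICD9", "OPCS", "DM+D")


@dataclass
class PhenotypeEntry:
    """Hypothetical record for one Table 2 row (names are illustrative only)."""
    name: str
    sources: set        # subset of {"PC", "SC", "MR"}, as marked with "+" in the table
    term_counts: dict   # terminology -> figure listed in the table

    def terminologies_used(self):
        # Terminologies with a non-zero figure, kept in table column order.
        return [t for t in TERMINOLOGIES if self.term_counts.get(t, 0) > 0]

    def is_multi_source(self):
        # True when the phenotype draws on more than one EHR source.
        return len(self.sources) > 1


# Values transcribed from the AAA and Alcohol rows of Table 2.
aaa = PhenotypeEntry(
    name="AAA",
    sources={"PC", "SC", "MR"},
    term_counts={"Read": 32, "ICD10": 6, "ICD9": 6, "OPCS": 46, "DM+D": 0},
)
alcohol = PhenotypeEntry(
    name="Alcohol",
    sources={"PC"},
    term_counts={"Read": 141, "ICD10": 0, "ICD9": 0, "OPCS": 0, "DM+D": 0},
)

print(aaa.name, aaa.terminologies_used(), aaa.is_multi_source())
print(alcohol.name, alcohol.terminologies_used(), alcohol.is_multi_source())
```

A fuller representation would also need fields for the Validation, Format and Implementation groups; they are left out here because this sketch only aims to show the general shape of such a record.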



</pre>