Conversion of Bulgarian Observational Data to OMOP Common Data Model: Initial Results

Petko Kovachev 1, Evgeniy Krastev 1, Dimitar Tcharaktchiev 2, Emanuil Markov 3 and Ivan Evg. Ivanov 3

1 Sofia University “St. Kliment Ohridski”, Faculty of Mathematics and Informatics, James Bourchier blvd. No. 5, Sofia, 1164, Bulgaria
2 Medical University of Sofia, University Hospital of Endocrinology, Zdrave street No. 2, Sofia, 1431, Bulgaria
3 Technical University of Sofia, Faculty of Automatics, Kliment Ohridsky blvd. No. 8, Sofia, Bulgaria

Abstract
Common data models (CDMs) offer a standardized approach for data persistence and exchange. This is especially useful nowadays, when clinical data is distributed among heterogeneous data-sharing systems. Besides, OHDSI provides software tools that support each stage of the ETL process and ensure quality control. Therefore, data presented in the CDM possesses all the features of a reliable source for a broad range of statistical analyses. This paper presents initial results of a research work done with the objective to transfer outpatient records from the Bulgarian Diabetes Register into the OMOP CDM. One of the major challenges has been the extraction of clinical data from native-language text, as well as the use of international OMOP concepts to annotate data recorded in a Bulgarian context. The mapping of nationally encoded drug codes was one of the serious obstacles to concept mapping, requiring adaptation of such codes to the corresponding drug codes in the International Classification of Diseases, 9th Revision.

Keywords
eHealth, observational health data, common data model, ETL, data harmonization, electronic health records

Information Systems & Grid Technologies: Fifteenth International Conference ISGT’2022, May 27–28, 2022, Sofia, Bulgaria
EMAIL: az@petko.info (P. Kovachev); eck@fmi.uni-sofia.bg (E. Krastev); dimitardt@gmail.com (D. Tcharaktchiev); emospy@gmail.com (E. Markov); ivan.evgeniev@gmail.com (I. Ivanov)
ORCID: 0000-0001-7509-4636 (P. Kovachev); 0000-0001-8740-5497 (E. Krastev); 0000-0001-5765-840X (D. Tcharaktchiev); 0000-0002-8332-5884 (E. Markov); 0000-0002-0307-1600 (I. Ivanov)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Digital health technologies produce huge amounts of data related to patient health, collected as part of the execution of routine healthcare services under real-world conditions. Data collected from such sources is collectively known as observational data (OD). OD is generated from a number of sources, such as electronic health records. OD is a valuable source of clinical evidence, which can be used to evaluate the safety and effectiveness of medical products in the treatment of socially significant diseases like diabetes or cancer. Moreover, results from the analysis of OD provide evidence in support of clinical decision-making [1]. Therefore, many research groups around the world attempt to integrate OD into a common data model that can serve as a reliable source for analyses of healthcare data [2] [3].

The Observational Health Data Sciences and Informatics [4] [5] (or OHDSI, pronounced “Odyssey”) program is a multi-stakeholder, interdisciplinary collaborative initiative that aims to bring out the value of health data through large-scale analytics. The OHDSI objective is to generate accurate, reproducible, and well-calibrated evidence and to promote better health decisions and better care.
The Observational Medical Outcomes Partnership (OMOP) [6] Common Data Model (CDM) [7] is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence. It is a unified database model that allows integrating various OD sources, including EHRs, based on this standard. The European Health Data and Evidence Network project (EHDEN) [8] under the Innovative Medicines Initiative (IMI) drives the adoption of the OMOP CDM in Europe in close collaboration with OHDSI.

The OHDSI community and the software tools it is using follow the FAIR Data Guiding Principles [9] [10]:
• Findability – Any healthcare database that is mapped to OMOP and used for analytics should persist for future reference and reproducibility. Therefore, data are described with rich metadata, where the metadata explicitly include a globally unique and persistent identifier of the data they describe.
• Accessibility – Accessibility of OMOP-mapped data through an open protocol is typically achieved through the SQL interface. The protocol must provide a procedure for authentication and authorization.
• Interoperability – Data use a formal, accessible, shared, and broadly applicable language for knowledge representation. Additionally, data must be accompanied by vocabularies that follow the FAIR principles, with qualified references to other data.
• Reusability – Metadata and data should be well described so that they can be replicated and/or combined in different settings. Moreover, data must satisfy domain-relevant community standards.

A central component of the OMOP CDM is the OHDSI Standardized Vocabularies (Figure 1). The OMOP Common Data Model allows systematic analysis of disparate observational databases. The concept behind this approach is to transform the data contained within those databases into a common format (data model) as well as a common representation (terminologies, vocabularies, coding schemes), and then perform systematic analyses using a library of standard analytic routines that have been written based on the common format. Routine health databases, based on routine electronic health records (EHRs), differ in purpose, content and design. Common Data Models (CDMs) can enable standardized analysis of disparate data sources simultaneously.

Figure 1: Overview of all tables in the CDM version 6.0. Not all relationships between tables are shown

The CDM contains standardized tables grouped in 16 Clinical Event tables, 10 Vocabulary tables, 2 Metadata tables, 4 Health System Data tables, 2 Health Economics Data tables, 3 Standardized Derived Elements, and 2 Results schema tables.
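To make the role of the Vocabulary tables more concrete, the following minimal sketch shows how a source diagnosis code can be resolved to its standard concept through the concept and concept_relationship tables using the “Maps to” relationship; the connection settings, the cdm53 schema name and the example ICD-10-CM code E11.9 are assumptions for illustration only.

# Minimal sketch: resolve a source code to its standard OMOP concept via the
# "Maps to" relationship in the vocabulary tables. The schema name, connection
# settings and the example ICD-10-CM code are illustrative assumptions.
import psycopg2

SQL = """
SELECT std.concept_id, std.concept_name, std.vocabulary_id
FROM cdm53.concept src
JOIN cdm53.concept_relationship rel
  ON rel.concept_id_1 = src.concept_id AND rel.relationship_id = 'Maps to'
JOIN cdm53.concept std
  ON std.concept_id = rel.concept_id_2
WHERE src.vocabulary_id = 'ICD10CM' AND src.concept_code = %s;
"""

conn = psycopg2.connect(dbname="cdm", user="cdm_reader", password="...", host="localhost")
with conn.cursor() as cur:
    cur.execute(SQL, ("E11.9",))   # E11.9: type 2 diabetes mellitus without complications
    for concept_id, concept_name, vocabulary_id in cur.fetchall():
        print(concept_id, concept_name, vocabulary_id)
conn.close()

Analytic queries against the Clinical Event tables then use the standard concept_id returned by lookups of this kind, while the original source code is kept alongside it.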
The development of the CDM follows these design elements:
• Suitability for purpose – CDM data is organized in a way that is suitable for analysis.
• Data protection – Personal data, such as names, birth dates, home addresses and so on, is stored in an anonymized format.
• Design of domains – The domains are modeled in a person-centric relational data model, where each record refers to a person and the date when the OD is captured.
• Rationale for domains – Domains are identified and separately defined in an Entity-Relationship diagram, where each domain has specific attributes that are not otherwise applicable. All other data can be preserved as an observation in an entity-attribute-value structure.
• Standardized Vocabularies – The CDM relies on the Standardized Vocabularies, such as SNOMED, containing all necessary and appropriate corresponding standard healthcare concepts.
• Reuse of existing vocabularies – Code definitions for drugs, diseases and other entities from national or industry standards and vocabularies are reused or mapped to international coding systems.
• Maintaining source codes – The original source codes are persisted together with their corresponding codes in the Standardized Vocabularies, so that the model loses no information from the OD.
• Technology neutrality – The CDM does not depend on a specific technology. It can be implemented on any relational database, such as MS SQL Server, Oracle, etc.
• Scalability – The CDM is optimized for data processing and computational analysis to accommodate data sources that vary in size, including databases with up to hundreds of millions of persons and billions of clinical observations.
• Backwards compatibility – All changes from previous CDMs are clearly delineated. Older versions of the CDM can be easily created from CDM v5, and no information is lost that was present previously.

There are implicit and explicit conventions that are adopted in the CDM:
• General Conventions of the Model – The CDM is considered a “person-centric” model, meaning that all Clinical Event tables are linked to the PERSON table.
• General Conventions of Schemas – Most of the schemas are considered read-only; the only writable tables are COHORT and COHORT_DEFINITION in the Results schema.
• General Conventions of Data Tables – The CDM is platform independent. Data types are defined generically using ANSI SQL data types (VARCHAR, INTEGER, FLOAT, DATE, DATETIME, and CLOB). Precision is provided only for VARCHAR.
• General Conventions of Domains – Events of different nature are organized into Domains. These Events are stored in tables and fields, which are Domain-specific, and represented by Standard Concepts that are also Domain-specific as defined in the Standardized Vocabularies.

2. Methods

2.1. Environment preparation

The CDM database environment is prepared by installing a PostgreSQL DBMS, a Java JDK and Docker Compose on a Linux workstation. The database setup includes:
a) Creating the required database users;
b) Creating the OMOP CDM tables with the Common Data Model/PostgreSQL DDL scripts (a sketch of this step is given at the end of this subsection); and
c) Importing the standard OMOP vocabularies from athena.ohdsi.org.

All the OHDSI sources are available at github.org/OHDSI. We installed the White Rabbit, Rabbit-In-a-Hat, Usagi [11], Achilles [12] and Broadsea repositories required for the OHDSI web applications, configured the addresses and JDBC URLs and started their respective Docker containers.
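A minimal sketch of step (b) is shown below, assuming that the PostgreSQL DDL script from the OHDSI CommonDataModel repository has been downloaded locally; the file name, the target schema cdm53 and the connection settings are assumptions rather than the exact configuration used in this work.

# Minimal sketch of step (b): create the OMOP CDM tables in PostgreSQL by
# executing the DDL script distributed in the OHDSI CommonDataModel repository.
# File name, target schema and connection settings are assumptions.
import psycopg2

DDL_FILE = "OMOPCDM_postgresql_5.3_ddl.sql"   # assumed local copy of the OHDSI DDL script

conn = psycopg2.connect(dbname="cdm", user="cdm_admin", password="...", host="localhost")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE SCHEMA IF NOT EXISTS cdm53;")   # assumed name of the target CDM schema
    cur.execute("SET search_path TO cdm53;")
    with open(DDL_FILE, encoding="utf-8") as f:
        ddl = f.read()
    # Newer DDL releases parameterize the schema name; for older scripts the
    # replacement below has no effect and the tables land in the search_path.
    cur.execute(ddl.replace("@cdmDatabaseSchema", "cdm53"))
conn.close()

The standard vocabularies downloaded from athena.ohdsi.org in step (c) can then be bulk-loaded into the vocabulary tables, for example with the PostgreSQL COPY command.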
2.2. ETL – Extract, Transform, Load

In order to get from the native/raw data to the OMOP Common Data Model (CDM), we have to create an extract, transform, and load (ETL) process. This process should restructure the data to the CDM and add mappings to the Standardized Vocabularies, with the following steps (Figure 2).

1. Design the ETL – To initiate an ETL process on a database, we need to understand the source data, including the tables, fields, and content. The White Rabbit software from OHDSI is used to perform a scan of the source data. The scan generates a report used as a reference when designing the ETL. With the White Rabbit scan in hand, we have a clear picture of the source data. We also know the full specification of the CDM. Rabbit-In-a-Hat [13] is used in the next step to map the source fields to the target fields in the CDM database. Rabbit-In-a-Hat is designed to read and display a White Rabbit scan document and generates documentation for the ETL process, but it does not generate code to create an ETL.

2. Create the Code Mappings – With the Usagi tool from OHDSI we perform the manual process of creating code mappings from the source codes to standard Vocabulary concepts.

3. Implement the ETL – Once the design and code mappings are completed, the ETL process is implemented with the ETL-CDM Builder. As a result, we have a CDM-compliant database populated with the data from the source database.

4. Quality Control – For the extract, transform, load process, quality control is iterative. The typical pattern is: write logic -> implement logic -> test logic -> fix.

Figure 2: CDM ETL processes and tools

The result of the ETL process is a CDM-compliant database/schema ready to be used for analyses.

2.3. Study execution

The most convenient and precise approach to perform an observational study against a CDM database is to use ATLAS [14], a free, publicly available, web-based tool developed by OHDSI that facilitates the design and execution of analyses on standardized, patient-level observational data in the CDM format. ATLAS is deployed as a web application in combination with the OHDSI WebAPI, is hosted on Apache Tomcat and can be deployed and started as a Docker container or a cloud service. The screenshot of ATLAS in Figure 3 shows the various functionalities provided by ATLAS:
• Data Sources – provides the capability to review descriptive, standardized reporting for each of the configured data sources.
• Vocabulary Search – provides the ability to search and explore the OMOP standardized vocabulary.
• Concept Sets – provides the ability to create collections of logical expressions that can be used to identify a set of concepts to be used throughout your standardized analyses.
• Cohort Definitions – the ability to construct a set of persons who satisfy one or more criteria for a duration of time.
• Characterizations – an analytic capability that allows you to look at one or more cohorts and to summarize characteristics about those patient populations.
• Cohort Pathways – an analytic tool that allows you to look at the sequence of clinical events that occur within one or more populations.
• Incidence Rates – a tool that allows you to estimate the incidence of outcomes within target populations of interest.
• Profiles – a tool that allows exploring an individual patient’s longitudinal observational data to summarize what is going on within a given individual.
• Population Level Estimation – a capability that allows defining a population-level effect estimation study using a comparative cohort design, whereby comparisons between one or more target and comparator cohorts can be explored for a series of outcomes.
• Patient Level Prediction – allows applying machine-learning algorithms to conduct patient-level prediction analyses, whereby you can predict an outcome within any given target exposure.
• Jobs – used to explore the state of processes that are running through the WebAPI.
• Configuration – used to review the data sources that have been configured in the source configuration section.

Figure 3: ATLAS user interface
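Analyses designed in ATLAS are executed through the WebAPI as SQL against the CDM tables. The minimal sketch below illustrates the same idea with a hand-written query that counts persons having a given condition, grouped by gender; the connection settings, the cdm53 schema and the example standard concept_id 201826 (type 2 diabetes mellitus) are assumptions for illustration.

# Minimal sketch of a query run directly against the CDM, comparable in spirit
# to what ATLAS/WebAPI generate. The schema, connection settings and the
# example concept_id (201826, type 2 diabetes mellitus) are assumptions.
import psycopg2

SQL = """
SELECT p.gender_concept_id, COUNT(DISTINCT p.person_id) AS persons
FROM cdm53.condition_occurrence co
JOIN cdm53.person p ON p.person_id = co.person_id
WHERE co.condition_concept_id = %s
GROUP BY p.gender_concept_id;
"""

conn = psycopg2.connect(dbname="cdm", user="cdm_reader", password="...", host="localhost")
with conn.cursor() as cur:
    cur.execute(SQL, (201826,))   # assumed standard concept for type 2 diabetes mellitus
    for gender_concept_id, persons in cur.fetchall():
        print(gender_concept_id, persons)
conn.close()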
3. Results

In Bulgaria, outpatient records are produced by the General Practitioners (GPs) and the Specialists from Ambulatory Care for every contact with the patient. Outpatient records of patients with diabetes are maintained by the Bulgarian Diabetes Register. These records represent a true example of OD that can produce valuable evidence for improving the treatment and the management of healthcare services for such patients. The outpatient records are semi-structured files with a predefined XML schema. The source XML documents included more than 1 600 000 pseudonymized outpatient records. The most important indicators in the records, like Age, Gender, Location and Diagnoses, are stored in explicit tags. The case history is presented as free text in the Anamnesis section. Additionally, these records include, as native-language text, information about the patient status describing the patient’s state, symptoms, syndromes, the patient’s height and weight, body mass index (BMI), blood pressure and other clinical concepts. The values of clinical tests and lab data are also enumerated as free text in a separate section of the XML document. A special section is dedicated to the prescribed treatment.

This paper presents the first results from a research work whose objective is to convert this OD into OMOP CDM version 5.3.0. Here we shortly describe the first stage of the ETL process (Design the ETL) that will be used for mapping the source fields to the targets in the CDM database. Our first task has been to parse data from the available XML documents and store it in a relational database, where the relational model matches the XML schema of the source XML documents. Each one of the XML documents contains a set of outpatient records of patients with diabetes. Therefore, the relational database is referred to as Diabetes2018. Both the source database Diabetes2018 and the target CDM database are MS SQL Server databases. A description of the tables in the source database and the Entity Relationship Diagram of the source database are shown correspondingly in Table 1 and Figure 4.

Table 1
Description of tables in the source database

Source Table | English Name | Description
AmbLists | AmbLists | Outpatient records generated during visits to a GP or a Specialist of Ambulatory Care, coded in ICD9
Diagnoses | Diagnoses | Diagnoses set in the outpatient record
Doctors | Doctors | Doctors data
HospitalizationLists | Hospitalization Lists | List with hospitalization directions related to an outpatient record
HospitalNapravlenie | HospitalNapravlenie | Direction for hospitalization
MDdirections | MDdirections | Direction for medical examination by a Specialist in Ambulatory Care
Patients | Patients | Patient description data
PrimaryVisit | PrimaryVisit | Initial visit to a Specialist in Ambulatory Care
Procedures | Procedures | Procedures assigned in the outpatient record
Profilact | Profilact | Describes a visit for disease prevention
RecipeBooks | RecipeBooks | Recipe Books of the patient
Recipies | Recipies | Recipes for reimbursable medicinal products; these are stored in the patient’s Recipe Book
Reimbursables | Reimbursables | Reimbursable medicinal products
SecondaryVisit | SecondaryVisit | Secondary visit to a Specialist in Ambulatory Care
SIMPConsults | SIMPConsults | Specialized Medical care
VSDSIMPConsults | VSDSIMPConsults | Highly Specialized Medical care
BloodPressure | BloodPressure | Blood pressure values measured in mmHg, extracted from the natural language text description of patient status and examination data
BloodSugar | BloodSugar | Blood sugar profile first value measured in mmol/l, extracted from the natural language text description of patient status and examination data
BMIdata | BMIdata | Body Mass Index data extracted from the natural language text description of patient status and examination data; the table also includes Height and Weight measurements in cm and kg, when Height and Weight data are found in the text
HbAc1Data | HbAc1Data | Values of HbA1c measured in mmol/mol, extracted from the natural language text description of patient status and examination data
DiabetTimeData | DiabetTimeData | Number of years since diabetes was first established, extracted from the natural language text description of patient status and examination data
TrigData | TrigData | Values of triglycerides measured in mmol/L, extracted from the natural language text description of patient status and examination data
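The last six tables above hold values recovered from the free-text patient status. The short sketch below illustrates this kind of extraction; the regular expressions, the Bulgarian abbreviations and the sample status text are assumptions for demonstration and do not reproduce the actual parsers developed in this work.

# Illustrative sketch of extracting blood pressure and BMI values from a
# free-text status description, as done for tables such as BloodPressure and
# BMIdata. Patterns, abbreviations and the sample text are assumptions.
import re

BP_RE  = re.compile(r"(?:RR|АН)\s*[:=]?\s*(\d{2,3})\s*/\s*(\d{2,3})")    # e.g. "RR 130/80"
BMI_RE = re.compile(r"(?:BMI|ИТМ)\s*[:=]?\s*(\d{1,2}(?:[.,]\d{1,2})?)")  # e.g. "ИТМ 28,4"

def extract_status(text):
    """Return systolic/diastolic blood pressure (mmHg) and BMI found in the text, if any."""
    result = {}
    bp = BP_RE.search(text)
    if bp:
        result["systolic"], result["diastolic"] = int(bp.group(1)), int(bp.group(2))
    bmi = BMI_RE.search(text)
    if bmi:
        result["bmi"] = float(bmi.group(1).replace(",", "."))
    return result

sample = "Общо състояние добро. RR 130/80 mmHg, ИТМ 28,4."   # assumed sample status text
print(extract_status(sample))   # {'systolic': 130, 'diastolic': 80, 'bmi': 28.4}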
Figure 4: Entity Relationship Diagram of the source database Diabetes2018

Thus, clinical data for more than 502 000 patients with diabetes have been loaded into Diabetes2018. In order to execute this task, we had to write Java programs and Python scripts that parse the XML documents and extract clinical data such as blood pressure, glucose, height, weight, body mass index and many other measurements recorded in the source XML documents as natural language text. Next, we used the Rabbit-In-A-Hat tool to define the mapping of the data (Figure 5).

Figure 5: Source data from DIAB2018 to CDMV5.3.0 database mapping

The mapping rules identified in Figure 5 allow proceeding with the extract, transform, and load stages of the ETL process using SQL queries implemented in Microsoft SQL Server 2019. For completeness, with the help of White Rabbit we generated a data dictionary for all the tables and fields that have been profiled. This data dictionary includes the English translation of the local names of the fields and their description (Table 1). All the tasks at this stage have been accompanied by continuous and tedious quality control, so that the CDM database has all the attributes of reliable evidence, namely, being repeatable, reproducible, replicable, generalizable, robust and calibrated [15].

4. Conclusion

Common data models (CDMs) offer a standardized approach for data persistence and exchange. This is especially useful nowadays, when clinical data is distributed among heterogeneous data-sharing systems. Besides, OHDSI provides software tools that support each stage of the ETL process and ensure quality control. Therefore, data in the CDM possesses all the features of a reliable source for a broad range of statistical analyses. This paper presents initial results of a research work done with the objective to transfer outpatient records from the Bulgarian Diabetes Register into the OMOP CDM. One of the major challenges has been the extraction of clinical data from native-language text, as well as the use of international OMOP concepts to annotate data recorded in a Bulgarian context.
The mapping of nationally encoded drug codes was one of the serious obstacles to concept mapping, requiring adaptation of such codes to the corresponding drug codes in the International Classification of Diseases, 9th Revision.

5. Acknowledgements

The presentation of this paper is supported by the National Scientific Program “Electronic Healthcare in Bulgaria” (eHealth).

6. References

[1] Y. Jeon, Y. Choi, E. Kim, S. Oh and H. Lee, “Common data model-based real-world data for practical clinical practice guidelines,” Transl Clin Pharmacol., vol. 28, no. 2, pp. 67–72, 2020.
[2] A. Lamer, N. Depas, M. Doutreligne, A. Parrot, D. Verloop, M. Defebvre, G. Ficheur, E. Chazard and J. Beuscart, “Transforming French Electronic Health Records into the Observational Medical Outcome Partnership’s Common Data Model: A Feasibility Study,” Appl Clin Inform., vol. 11, no. 1, pp. 13–22, 2020.
[3] B. Ryu, E. Yoon, S. Kim, S. Lee, H. Baek, S. Yi, H. Na, J. Kim, R. Baek, H. Hwang and S. Yoo, “Transformation of Pathology Reports Into the Common Data Model With Oncology Module: Use Case for Colon Cancer,” J Med Internet Res., vol. 22, no. 12, e18526, 2020.
[4] OHDSI, “Observational Health Data Sciences and Informatics – OHDSI,” www.ohdsi.org, 2022. [Online]. Available: https://www.ohdsi.org/web/wiki/doku.php?id=welcome. [Accessed 10 April 2022].
[5] Observational Health Data Sciences and Informatics, The Book of OHDSI, https://ohdsi.github.io/TheBookOfOhdsi, 2021.
[6] OHDSI CDM Working Group, “Welcome to OMOP,” OHDSI, 2022. [Online]. Available: https://ohdsi.org/omop. [Accessed 2 April 2022].
[7] Observational Health Data Sciences and Informatics, “OMOP Common Data Model,” ohdsi.org, 2022. [Online]. Available: https://ohdsi.github.io/CommonDataModel. [Accessed 10 April 2022].
[8] EHDEN, “The European Health Data & Evidence Network,” EHDEN, 2022. [Online]. Available: https://www.ehden.eu. [Accessed 10 April 2022].
[9] GO FAIR, “FAIR Principles,” 2016. [Online]. Available: https://www.go-fair.org/fair-principles. [Accessed 10 April 2022].
[10] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Scientific Data, vol. 3, 160018, 2016.
[11] Observational Health Data Sciences and Informatics, “Software Tools,” ohdsi.org, 2022. [Online]. Available: https://www.ohdsi.org/software-tools. [Accessed 10 April 2022].
[12] Observational Health Data Sciences and Informatics, “ACHILLES for data characterization,” ohdsi.org, 2022. [Online]. Available: https://www.ohdsi.org/analytic-tools/achilles-for-data-characterization. [Accessed 10 April 2022].
[13] Observational Health Data Sciences and Informatics, “Rabbit-In-a-Hat,” 15 February 2022. [Online]. Available: http://ohdsi.github.io/WhiteRabbit/RabbitInAHat.html. [Accessed 10 April 2022].
[14] OHDSI, “ATLAS,” 2022. [Online]. Available: https://github.com/OHDSI/Atlas/wiki. [Accessed 10 April 2022].
[15] C. Blacketer, M. Kallfelz and P. Rijnbeek, “WP5 – Data Workflow Implementation & Service Deployment. D5.2 Report on Quality Assurance and Control Procedures,” EHDEN, 2020.