=Paper= {{Paper |id=Vol-2022/paper38 |storemode=property |title= Integrating Data Analysis Tools for Better Treatment of Diabetic Patients |pdfUrl=https://ceur-ws.org/Vol-2022/paper38.pdf |volume=Vol-2022 |authors=Svetla Boytcheva,Galia Angelova,Zhivko Angelov,Dimitar Tcharaktchiev |dblpUrl=https://dblp.org/rec/conf/rcdl/BoytchevaAAT17a }} == Integrating Data Analysis Tools for Better Treatment of Diabetic Patients == https://ceur-ws.org/Vol-2022/paper38.pdf
      Integrating Data Analysis Tools for Better Treatment of
                        Diabetic Patients

            Svetla Boytcheva1, Galia Angelova1, Zhivko Angelov2, Dimitar Tcharaktchiev3
   1
     Institute of Information and Communication Technologies, Bulgarian Academy of Sciences,
                                           Sofia, Bulgaria
                                     2
                                       Adiss Lab Ltd., Sofia, Bulgaria
         3
           Medical University Sofia, University Specialized Hospital for Active Treatment of
                                   Endocrinology, Sofia, Bulgaria
 svetla.boytcheva@gmail.com, galia@lml.bas.bg, angelov@adiss-bg.com, dimitardt@gmail.com
            Abstract. This paper presents the construction and usage of an anonymous Diabetes Register for
     patients in Bulgaria. The Register is generated automatically from outpatient records submitted to the
     Bulgarian National Health Insurance Fund in 2010-2014 and continuously updated using outpatient records
     for 2015-2016. The construction relies on advanced automatic analysis of free text information as well as on
     Business Analytics technologies for storing, maintaining, searching, querying and analyzing data. Original
     frequent pattern mining algorithms find patterns and sequences while taking temporal information into
     account. The paper discusses the software environment as well as experiments in frequent pattern mining
     that enable knowledge discovery in the very large repository underlying the Register (currently 262 million
     pseudonymized outpatient records submitted to the Bulgarian National Health Insurance Fund in 2010-2016
     for more than 5 mln citizens yearly). The claim is that the synergy of modern analytics tools transforms a
     static archive of clinical patient records into a sophisticated software environment for knowledge discovery and
     prediction.
            Keywords: synergy of data management, data mining and text mining tools; clinical data; frequent
     pattern mining; data analytics; natural language processing; knowledge discovery

                                                                         In this paper we present the integration of various
 1 Introduction                                                      ICT tools for automatic generation of a Diabetes Register
                                                                     for Bulgarian patients. The huge amount of clinical data,
 Medicine is known as a Data Intensive Domain: due to                underpinning a repository of Outpatient Records (ORs),
 the recent penetration of the Information and                       enabled us to construct interfaces that support both
 Communication Technologies (ICT) in all areas of our                monitoring functionalities (oriented to the health
 society, a rapidly increasing amount of medical data is             management authorities) and research-oriented
 produced by the healthcare sector, on the one hand, and             functionalities for knowledge discovery. The monitoring
 by biomedical research on the other hand. In the                    functionalities are based on business analytics while
 healthcare sector, ICT applications support health                  research tools use data mining and pattern search. The
 diagnostics, development and maintenance of medical                 software environment also includes components for
 Electronic Health Records, telemedicine and telecare,               automatic analysis of free texts in Bulgarian. These
 patient administration, almost all aspects of healthcare            components facilitate the Register generation and its
 management and healthcare delivery as well as medical               update because they deliver values of clinical tests and
 education and training. In biomedical research, progress            lab data which are described as unstructured text only.
 in deeper understanding of medical phenomena is sought                  This paper is structured as follows. Section 2
 by construction of big data models: e.g. virtual                    overviews related work in several areas that are relevant
 physiological human, models of brain, in computational              to the subject: Diabetes registers, Natural Language
 genetics and so on. Public access to health information is          Processing (NLP) for clinical narratives, Business
 changing the relationship between the patients and the              Intelligence (BI) and analytics, Frequent Pattern Mining
 health institutions that are responsible for care delivery.         (FPM). Section 3 presents the experimental study context
 The monitoring and control function of patient                      and summarizes the developments during the last 3-4
 organizations is facilitated by the modern ICT tools as             years (because the Register was built iteratively). Section
 well. Today we are still in an early phase of a long-term           4 presents relevant achievements in automatic analysis of
 technological and social shift that will be implied by              clinical narratives in Bulgarian language. Section 5
 advancing further the ICT fundamentals and tools.                   discusses recent algorithms for frequent pattern mining
                                                                     and presents experiments related to knowledge discovery
Proceedings of the XIX International Conference                      in the Register repository. Section 6 contains the
“Data Analytics and Management in Data Intensive                     conclusion and plans for future work.
Domains” (DAMDID/RCDL’2017), Moscow, Russia,
October 10-13, 2017



2 Related Work                                                      sensitivity of the recognition. Despite the NLP
                                                                    limitations, the conclusion is that NLP engines are
2.1 Diabetic Registers                                              powerful components ready for integration in medical
                                                                    data mining and – due to improvements expected in the
There are several nation-wide Diabetes Registers in the
                                                                    future, e.g. more accurate mappings of terms to medical
world, e.g. in Denmark [1], Sweden [2], Norway [3].
                                                                    concepts – the importance of NLP as a valuable
Registers explicate the number of patients who are
                                                                    supporting technology will grow.
diagnosed with Diabetes and provide good monitoring
                                                                        Here we consider NLP for Bulgarian clinical text. No
and control. Constructing registers is expensive and
                                                                    comprehensive resources exist for Bulgarian medical
burdens the patients as well as the medical experts with
                                                                    language; the International Classification of Diseases
additional administrative work. Furthermore, in some
                                                                    ICD-10 is the only terminological resource which is
countries chronic disease management is not recognized
                                                                    available in electronic format. Our experience shows that
as a part of general medical practice. As for the
                                                                    within 2-3 years one can achieve good performance in
construction, most medical experts agree that Registers
                                                                    separate extraction tasks. We apply software prototypes
are a must since Diabetes is a chronic disease with
                                                                    developed some years ago that are gradually improved.
significant social consequences. Electronic patient
                                                                    The most useful tools are a drug extractor (it finds in the
registration systems are proposed like the one in Ireland
                                                                    free text the drug name, dosage, frequency and route of
[4] (but it is not implemented yet). It is interesting to
                                                                    administration [13]) as well as an extractor of numeric values
mention that in Sweden, during the Diabetic Register
                                                                    of lab data and clinical tests [14].
development phase 2001-2005, the registration rate of
patients gradually increased and reached 75% which in               2.3 Big Data, Business Intelligence Tools
2010 still remains stable and is one of the highest in the
country [5]. Thus infrastructure construction is a critical         Big Data usually designates a massive volume of
issue but data collection and update are further problems           structured and unstructured data, too large or too
that can be solved only by persistency and diligence.               dynamic to be processed by traditional software tools and
                                                                    techniques. The popular "3Vs" features of Big Data were
2.2 NLP of Clinical Narratives                                      first introduced by Gartner (previously META group):
                                                                    "high Volume, high Velocity, and/or high Variety" [15].
Usually automatic analysis of clinical narratives is
                                                                    Wikipedia is an example of big data consisting of
implemented partially: only fragments of the text are
                                                                    unstructured texts, images and hyperlinks. Big data
considered. The phrases, selected as “interesting”, are
                                                                    analytics is the process of collecting, organizing and
typically picked up due to the presence of a word or an
                                                                    analyzing big data to discover useful information.
entity which are considered “significant”. This approach
                                                                    Business Intelligence tools analyze big data of
for shallow analysis is called “Information Extraction”
                                                                    enterprises in order to provide historic, present and
(IE). IE from clinical texts has matured only recently, but its
                                                                    predicted views to the business processes. Predictive
accuracy gradually improves and often exceeds 90% [6].
                                                                    analytics for establishment of trends is the preferred
The review [6] stated in 2008 that “current applications
                                                                    functionality in contrast to databases that deal with data
are rarely applied outside of the laboratories they have
                                                                    items and extract subsets of data values. Visualization is
been developed in, mostly because of scalability and
                                                                    an important feature of BI tools because they show
generalizability issues”. Today, however, this is valid for
                                                                    generalizations and tendencies in one screen [16].
languages other than English because, with the active
                                                                    Another necessary feature is the speed of processing
contribution of numerous research groups in the USA,
                                                                    since big data often appear in real time.
NLP for English clinical narratives has much better
                                                                        In our project we use a BITool which stores data in
performance at present. Comprehensive language
                                                                    n-dimensional cubes and explores multi-dimensional
resources exist for English, such as UMLS [7] as well as
                                                                    data i.e. hyperplanes [17]. The user can split the dataset
tools like KnowledgeMap Concept Identifier [8] which
                                                                    into groups of objects with similar features. If temporal
processes clinical notes and returns CUIs (Concept
                                                                    dimension is included the user can track changes of
Unique Identifiers) for the recognized UMLS terms.
                                                                    object characteristics over time by animation. BITool
Another important tool is the public NegEx system
                                                                    enables the discovery of similar situations over time
which identifies and interprets negations in English
                                                                    when a search pattern is specified for a particular period.
clinical texts [9, 10]. We also mention the open-source
cTAKES1 (clinical Text Analysis and Knowledge                       2.4 Frequent Pattern Mining
Extraction System) and the Health Information Text
Extraction (HITEx) system [11]. A recent study [12]                 There are two principal tasks in pattern search: frequent
enumerates the advantages of incorporating NLP for                  pattern mining (FPM) where the events (objects) are
English in medical systems: it systematically links                 considered as unordered sets, and frequent sequence
several terms to a concept using databases that                     mining (FSM). Approaches for solving the FPM task
standardize health terminologies; avoids manual work                vary from the naïve BruteForce and Apriori algorithms,
for searching term variations; increases the number of              where the search space is organized as a prefix tree, to
patients in the considered cohorts and thus increases the           the Eclat algorithm, which uses tidsets directly for support

1
    http://ctakes.apache.org/




computation by processing prefix equivalence classes                they are extracted automatically from four XML fields:
[18]. Most FPM and FSM methods do not consider                      (i) Anamnesis: summarizes case history, previous
contextual information about extracted patterns. They               treatments, often family history, risk factors; (ii) Status:
usually build a (huge) prefix tree. Most FPM algorithms             summary of patient state, height, weight, BMI, blood
generate all possible frequent patterns (FPs).                      pressure etc.; (iii) Clinical tests: values of clinical
Summarized information for data relations can be                    examinations and lab data listed in arbitrary order; (iv)
extracted as maximal frequent itemsets (MFI) in order to            Prescribed treatment: codes of drugs reimbursed by
reduce redundancy and decrease significantly the                    NHIF, free text descriptions of other drugs. Integration
number of FPs for post-analysis. All classic algorithms             of large scale text analysis is a real novelty in this field.
for FPM can be modified for MFI search.
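The tidset-based search and the reduction to maximal frequent itemsets described above can be sketched as follows. This is a minimal illustration of the general Eclat idea, not the algorithm of [19]; the function names, the toy transactions and the threshold are assumptions made for the example, and support is taken to be an absolute count.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def vertical(db: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Turn {transaction id: items} into the vertical form {item: tidset}."""
    tidsets: Dict[str, Set[str]] = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(db: Dict[str, Set[str]], minsup: int) -> Dict[FrozenSet[str], int]:
    """Return {frequent itemset: support}, computing support by tidset intersection."""
    frequent: Dict[FrozenSet[str], int] = {}

    def extend(prefix_class: List[Tuple[Tuple[str, ...], Set[str]]]) -> None:
        # All itemsets in prefix_class share the same prefix (one equivalence class).
        for i, (items, tids) in enumerate(prefix_class):
            frequent[frozenset(items)] = len(tids)
            next_class = []
            for other_items, other_tids in prefix_class[i + 1:]:
                common = tids & other_tids  # support of the joined itemset
                if len(common) >= minsup:
                    next_class.append((items + (other_items[-1],), common))
            if next_class:
                extend(next_class)  # depth-first traversal

    singles = [((item,), tids)
               for item, tids in sorted(vertical(db).items())
               if len(tids) >= minsup]
    extend(singles)
    return frequent


def maximal(frequent: Dict[FrozenSet[str], int]) -> List[FrozenSet[str]]:
    """Keep only frequent itemsets that have no frequent proper superset."""
    return [x for x in frequent if not any(x < y for y in frequent)]


# Toy data (hypothetical): each transaction is a set of ICD-10 codes.
toy_db = {
    "t1": {"E11", "I10", "E78"},
    "t2": {"E11", "I10"},
    "t3": {"E11", "E78"},
    "t4": {"I10", "E78", "E11"},
    "t5": {"I10"},
}
toy_frequent = eclat(toy_db, minsup=3)
toy_maximal = maximal(toy_frequent)
```

With this toy input, five itemsets are frequent but only two of them are maximal, which illustrates how MFI search reduces the number of patterns left for post-analysis.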
    We have proposed a novel algorithm for mining sets              3.2 Analytics Using BITool
of events in order to identify strong co-occurrence of              Today the system BITool supports the Diabetes Register
patterns [19]. It is a cascade data mining approach for             at the University Specialized Hospital for Active
FPM enriched with context information which aims at                 Treatment of Endocrinology ″Acad. Ivan Penchev″,
the discovery of complex relations between medical                  Medical University – Sofia (this Hospital was authorized
events with respective timestamps. Experiments with                 by the Bulgarian Ministry of Health to host the Register
this approach are presented in Section 5 to illustrate the          of diabetic patients in Bulgaria). BITool’s functionalities
functionality of the Diabetes Register as a research tool.          enable the monitoring of significant indicators like
                                                                    glycated hemoglobin (HbA1c) and blood glucose values.
3 Experimental Study                                                In this way the Register achieves its objective: to provide
                                                                    an adequate monitoring strategy for diabetic patients and
3.1 Principal Objective                                             to improve the healthcare and quality of life for the
A pseudonymized Register of diabetic patients was                   patients and their families. Two examples illustrate the
generated in 2015 from the Outpatient Records, collected            services. Figure 1 shows the number of diabetic patients
by the Bulgarian National Health Insurance Fund                     in the dimensions age-gender (at a certain moment). Here
(NHIF), in compliance with all legal requirements for               BITool operates on the structured information from the
safety and data protection [20]. The usual patient                  NHIF archive: patient pseudonym, age and gender.
registration process was kept without burdening the                 Further statistics of this kind might concern explorations
medical experts with additional paper work. NHIF is the             of diabetic patients per region code, types of diabetes and
only obligatory Insurance Fund in Bulgaria so we note               diabetes complications, per GPs, per types of medication,
that working with ORs ensures 100% registration of all              according to frequency of visits etc.
patients who contacted the healthcare system at all
(however there are Bulgarian citizens who are not
insured and some others who have ORs but are not
properly diagnosed with Diabetes). The data repository,
underpinning the Register, currently contains more than
262 mln pseudonymised ORs submitted to the NHIF in
2010-2016 for more than 7.3 mln Bulgarian citizens
(more than 5 mln yearly), including 483,836 diabetic
patients. In Bulgaria ORs are produced by General
Practitioners (GPs) and Specialists from Ambulatory
Care whenever they contact patients. Despite the primary
accounting purpose ORs summarize sufficiently the case
                                                                    Figure 1 Number of diabetic patients grouped by age
and motivate the requested reimbursement. They are
semi-structured files with predefined XML-format.
Many indicators in the Register copy the structured data
submitted to NHIF in ORs: (i) date and time of the visit;
(ii) pseudonymized personal data, age, gender; (iii)
pseudonymised visit-related information; (iv) diagnoses
in ICD-10; (v) NHIF drug codes for medications that are
reimbursed; (vi) a code if the patient needs special
monitoring; (vii) a code concerning the need for
hospitalization; (viii) several codes for planned
consultations, lab tests and medical imaging.
    ORs contain also important values presented in free
text fields: glycated haemoglobin (HbA1c), body mass
index (BMI), weight, blood glucose and blood pressure               Figure 2 Reduction of HbA1c levels after application
etc. These values are essential for a Diabetic Register so          of incretin2 based drugs


2
    https://www.drugs.com/drug-class/incretin-mimetics.html




    Figure 2 explores the tendency in the development of            5 Research in Frequent Pattern Mining
treatment. It displays the number of patients who had
changes in the HbA1c levels within the interval [-5,5]              5.1 Contextual Information
units for a certain period of time. For most patients the
                                                                    Most FSM and FPM approaches do not use contextual
HbA1c level decreased by 1 unit. The HbA1c levels are
                                                                    information about extracted patterns. These algorithms
extracted from the free text of ORs for the corresponding
                                                                    extract general templates but do not answer the major
patients with timestamp.
                                                                    question whether they are influenced in some way by the
    Finally we show the Register interface during the
                                                                    context and whether they are valid in various aspects.
process of exploring the collection of ORs (Figure 3).
                                                                    Existing methods which search for patterns using
The names and personal identifiers of patients and GPs
                                                                    contextual information are based on attributes that are
are replaced by pseudonyms; only the name of the
                                                                    organized into hierarchical structures and on attributes’
city/village remains in the address field.
                                                                    generalizations and specializations.
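The distribution of HbA1c changes shown in Figure 2 could be computed along the following lines. This is only a sketch: the paper does not state how the change is derived, so the code assumes it is the last extracted value minus the first one per patient, rounded to whole units; the function name and the sample values are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple


def hba1c_change_histogram(
    measurements: Dict[str, List[Tuple[date, float]]],
    low: int = -5,
    high: int = 5,
) -> Counter:
    """Count patients per rounded HbA1c change within [low, high]."""
    histogram: Counter = Counter()
    for pseudonym, series in measurements.items():
        if len(series) < 2:
            continue  # a change needs at least two timestamped values
        ordered = sorted(series)  # tuples sort by date first
        # Assumption: change = last value minus first value in the period.
        change = round(ordered[-1][1] - ordered[0][1])
        if low <= change <= high:
            histogram[change] += 1
    return histogram


# Hypothetical values keyed by patient pseudonym.
sample = {
    "p1": [(date(2014, 1, 10), 8.4), (date(2014, 7, 15), 7.3)],
    "p2": [(date(2014, 9, 1), 8.1), (date(2014, 2, 1), 9.0)],
    "p3": [(date(2014, 3, 5), 7.0), (date(2014, 10, 5), 7.2)],
    "p4": [(date(2014, 5, 20), 6.9)],
}
sample_histogram = hba1c_change_histogram(sample)
```

In this sample two patients fall into the bin for a decrease of 1 unit and one patient into the bin for no change; the patient with a single measurement is skipped.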
                                                                        Context information is organized as attributes of
                                                                    itemsets and tidsets. Attributes may have different
                                                                    organization: structured or unstructured. This makes it
                                                                    possible to explore context-dependent templates. Rabatel et al.
                                                                    [22] propose an approach in the marketing domain taking
                                                                    into account not only the transactions that have been
                                                                    made but also various attributes associated with
                                                                    customers like age, gender etc. Attributes have a
                                                                    hierarchical structure (𝐻(𝐴𝑔𝑒), 𝐻(𝐺𝑒𝑛𝑑𝑒𝑟)), and patterns
                                                                    are explored at different levels of attribute
                                                                    abstraction in the lattice 𝐻 (Figure 4). Traditional methods
Figure 3 Exploring outpatient records in the Register               consider only the top level [∗,∗] - for any age and
                                                                    regardless of gender, i.e. without attributes. Rabatel et al.
4 NLP for Bulgarian Clinical Narratives                             designed the algorithm Gespan and made experiments
                                                                    with about 100,000 product descriptions from
Design and implementation of software for automatic                 amazon.com.
extraction of patient-related entities from a Big Data
collection is a quite challenging task. One needs to scale                                                                      [*,*]       H
up existing research prototypes to process millions of               H(age)       *
                                                                                            H(gender)
patient records, coping with noisy and missing data, and                  young       old
                                                                                                                [y,*]   [o,*]       [*,m]       [*,f]
                                                                                                   *
still providing reliable results. Some numeric entities                    (y)        (o)


refer to key risk factors for development of Diabetes                                       male
                                                                                            (m)
                                                                                                       female
                                                                                                         (f)
                                                                                                                [y,m]   [y,f]       [o,m]       [o,f]
Mellitus (levels of glycated hemoglobin HbA1c and
blood glucose) and cardio-vascular diseases (high blood             Figure 4 Structuring attributes in marketing domain
pressure). Unfortunately in the Bulgarian clinical                       Ziembiński [23] proposes a new approach for
practice these values are usually documented in free text           extracting small contextual models from smaller
paragraphs, presented in a huge variety of formats, so              collections of data that later are summarized in
their automatic identification is difficult. We note that           generalized models using information from contextual
according to some studies, today more than 80% of the               models with common information. This approach applies
patient-related clinical information is stored as free text         a metric for measuring the distance between context models. All
in the Electronic Health Record systems.                            values for similarity assessment are normalized in the
     In [21] we proposed a hybrid method for automatic              range between 0.0 and 1.0. Attribute values are
generation of grammar rules for IE from clinical data.              considered identical if the similarity function returns 1.0.
The experiments were made and evaluated over                        In the opposite case the result is 0.0. This approach
approximately 9.5 million ORs. Here we cite only the                allows extracting patterns for data that would otherwise
evaluation of blood pressure extraction from the ORs of             have to be dropped out of the templates because of its
about 1,800,000 patients with arterial hypertension for a           dispersion and low frequency.
3-year period: all available values are about 38.3 million
and the extraction was performed with precision 92%                 5.2 Experimental Setup
and recall 98%. The variety of recording formats and
explanations written by thousands of medical                        We apply a retrospective analysis for patients from the
professionals require constant evaluation of grammar                Diabetes Register with Diabetes Type 2. The period of
coverage and extraction accuracy in general. Some of the            interest is the two years preceding the onset of Diabetes
main advantages of the proposed method, beyond its                  Type 2, i.e. the so-called prediabetes condition. In order
reliable performance and good precision in text mining,             to illustrate the potential of contextualized FPM we
are the modularity, extensibility, and scalability.                 present results in searching comorbidities for patients in
                                                                    the prediabetes condition. Text mining modules are used to
                                                                    convert raw text descriptions to structured event data.
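The conversion of free-text descriptions into structured values, and the precision and recall figures used to evaluate it, can be illustrated with a small sketch. It is a hand-written regular expression, not the automatically generated grammar rules of [21]; the "RR"/"BP" prefixes, the plausibility limits and the sample sentence are assumptions chosen for illustration, and real ORs are written in Bulgarian in many more formats.

```python
import re
from typing import List, Set, Tuple

# Hypothetical recording formats such as "RR 130/85" or "RR: 125 / 80".
BP_PATTERN = re.compile(
    r"\b(?:RR|BP)\s*[:=]?\s*(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)


def extract_blood_pressure(text: str) -> List[Tuple[int, int]]:
    """Return (systolic, diastolic) pairs found in a free-text field."""
    readings = []
    for systolic, diastolic in BP_PATTERN.findall(text):
        s, d = int(systolic), int(diastolic)
        # Assumed plausibility limits to reject other slash-separated numbers.
        if 50 <= s <= 300 and 30 <= d <= 200 and s > d:
            readings.append((s, d))
    return readings


def precision_recall(extracted: Set, gold: Set) -> Tuple[float, float]:
    """Precision = correct / extracted; recall = correct / all gold values."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


sample_text = "Status: RR 130/85 mmHg, weight 82 kg. Control RR: 125 / 80"
sample_readings = extract_blood_pressure(sample_text)
```

Evaluating the extractor against manually annotated values with `precision_recall` yields the two measures reported above for blood pressure extraction.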




[Figure 5 diagram, top to bottom: Outpatient Records;                          $E(p_i)$. Let $D \subseteq P \times 2^I$ be the set of all itemsets in our
Preprocessing (Text Mining, Structured Information                             collection after projection $\pi$ in the format
Processing, Data Modeling); Data Analysis (Structured                          $\langle pid, itemset \rangle$. We shall call $D$ a database. We are
Information Processing, Association Rules Generation,                          looking for itemsets $X \subseteq I$ with frequency ($\mathrm{sup}(X)$)
Context Information Processing); Prediction & Prevention                       above a given $minsup$. Let $\mathcal{F}$ denote the set of all frequent
Models (Comorbidity Analysis, Risk Factors)]                                   itemsets, i.e. $\mathcal{F} = \{X \mid X \subseteq I \text{ and } \mathrm{sup}(X) \ge minsup\}$. A
                                                                               frequent itemset $X \in \mathcal{F}$ is called maximal if it has no
                                                                               frequent supersets. Let $\mathcal{M}$ denote the set of all maximal
                                                                               frequent itemsets, i.e. $\mathcal{M} = \{X \mid X \in \mathcal{F} \text{ and } \nexists\, Y \in \mathcal{F} \text{ such that } X \subset Y\}$.
                                                                               Let $2^X$ denote the power set (set of all subsets) of itemset $X$.
                                                                               Then each subset of $X \in \mathcal{F}$ is also a frequent itemset,
                                                                               i.e. $Y \in 2^X$ implies that $Y \in \mathcal{F}$.
                                                                               For each item $id \in I$ we define the set called pidset:
                                                                               $p(id) = \{p_i \mid \langle p_i, I(p_i) \rangle \in D \text{ and } id \in I(p_i)\}$.
                                                                                   To study the nature of comorbidities we need to
                                                                               investigate the context in which they occur. Therefore we
                                                                               add some semantic attributes to each event:
                                                                               demographics of patients, age and gender, treatment,
                                                                               status, lab data etc.
                                                                                   We define a set of attributes of interest
                                                                               $A = \{a_1, a_2, \ldots, a_k\}$. Context $Q$ for some patient $p_i \in P$ is
                                                                               defined as the set of attribute-value pairs from patient
Figure 5 System Architecture                                                   profile information:
                                                                                   $Q(p_i) = \{\langle a_1, q_1 \rangle, \langle a_2, q_2 \rangle, \ldots, \langle a_k, q_k \rangle\}$.
    The search space is very large: the database is big, the                   In order to decrease the number of possible values of
number of diseases is also large. We propose a tabular                         attributes we apply some aggregation of data. For
method using a vertical database, depth-first traversal as                     instance age value is categorized according to the World
well as set intersection and diffsets [19]. Further                            Health Organization (WHO) standard age groups. Data
processing of the maximal frequent itemsets (MFI) is                           for body mass index (BMI) are also categorized
applied to remove diagnostic-related groups. In addition                       according to the WHO4 standard classification -
some context information is added to each MFI to                               underweight, normal weight, overweight, obesity.
investigate comorbidities. Furthermore association rules                           For some data concerning demographic information,
with lift are generated. The context information is                            like region ID we have large number of distinct values.
represented as attribute-value tuples for each patient; the                    For such data we add also some additional properties
post-processing identifies the importance of different                         concerning background information for the region – e.g.
attributes for each MFI.                                                       whether it is south, north, west, east, central, northwest
    The architecture of the experimental workbench is                          etc., and mountain, river, sea, thermal spring, urban
shown on Figure 5. Our research [19] aims to develop                           region etc. For status and clinical test data we take the
further the ideas of the two contextual approaches for                         worst value for the period, according to the risk factors
data mining [22, 23].                                                          deinition.
    For the collection S of ORs we extract the set of all                          In primary interest for Diabetes Type 2 are BMI,
different patient identifiers 𝑃 = {𝑝1 , 𝑝2 , … , 𝑝𝑁 }. This set                glycated haemoglobin, blood pressure (RR – Riva Roci),
corresponds to transaction identifiers (tids) and we call                      blood glucose, HDL-cholesterol.
them pids (patient identifiers). We consider each patient                          From 𝑄(𝑝𝑖 ) we generate a feature vector 𝑣(𝑝𝑖 ) =
visit to a doctor as a single event. For each patient 𝑝𝑖 ∈ 𝑃                   (𝑣1𝑖 , 𝑣2𝑖 , … , 𝑣𝑚𝑖 ), where each attribute 𝑎𝑗 ∈ 𝐴 with 𝑁𝑗
an event sequence of tuples 〈𝑒𝑣𝑒𝑛𝑡, 𝑡𝑖𝑚𝑒𝑠𝑡𝑎𝑚𝑝〉 is                              possible values is represented by 𝑁𝑗 consecutive
generated: 𝐸(𝑝𝑖 ) = (〈𝑒1 , 𝑡1 〉, 〈𝑒2 , 𝑡2 〉, … , 〈𝑒𝑘𝑖 , 𝑡𝑘𝑖 〉), 𝑖 =            positions in the vector. For the set of maximal frequent
̅̅̅̅̅
1, 𝑁. Let ℰ be the set of all possible events and 𝒯 be the                     itemset ℳ with cardinality |ℳ| = K we have K classes
set of all possible timestamps. Let 𝐼 = {𝑖𝑑1 , 𝑖𝑑2 , … , 𝑖𝑑𝑝 }                 of comorbidities. We apply classification of multiple
be the set of all diseases ICD-103 codes, which we call                        classes in order to generate rules for each comorbidity
items. Each subset 𝑋 ⊆ 𝐼 is called an itemset. We define                       class. We use large scale multi class classification
a projection function 𝜋: (ℰ × 𝒯)𝑁 → 2𝐼 : 𝜋(𝐸(𝑝𝑖 )) =                           because we deal with a big database and a large group of
 𝐼(𝑝𝑖 ) = (𝑖𝑑1i , 𝑖𝑑2i , … , 𝑖𝑑𝑚𝑖 ), such that for each patient                comorbidity classes. We use Support Vector Machines
                                                                               (SVM) and optimization based on block minimization
𝑝𝑖 ∈ 𝑃 the projected time sequence contains only the
                                                                               method described by Yu et al. [24].
first occurrence (onset) of each disorder recorded in


³ International Classification of Diseases and Related Health Problems 10th Revision. http://apps.who.int/classifications/icd10/browse/2015/en
⁴ WHO, BMI Classification http://apps.who.int/bmi/index.jsp?introPage=intro_3.html
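
The notions of onset projection, pidset and maximal frequent itemset used in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a toy event log, searches depth-first over intersected pidsets, and omits the diffset optimisation of [19]. All function names and the sample data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Event = Tuple[str, int]  # (ICD-10 code, timestamp); toy representation


def project_first_onsets(events: List[Event]) -> List[str]:
    """Projection: keep only the first occurrence (onset) of each code, in time order."""
    seen: Set[str] = set()
    onsets: List[str] = []
    for code, _ts in sorted(events, key=lambda e: e[1]):
        if code not in seen:
            seen.add(code)
            onsets.append(code)
    return onsets


def build_pidsets(itemsets: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Vertical database: p(id) is the set of pids whose itemset contains id."""
    pidsets: Dict[str, Set[int]] = {}
    for pid, items in itemsets.items():
        for item in items:
            pidsets.setdefault(item, set()).add(pid)
    return pidsets


def maximal_frequent_itemsets(
    pidsets: Dict[str, Set[int]], min_count: int
) -> List[FrozenSet[str]]:
    """Depth-first search; the support of an itemset is the size of the intersected pidsets."""
    items = sorted(i for i, p in pidsets.items() if len(p) >= min_count)
    frequent: List[FrozenSet[str]] = []

    def dfs(prefix: Tuple[str, ...], pids: Set[int], start: int) -> None:
        for k in range(start, len(items)):
            new_pids = pids & pidsets[items[k]] if prefix else pidsets[items[k]]
            if len(new_pids) >= min_count:
                itemset = prefix + (items[k],)
                frequent.append(frozenset(itemset))
                dfs(itemset, new_pids, k + 1)

    dfs((), set(), 0)
    # Keep only the itemsets that have no frequent proper superset.
    return [x for x in frequent if not any(x < y for y in frequent)]


# Hypothetical event logs for four patients (pid -> events).
logs = {
    1: [("I10", 3), ("Z00", 1), ("I10", 7), ("M51", 5)],
    2: [("Z00", 2), ("I10", 4), ("M51", 6)],
    3: [("I10", 1), ("E66", 2)],
    4: [("Z00", 1), ("E66", 9)],
}
itemsets = {pid: project_first_onsets(ev) for pid, ev in logs.items()}
pidsets = build_pidsets(itemsets)
mfi = maximal_frequent_itemsets(pidsets, min_count=2)
```

With a minimum support count of 2, the toy data yields two maximal frequent itemsets, {E66} and {I10, M51, Z00}. Enumerating all frequent itemsets before filtering is acceptable only for small examples; a production miner would prune non-maximal candidates during the traversal.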




Table 1. Data analysis results for patients in prediabetes condition

Set                     2013        2013        2014        2014        2013-2014   2013-2014
Items (ICD-10 codes)    3 signs     4 signs     3 signs     4 signs     3 signs     4 signs
Patients                27,082      27,082      27,902      27,902      29,205      29,205
Outpatient Records      267,194     267,194     296,129     296,129     556,323     556,323
ICD-10 codes            1,142       4,701       1,145       4,834       1,257       5,503
minsup                  0.01        0.01        0.01        0.01        0.01        0.01
Total MFI               203         486         219         512         521         1,406
Longest MFI             5           8           5           9           6           9
Frequent Itemsets       608         7,452       689         8,935       1,909       32,093
Association Rules       686         58,299      810         78,052      2,722       381,012
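
Table 1 reports a relative minimum support and the number of association rules generated from the frequent itemsets. As a reminder of how these quantities are usually computed, the sketch below derives the absolute support threshold implied by minsup and evaluates support, confidence and lift for a single rule X -> Y. The function names are hypothetical, and the rule counts in the usage example are invented for illustration; they are not taken from the Register.

```python
import math
from typing import Dict


def min_support_count(minsup: float, n_patients: int) -> int:
    """Smallest absolute patient count that satisfies a relative minimum support."""
    return math.ceil(minsup * n_patients)


def rule_metrics(n_total: int, n_x: int, n_y: int, n_xy: int) -> Dict[str, float]:
    """Support, confidence and lift of the rule X -> Y from absolute counts."""
    support = n_xy / n_total            # share of patients having both X and Y
    confidence = n_xy / n_x             # share of patients with X that also have Y
    lift = confidence / (n_y / n_total)  # > 1 means X and Y co-occur more often than by chance
    return {"support": support, "confidence": confidence, "lift": lift}


# minsup and patient count of the 2013 collection, as given in Table 1.
threshold_2013 = min_support_count(0.01, 27082)

# Hypothetical counts: 1,000 patients, 200 with X, 250 with Y, 100 with both.
m = rule_metrics(n_total=1000, n_x=200, n_y=250, n_xy=100)
```

For the 2013 collection the threshold is 271 patients. In the invented example the rule has support 0.1, confidence 0.5 and lift 2.0.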



5.3 Experiments and Results

   We report results for patients with Diabetes Type 2 onset in 2015. The ORs of these patients for the period 2013-2014 were excerpted from the Diabetes Register; we assume that in this period the patients were in a pre-diabetes condition. The idea of the experiment is to check whether we can successfully discover risk factors for these patients by looking only at their ORs from 2013 and 2014. Then, by mapping our hypotheses to the real data for 2015, we test whether our approach is reasonable. (We note that, due to the relatively short period of observation and the lack of data about mortality, at the moment we cannot follow diabetes development over longer periods.)

   In the Register each OR, corresponding to a single visit, contains up to 4 diagnoses encoded in ICD-10. Some diagnoses are given as 4-sign encodings, i.e. in a more specific way, while others use the more general 3-sign encoding. Due to the hierarchical organization of ICD-10 we analyse two collections separately: the original, more specific one (with 4-sign codes, see Example 1) and a generalised one in which all diagnoses are mapped to their more general classes (with 3-sign codes, see Example 2). The examples present the collections of diagnoses for a patient with ID 2196365.

   Example 1:
I(2196365)={I10, M10.9, M10, K76.9, K76, L94.1, L94, M06.9, G57.9, Z00.8, H53, M51.1, M33.9}

   Example 2:
I(2196365)={I10, M10, K76, L94, M06, M51, M33, H53, Z00, G57}

    For some patients, the available ORs contain no information about certain attributes of the context information (Table 2). It is well known that missing data in medical documentation is inevitable. Thus some attribute values are replaced by the value NA, which is considered the most general value.

   For example, the context information for the patient with ID 2196365 is:

$Q(2196365) = \{\langle age, 58 \rangle, \langle gender, 1 \rangle, \langle region, 03 \rangle, \langle bmi, 29.32 \rangle, \langle hba1c, NA \rangle, \langle blood\_glucose, 6.39 \rangle, \langle hdl\_cholesterol, 1.15 \rangle\}$

Table 2. Data for attributes in the collections

A     attribute          2013       2014       2013-2014
a1    age                27,082     27,902     29,205
a2    gender             27,082     27,902     29,205
a3    region             27,082     27,902     29,205
a4    bmi                21,659     22,413     27,928
a5    HbA1c              153        238        370
a6    HDL cholesterol    4,917      4,815      6,952
a7    blood glucose      11,925     12,185     17,016

    One of the generated Maximal Frequent Itemsets (comorbidity classes) whose support contains the pid=2196365 is:
    MFI#12: Z00 I10 M51 #SUP: 453

    The following charts show the distribution of patients in the support of "MFI#12" according to their age (Figure 6), gender (Figure 7), BMI (Figure 8), and HDL cholesterol (Figure 9), respectively.

    We can observe that most patients in this support set have a higher risk of Diabetes Type 2, due to the presence of multiple risk factors such as obesity, medium or high levels of cholesterol, and hypertension (diagnosis with ICD-10 code I10).

[Bar chart "AGE": number of patients (axis 0 to 200) by age group 15-44, 45-59, 60-69, 70-89]
Figure 6 Age of the patients in the support set of "MFI#12"
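
The mapping from a context Q(p) to a feature vector v(p), in which each attribute with N_j possible values occupies N_j consecutive positions, can be sketched as a one-hot encoding. The sketch below is an illustration under stated assumptions, not the authors' code: it covers only age and BMI, takes the age bands from the labels of Figure 6 (adding "<15" and "90+" as assumed boundary groups), uses the usual WHO BMI cut-offs of 18.5, 25 and 30, and treats NA as one more category. All names are hypothetical.

```python
from typing import Dict, List, Optional

# Each attribute a_j with N_j categories occupies N_j consecutive vector positions.
CATEGORIES: Dict[str, List[str]] = {
    "age": ["<15", "15-44", "45-59", "60-69", "70-89", "90+", "NA"],
    "bmi": ["underweight", "normal", "overweight", "obesity", "NA"],
}


def age_group(age: Optional[float]) -> str:
    """Map an age in years to an age band; missing values become NA."""
    if age is None:
        return "NA"
    for upper, label in [(15, "<15"), (45, "15-44"), (60, "45-59"), (70, "60-69"), (90, "70-89")]:
        if age < upper:
            return label
    return "90+"


def bmi_group(bmi: Optional[float]) -> str:
    """Map a BMI value to a WHO class (assumed cut-offs 18.5 / 25 / 30)."""
    if bmi is None:
        return "NA"
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obesity"


def feature_vector(context: Dict[str, Optional[float]]) -> List[int]:
    """One-hot encode the categorized context, attribute by attribute."""
    groups = {"age": age_group(context.get("age")), "bmi": bmi_group(context.get("bmi"))}
    vector: List[int] = []
    for attr, labels in CATEGORIES.items():
        vector.extend(1 if label == groups[attr] else 0 for label in labels)
    return vector


q = {"age": 58, "bmi": 29.32}  # two of the attributes of Q(2196365)
v = feature_vector(q)
```

For the patient in the example, age 58 falls into the band 45-59 and BMI 29.32 into the class overweight, so exactly one position is set in each attribute block of `v`.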




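
The generalisation from 4-sign to 3-sign ICD-10 codes shown in Examples 1 and 2 amounts to truncating each code at the dot and removing duplicates. A minimal sketch, assuming codes are plain strings in the dotted form used in the examples (the function name is hypothetical):

```python
from typing import Iterable, Set


def generalise(codes: Iterable[str]) -> Set[str]:
    """Map 4-sign ICD-10 codes (e.g. 'M10.9') to their 3-sign class ('M10'), dropping duplicates."""
    return {code.split(".")[0] for code in codes}


# Codes of patient 2196365 as listed in Example 1 and Example 2.
example_1 = {"I10", "M10.9", "M10", "K76.9", "K76", "L94.1", "L94",
             "M06.9", "G57.9", "Z00.8", "H53", "M51.1", "M33.9"}
example_2 = {"I10", "M10", "K76", "L94", "M06", "M51", "M33", "H53", "Z00", "G57"}

three_sign = generalise(example_1)
```

Applied to the 13 codes of Example 1, the function returns the 10 codes of Example 2, which in turn contain the three items of MFI#12 (Z00, I10, M51).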
[Pie chart "GENDER": female 65%, male 35%]
Figure 7 Gender of the patients in the support set of "MFI#12"

[Pie chart "BMI": obesity 55%, overweight 36%, normal 9%, underweight 0%]
Figure 8 BMI of the patients in the support set of "MFI#12"

[Pie chart "HDL CHOLESTEROL": medium 56%, high 23%, low 21%]
Figure 9 Levels of HDL Cholesterol of the patients in the support set of "MFI#12"

   Data about HbA1c are available for only 3 out of 453 patients, which is why we treat this attribute as the more general value ANY. We note that the lack of HbA1c measurements is not surprising, because tests for HbA1c are made when Diabetes is diagnosed (and this happened in 2015 for the selected patient cohort).

    Data for blood glucose are available for only 30% of these patients, and for 50% of them the values were high.

    Deeper analyses reveal medical arguments why a higher risk exists especially for the patients in the support set of MFI#12: Z00 I10 M51 #SUP: 453. The diagnosis with ICD-10 code M51 (Thoracic, thoracolumbar, and lumbosacral intervertebral disc disorders) indicates that the patients have lower motor activity and a sedentary lifestyle, which contributes to overweight, obesity, and higher values of cholesterol and blood pressure, and therefore increases the risk of developing Diabetes. This is what actually happened in 2015. We note that in general the ICD-10 diagnosis M51 is not considered a risk for Diabetes, but our algorithm reveals this previously unknown, latent interrelationship.

6 Conclusion and Future Work

In this paper we present a software environment for the collection and processing of Big Data in medicine, a Data Intensive Domain. The Diabetes Register has been developed stepwise and its research functionality is still under construction. We believe that the integration of various technologies is the proper way to approach the challenges of large-scale information processing, because the integration ensures flexible multi-functionality and enables reuse of results.

    The nation-wide Diabetes Register of Bulgaria is now visible on the Internet⁵ together with some public statistical information. We plan to develop the Register further as a predictive and preventive tool, using the synergy of advanced technologies that make it possible to discover risk groups of patients who have a predisposition to various socially significant diseases. We have shown here that the present software environment is mature enough to identify patients with complexes of risk factors for the development of Diabetes, e.g. risks like: family history (relatives with Diabetes); obesity; arterial hypertonia (RR>140/90); low physical activity; giving birth to a baby with weight more than 4 kg or gestational Diabetes; established impaired fasting glycaemia or impaired glucose tolerance; other states of insulin resistance (e.g. acanthosis nigricans, a specific hyperpigmentation of the skin that might be due to endocrine disorders); HDL-cholesterol≤0.90 mmol/l or triglycerides≥2.2 mmol/l (≥2.82 mmol/l according to ADA); diagnosed polycystic ovarian syndrome, a cardio-vascular disease, or mental disorders etc. These risk factors are made explicit in the patient-related documents either by values of clinical tests or by keywords and typical phrases that describe the factor. The patients with predisposition suffer from disorders and syndromes that are diagnosed by various medical specialists in various time periods, but without any chance to establish connections between the medical doctors, e.g. a connection between a Psychiatrist and a Cardiologist who have consulted the patient. Elaborating the analytics facility of the Register further will provide functionality to monitor patient status over time, in the context of all available information, and to issue alerts for the coincidence of risk factors that open the door to Diabetes and other chronic diseases. In this way we believe that in the foreseeable future it will become possible to identify the Bulgarian citizens who have a predisposition to develop Diabetes Mellitus.

⁵ http://usbale.com/Register_Diabetes.htm
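
The alerting idea outlined in the conclusion, flagging the coincidence of several risk factors, could be prototyped as a simple rule check. The sketch below encodes only the numeric criteria quoted in the conclusion (RR>140/90, HDL-cholesterol≤0.90 mmol/l, triglycerides≥2.2 mmol/l) plus obesity taken as BMI ≥ 30. The record fields, the function names, the reading of RR>140/90 as "systolic above 140 or diastolic above 90", and the alert threshold of two factors are all assumptions made for illustration; this is not the Register's actual logic.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def risk_factors(record: Record) -> List[str]:
    """Return the risk factors met by a record; missing (None) values are skipped."""
    found: List[str] = []
    bmi = record.get("bmi")
    hdl = record.get("hdl")                      # mmol/l
    triglycerides = record.get("triglycerides")  # mmol/l
    systolic = record.get("systolic")
    diastolic = record.get("diastolic")
    if bmi is not None and bmi >= 30:
        found.append("obesity")
    # Assumed reading of RR>140/90: either component above its limit.
    if (systolic is not None and systolic > 140) or (diastolic is not None and diastolic > 90):
        found.append("arterial hypertonia")
    if hdl is not None and hdl <= 0.90:
        found.append("low HDL-cholesterol")
    if triglycerides is not None and triglycerides >= 2.2:
        found.append("high triglycerides")
    return found


def alert(record: Record, min_factors: int = 2) -> bool:
    """Raise an alert when at least min_factors risk factors coincide (assumed threshold)."""
    return len(risk_factors(record)) >= min_factors


# Hypothetical patient record.
patient = {"bmi": 31.4, "systolic": 150, "diastolic": 85, "hdl": 1.15, "triglycerides": None}
factors = risk_factors(patient)
```

Factors that are documented only as keywords or phrases in free text, such as family history or low physical activity, would need the text analysis components and are outside the scope of this sketch.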




Acknowledgements

The research presented here is partially supported by the grant 02/4 SpecialIZed Data MIning MethoDs Based on Semantic Attributes (IZIDA), funded by the National Science Fund in 2017-2019. The support of Medical University – Sofia, the Bulgarian Ministry of Health and the National Health Insurance Fund is acknowledged.

References
 [1] Carstensen, B. et al.: The Danish National Diabetes Register: Trends in incidence, prevalence and mortality. Diabetologia. 51(12), 2187-2196 (2008). doi: 10.1007/s00125-008-1156-z
 [2] Hallgren Elfgren, I. M., Grodzinsky, E., Törnvall, E.: The Swedish National Diabetes Register in clinical practice and evaluation in primary health care. Prim. Health Care Res. Dev. 17(6), 549-558 (2016). doi: 10.1017/S1463423616000098
 [3] Cooper, J. G., Thue, G., Claudi, T., Løvaas, K., Carlsen, S., Sandberg, S.: The Norwegian Diabetes Register for Adults – an overview of the first years. Norsk Epidemiologi. 23(1), 29-34 (2013)
 [4] O'Mullane, M., McHugh, S., Bradley, C. P.: Informing the development of a national diabetes register in Ireland: a literature review of the impact of patient registration on diabetes care. Inform. Primary Care. 18(3), 157-68 (2010)
 [5] Hallgren Elfgren, I. M., Törnvall, E., Grodzinsky, E.: The process of implementation of the diabetes register in Primary Health Care. Int. Journal of Qual. Health Care. 24(4), 419-424 (Aug 2012)
 [6] Meystre, S., Savova, G., Kipper-Schuler, K., Hurdle, J. F.: Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. IMIA Yearbook of Medical Informatics, pp. 138-154 (2008)
 [7] UMLS, the Unified Medical Language System. https://www.nlm.nih.gov/research/umls/
 [8] Denny, J. C., Irani, P. R., Wehbe, F. H., Smithers, J. D., Spickard, A.: The KnowledgeMap Project: Development of a Concept-Based Medical School Curriculum Database. In: AMIA Annu Symp Proc., pp. 195-199 (2003)
 [9] Chapman, W., Bridewell, W., Hanbury, P., Cooper, G. F., Buchanan, B.: A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. Univ. of Pittsburgh (2002)
[10] Gindl, S.: Negation Detection in Automated Medical Applications. TUW (2006)
[11] HITEx Manual: https://www.i2b2.org/software/projects/hitex/hitex_manual.html
[12] Liao, K. P., Cai, T., Savova, G. K., Murphy, S. N., Karlson, E. W., Ananthakrishnan, A. N., Gainer, V. S. et al.: Development of phenotype algorithms using electronic medical records and incorporating natural language processing. British Med. J., 350 (1): h1885 (2015)
[13] Boytcheva, S.: Shallow Medication Extraction from Hospital Patient Records. Studies in Health Technology and Informatics, vol. 166, pp. 119-128. IOS Press (2011)
[14] Tcharaktchiev, D., Angelova, G., Boytcheva, S., Angelov, Z., Zacharieva, S.: Completion of Structured Patient Descriptions by Semantic Mining. Studies in Health Technology and Informatics, vol. 166, pp. 260-269. IOS Press (2011). doi: 10.3233/978-1-60750-740-6-260
[15] Laney, D.: 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group Research Note, 6, 10 (2001). https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
[16] Top 238 Business Analytics Tools. Predictive Analytics Magazine (Feb 2012). http://www.predictiveanalyticstoday.com/top-business-intelligence-tools/
[17] Angelova, G., Nikolova, I., Angelov, Zh.: Embedding language technologies in a data analytics tool. Advances in Bulgarian Sciences, pp. 29-42. National Centre for Information and Documentation (2016). ISSN: 1314-3565
[18] Nasreen, S., Azam, M. A., Shehzad, K., Naeem, U., Ghazanfar, M. A.: Frequent Pattern Mining Algorithms for Finding Associated Frequent Patterns for Data Streams: A Survey. Procedia Computer Science, 37, 109-116 (2014)
[19] Boytcheva, S., Angelova, G., Angelov, Z., Tcharaktchiev, D.: Mining Comorbidity Patterns Using Retrospective Analysis of Big Collection of Outpatient Records. Health Inf Sci Syst. Journal, Springer (2017). ISSN: 2047-2501 (to appear)
[20] Tcharaktchiev, D., Zacharieva, S., Angelova, G., Boytcheva, S., et al.: Building a Bulgarian National Registry of Patients with Diabetes Mellitus. Journal of Social Medicine. 2, 19-21 (2015) (in Bulgarian)
[21] Boytcheva, S., Angelova, G., Angelov, Z., Tcharaktchiev, D.: Text Mining and Big Data Analytics for Retrospective Analysis of Clinical Texts from Outpatient Care. Cybernetics and Information Technologies, 15(4), 58-77 (2015). doi: 10.1515/cait-2015-0055
[22] Rabatel, J., Bringay, S., Poncelet, P.: Mining sequential patterns: a context-aware approach. Advances in Knowledge Discovery and Management, pp. 23-41. Springer (2013)
[23] Ziembiński, R. Z.: Accuracy of generalized context patterns in the context based sequential patterns mining. Control and Cybernetics. 40, 585-603 (2011)
[24] Yu, H. F., Hsieh, C. J., Chang, K. W., Lin, C. J.: Large linear classification when data cannot fit in memory. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4), 23 (2012)



