=Paper=
{{Paper
|id=Vol-2022/paper38
|storemode=property
|title=
Integrating Data Analysis Tools for Better Treatment of Diabetic Patients
|pdfUrl=https://ceur-ws.org/Vol-2022/paper38.pdf
|volume=Vol-2022
|authors=Svetla Boytcheva,Galia Angelova,Zhivko Angelov,Dimitar Tcharaktchiev
|dblpUrl=https://dblp.org/rec/conf/rcdl/BoytchevaAAT17a
}}
==
Integrating Data Analysis Tools for Better Treatment of Diabetic Patients
==
Svetla Boytcheva<sup>1</sup>, Galia Angelova<sup>1</sup>, Zhivko Angelov<sup>2</sup>, Dimitar Tcharaktchiev<sup>3</sup>

<sup>1</sup> Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria

<sup>2</sup> Adiss Lab Ltd., Sofia, Bulgaria

<sup>3</sup> Medical University Sofia, University Specialized Hospital for Active Treatment of Endocrinology, Sofia, Bulgaria

svetla.boytcheva@gmail.com, galia@lml.bas.bg, angelov@adiss-bg.com, dimitardt@gmail.com

Abstract. This paper presents the construction and usage of an anonymous Diabetes Register for patients in Bulgaria. The Register is generated automatically from outpatient records submitted to the Bulgarian National Health Insurance Fund in 2010-2014 and is continuously updated using outpatient records for 2015-2016. The construction relies on advanced automatic analysis of free text information as well as on Business Analytics technologies for storing, maintaining, searching, querying and analyzing data. Original frequent pattern mining algorithms make it possible to find patterns and sequences taking into account temporal information. The paper discusses the software environment as well as experiments in frequent pattern mining that enable knowledge discovery in the very large repository underlying the Register (currently 262 million pseudonymized outpatient records submitted to the Bulgarian National Health Insurance Fund in 2010-2016 for more than 5 million citizens yearly). The claim is that the synergy of modern analytics tools transforms a static archive of clinical patient records into a sophisticated software environment for knowledge discovery and prediction.

Keywords: synergy of data management, data mining and text mining tools; clinical data; frequent pattern mining; data analytics; natural language processing; knowledge discovery

Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10-13, 2017

===1 Introduction===

Medicine is known as a Data Intensive Domain: due to the recent penetration of Information and Communication Technologies (ICT) in all areas of our society, a rapidly increasing amount of medical data is produced by the healthcare sector, on the one hand, and by biomedical research on the other hand. In the healthcare sector, ICT applications support health diagnostics, development and maintenance of medical Electronic Health Records, telemedicine and telecare, patient administration, almost all aspects of healthcare management and healthcare delivery as well as medical education and training. In biomedical research, progress in deeper understanding of medical phenomena is sought by construction of big data models: e.g. virtual physiological human, models of brain, in computational genetics and so on. Public access to health information is changing the relationship between the patients and the health institutions that are responsible for care delivery. The monitoring and control function of patient organizations is facilitated by the modern ICT tools as well. Today we are still in an early phase of a long-term technological and social shift that will be implied by advancing further the ICT fundamentals and tools.

In this paper we present the integration of various ICT tools for automatic generation of a Diabetes Register for Bulgarian patients. The huge amount of clinical data, underpinning a repository of Outpatient Records (ORs), made it possible to construct interfaces that support both monitoring functionalities (oriented to the health management authorities) and research-oriented functionalities for knowledge discovery. The monitoring functionalities are based on business analytics while research tools use data mining and pattern search. The software environment also includes components for automatic analysis of free texts in Bulgarian. These components facilitate the Register generation and its update because they deliver values of clinical tests and lab data which are described as unstructured text only.

This paper is structured as follows. Section 2 overviews related work in several areas that are relevant to the subject: Diabetes registers, Natural Language Processing (NLP) for clinical narratives, Business Intelligence (BI) and analytics, Frequent Pattern Mining (FPM). Section 3 presents the experimental study context and summarizes the developments during the last 3-4 years (because the Register was built iteratively). Section 4 presents relevant achievements in automatic analysis of clinical narratives in Bulgarian language. Section 5 discusses recent algorithms for frequent pattern mining and presents experiments related to knowledge discovery in the Register repository. Section 6 contains the conclusion and plans for future work.

===2 Related Work===

====2.1 Diabetic Registers====

There are several nation-wide Diabetes Registers in the world, e.g. in Denmark [1], Sweden [2], Norway [3]. Registers explicate the number of patients who are diagnosed with Diabetes and provide good monitoring and control. Constructing registers is expensive and burdens the patients as well as the medical experts with additional administrative work. Furthermore, in some countries chronic disease management is not recognized as a part of general medical practice. As for the construction, most medical experts agree that Registers are a must since Diabetes is a chronic disease with significant social consequences. Electronic patient registration systems are proposed like the one in Ireland [4] (but it is not implemented yet). It is interesting to mention that in Sweden, during the Diabetic Register development phase 2001-2005, the registration rate of patients gradually increased and reached 75% which in 2010 still remains stable and is one of the highest in the country [5]. Thus infrastructure construction is a critical issue but data collection and update are further problems that can be solved only by persistence and diligence.

====2.2 NLP of Clinical Narratives====

Usually automatic analysis of clinical narratives is implemented partially: only fragments of the text are considered. The phrases, selected as “interesting”, are typically picked up due to the presence of a word or an entity which are considered “significant”. This approach for shallow analysis is called “Information Extraction” (IE). IE from clinical texts has matured only recently but its accuracy gradually improves and often exceeds 90% [6]. The review [6] stated in 2008 that “current applications are rarely applied outside of the laboratories they have been developed in, mostly because of scalability and generalizability issues”. Today, however, this is valid for languages other than English because, with the active contribution of numerous research groups in the USA, NLP for English clinical narratives has much better performance at present. Comprehensive language resources exist for English, such as UMLS [7] as well as tools like KnowledgeMap Concept Identifier [8] which processes clinical notes and returns CUIs (Concept Unique Identifiers) for the recognized UMLS terms. Another important tool is the public NegEx system which identifies and interprets negations in English clinical texts [9, 10]. We also mention the open-source cTAKES (clinical Text Analysis and Knowledge Extraction System, http://ctakes.apache.org/) and the Health Information Text Extraction (HITEx) system [11]. A recent study [12] enumerates the advantages of incorporating NLP for English in medical systems: it systematically links several terms to a concept using databases that standardize health terminologies; avoids manual work for searching term variations; increases the number of patients in the considered cohorts and thus increases the sensitivity of the recognition. Despite the NLP limitations, the conclusion is that NLP engines are powerful components ready for integration in medical data mining and, due to improvements expected in the future (e.g. more accurate mappings of terms to medical concepts), the importance of NLP as a valuable supporting technology will grow.

Here we consider NLP for Bulgarian clinical text. No comprehensive resources exist for Bulgarian medical language; the International Classification of Diseases ICD-10 is the only terminological resource which is available in electronic format. Our experience shows that within 2-3 years one can achieve good performance in separate extraction tasks. We apply software prototypes developed some years ago that are gradually improved. The most useful tools are a drug extractor (it finds in the free text the drug name, dosage, frequency and route of administration [13]) as well as an extractor of numeric values of lab data and clinical tests [14].

====2.3 Big Data, Business Intelligence Tools====

Big Data usually designates a massive volume of structured and unstructured data, too large or too dynamic to be processed by traditional software tools and techniques. The popular "3Vs" features of Big Data were first introduced by Gartner (previously META group): "high Volume, high Velocity, and/or high Variety" [15]. Wikipedia is an example of big data consisting of unstructured texts, images and hyperlinks. Big data analytics is the process of collecting, organizing and analyzing big data to discover useful information. Business Intelligence tools analyze big data of enterprises in order to provide historic, present and predicted views to the business processes. Predictive analytics for establishment of trends is the preferred functionality in contrast to databases that deal with data items and extract subsets of data values. Visualization is an important feature of BI tools because they show generalizations and tendencies in one screen [16]. Another necessary feature is the speed of processing since big data often appear in real time.

In our project we use a BITool which stores data in n-dimensional cubes and explores multi-dimensional data, i.e. hyperplanes [17]. The user can split the dataset into groups of objects with similar features. If a temporal dimension is included the user can track changes of object characteristics over time by animation. BITool enables the discovery of similar situations over time when a search pattern is specified for a particular period.

====2.4 Frequent Pattern Mining====

There are two principal tasks in pattern search: frequent pattern mining (FPM) where the events (objects) are considered as unordered sets, and frequent sequence mining (FSM). Approaches for solving the FPM task vary from the naïve BruteForce and Apriori algorithms, where the search space is organized as a prefix tree, to the Eclat algorithm that uses tidsets directly for support computation by processing prefix equivalence classes [18]. Most FPM and FSM methods do not consider contextual information about extracted patterns. They usually build a (huge) prefix tree. Most FPM algorithms generate all possible frequent patterns (FPs). Summarized information for data relations can be extracted as maximal frequent itemsets (MFI) in order to reduce redundancy and decrease significantly the number of FPs for post-analysis. All classic algorithms for FPM can be modified for MFI search.

The values that ORs record only in free text are extracted automatically from four XML fields: (i) Anamnesis: summarizes case history, previous treatments, often family history, risk factors; (ii) Status: summary of patient state, height, weight, BMI, blood pressure etc.; (iii) Clinical tests: values of clinical examinations and lab data listed in arbitrary order; (iv) Prescribed treatment: codes of drugs reimbursed by NHIF, free text descriptions of other drugs. Integration of large scale text analysis is a real novelty in this field.
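The tidset-based support computation that Section 2.4 attributes to Eclat, and the reduction of frequent itemsets to maximal ones, can be illustrated with a minimal sketch. It is a generic textbook-style Eclat written for illustration, not the algorithm of [19]; all function names and the toy transactions are assumptions introduced here, and an absolute count is used for the minimum support.

```python
def build_tidsets(transactions):
    """Vertical layout: map each item to the set of transaction ids containing it."""
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(prefix, candidates, minsup_count, out):
    """Depth-first search over prefix equivalence classes.

    candidates is a list of (item, tidset) pairs that may extend prefix;
    the support of an extended itemset is the size of the intersected tidset.
    """
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < minsup_count:
            continue
        itemset = prefix + (item,)
        out[frozenset(itemset)] = len(tids)
        extensions = []
        for other, other_tids in candidates[i + 1:]:
            common = tids & other_tids  # tidset intersection gives the joint support
            if len(common) >= minsup_count:
                extensions.append((other, common))
        eclat(itemset, extensions, minsup_count, out)


def frequent_itemsets(transactions, minsup_count):
    """Return {itemset: support count} for all itemsets meeting minsup_count."""
    out = {}
    eclat((), sorted(build_tidsets(transactions).items()), minsup_count, out)
    return out


def maximal_itemsets(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}


# Toy transactions (hypothetical data): transaction id -> set of items.
TOY = {
    1: {"I10", "M51", "Z00"},
    2: {"I10", "M51"},
    3: {"I10", "Z00"},
    4: {"I10", "M51", "Z00"},
}
FREQUENT = frequent_itemsets(TOY, minsup_count=3)
MAXIMAL = maximal_itemsets(FREQUENT)
```

With a minimum support count of 3 the toy data yields five frequent itemsets, of which only the two pairs containing I10 are maximal; this is the sense in which MFIs summarize the frequent patterns.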
We have proposed a novel algorithm for mining sets of events in order to identify strong co-occurrence of patterns [19]. It is a cascade data mining approach for FPM enriched with context information which aims at the discovery of complex relations between medical events with respective timestamps. Experiments with this approach are presented in Section 5 to illustrate the functionality of the Diabetes Register as a research tool.

===3 Experimental Study===

====3.1 Principal Objective====

A pseudonymized Register of diabetic patients was generated in 2015 from the Outpatient Records, collected by the Bulgarian National Health Insurance Fund (NHIF), in compliance with all legal requirements for safety and data protection [20]. The usual patient registration process was kept without burdening the medical experts with additional paper work. NHIF is the only obligatory Insurance Fund in Bulgaria so we note that working with ORs ensures 100% registration of all patients who contacted the healthcare system at all (however there are Bulgarian citizens who are not insured and some others who have ORs but are not properly diagnosed with Diabetes). The data repository, underpinning the Register, currently contains more than 262 million pseudonymised ORs submitted to the NHIF in 2010-2016 for more than 7.3 million Bulgarian citizens (more than 5 million yearly), including 483,836 diabetic patients.

In Bulgaria ORs are produced by General Practitioners (GPs) and Specialists from Ambulatory Care whenever they contact patients. Despite the primary accounting purpose ORs summarize sufficiently the case and motivate the requested reimbursement. They are semi-structured files with predefined XML-format. Many indicators in the Register copy the structured data submitted to NHIF in ORs: (i) date and time of the visit; (ii) pseudonymized personal data, age, gender; (iii) pseudonymised visit-related information; (iv) diagnoses in ICD-10; (v) NHIF drug codes for medications that are reimbursed; (vi) a code if the patient needs special monitoring; (vii) a code concerning the need for hospitalization; (viii) several codes for planned consultations, lab tests and medical imaging. ORs contain also important values presented in free text fields: glycated haemoglobin (HbA1c), body mass index (BMI), weight, blood glucose and blood pressure etc. These values are essential for a Diabetic Register, so they are extracted automatically.

====3.2 Analytics Using BITool====

Today the system BITool supports the Diabetes Register at the University Specialized Hospital for Active Treatment of Endocrinology "Acad. Ivan Penchev", Medical University – Sofia (this Hospital was authorized by the Bulgarian Ministry of Health to host the Register of diabetic patients in Bulgaria). BITool’s functionalities enable the monitoring of significant indicators like glycated hemoglobin (HbA1c) and blood glucose values. In this way the Register achieves its objective: to provide an adequate monitoring strategy for diabetic patients and to improve the healthcare and quality of life for the patients and their families. Two examples illustrate the services. Figure 1 shows the number of diabetic patients in the dimensions age-gender (at a certain moment). Here BITool operates on the structured information from the NHIF archive: patient pseudonym, age and gender. Further statistics of this kind might concern explorations of diabetic patients per region code, types of diabetes and diabetes complications, per GPs, per types of medication, according to frequency of visits etc.

Figure 1 Number of diabetic patients grouped by age

Figure 2 Reduction of HbA1c levels after application of incretin based drugs (https://www.drugs.com/drug-class/incretin-mimetics.html)

Figure 2 explores the tendency in the development of treatment. It displays the number of patients who had changes in the HbA1c levels within the interval [-5,5] units for a certain period of time. For most patients the HbA1c level decreased by 1 unit. The HbA1c levels are extracted from the free text of ORs for the corresponding patients with timestamp.

Finally we show the Register interface during the process of exploring the collection of ORs (Figure 3). The names and personal identifiers of patients and GPs are replaced by pseudonyms; only the name of the city/village remains in the address field.

Figure 3 Exploring outpatient records in the Register

===4 NLP for Bulgarian Clinical Narratives===

Design and implementation of software for automatic extraction of patient-related entities from a Big Data collection is quite a challenging task. One needs to scale up existing research prototypes to process millions of patient records, coping with noisy and missing data, and still providing reliable results. Some numeric entities refer to key risk factors for development of Diabetes Mellitus (levels of glycated hemoglobin HbA1c and blood glucose) and cardio-vascular diseases (high blood pressure). Unfortunately in the Bulgarian clinical practice these values are usually documented in free text paragraphs, presented in a huge variety of formats, so their automatic identification is difficult. We note that according to some studies, today more than 80% of the patient-related clinical information is stored as free text in the Electronic Health Record systems.

In [21] we proposed a hybrid method for automatic generation of grammar rules for IE from clinical data. The experiments were made and evaluated over approximately 9.5 million ORs. Here we cite only the evaluation of blood pressure extraction from the ORs of about 1,800,000 patients with arterial hypertension for a 3 year period: all available values are about 38.3 million and the extraction was performed with precision 92% and recall 98%. The variety of recording formats and explanations written by thousands of medical professionals requires constant evaluation of grammar coverage and extraction accuracy in general. Some of the main advantages of the proposed method, beyond its reliable performance and good precision in text mining, are the modularity, extensibility, and scalability.

===5 Research in Frequent Pattern Mining===

====5.1 Contextual Information====

Most FSM and FPM approaches do not use contextual information about extracted patterns. These algorithms extract general templates but do not answer the major question whether they are influenced in some way by the context and whether they are valid in various aspects.

Existing methods which search for patterns using contextual information are based on attributes that are organized into hierarchical structures and on attributes’ generalizations and specializations. Context information is organized as attributes of itemsets and tidsets. Attributes may have different organization: structured or unstructured. This makes it possible to explore the context-dependent templates.

Rabatel et al. [22] propose an approach in the marketing domain taking into account not only the transactions that have been made but also various attributes associated with customers like age, gender etc. Attributes have a hierarchical structure (<math>H(Age)</math>, <math>H(Gender)</math>) and patterns are explored at different levels of attribute abstraction, forming the lattice <math>H</math> (Figure 4). Traditional methods consider only the top level <math>[*,*]</math>, for any age and regardless of gender, i.e. without attributes. Rabatel et al. designed the algorithm Gespan and made experiments with about 100,000 product descriptions from amazon.com.

Figure 4 Structuring attributes in marketing domain (hierarchy H(age): * over young (y) and old (o); hierarchy H(gender): * over male (m) and female (f); lattice H with levels [*,*]; [y,*], [o,*], [*,m], [*,f]; [y,m], [y,f], [o,m], [o,f])

Ziembiński [23] proposes a new approach for extracting small contextual models from smaller collections of data that later are summarized in generalized models using information from contextual models with common information. This approach applies a metric for measuring distance of context models. All values for similarity assessment are normalized in the range between 0.0 and 1.0. Attribute values are considered identical if the similarity function returns 1.0. In the opposite case the result is 0.0. This approach allows extracting patterns for data that would otherwise have to be dropped out of the templates because of its dispersion and low frequency.

====5.2 Experimental Setup====

We apply a retrospective analysis for patients from the Diabetes Register with Diabetes Type 2. The period of interest is two years preceding the onset of the Diabetes Type 2, i.e. the so called prediabetes condition. In order to illustrate the potential of contextualized FPM we present results in searching comorbidities for patients in prediabetes condition. Text mining modules are used to convert raw text descriptions to structured event data.

Figure 5 System Architecture (pipeline: Outpatient Records; Preprocessing: Structured Information Processing, Text Mining, Data Modeling; Data Analysis: Structured Information Processing, Association Rules Generation, Context Information Processing; Prediction & Prevention Models: Comorbidity Analysis, Risk Factors)

The search space is very large: the database is big, the number of diseases is also large. We propose a tabular method using a vertical database, depth-first traversal as well as set intersection and diffsets [19]. Further processing of the maximal frequent itemsets (MFI) is applied to remove diagnostic-related groups. In addition some context information is added to each MFI to investigate comorbidities. Furthermore association rules with lift are generated. The context information is represented as attribute-value tuples for each patient; the post-processing identifies the importance of different attributes for each MFI.

The architecture of the experimental workbench is shown on Figure 5. Our research [19] aims to develop further the ideas of the two contextual approaches for data mining [22, 23].

For the collection <math>S</math> of ORs we extract the set of all different patient identifiers <math>P = \{p_1, p_2, \ldots, p_N\}</math>. This set corresponds to transaction identifiers (tids) and we call them pids (patient identifiers). We consider each patient visit to a doctor as a single event. For each patient <math>p_i \in P</math> an event sequence of tuples <math>\langle event, timestamp \rangle</math> is generated:

<math>E(p_i) = (\langle e_1, t_1 \rangle, \langle e_2, t_2 \rangle, \ldots, \langle e_{k_i}, t_{k_i} \rangle), \quad i = \overline{1, N}.</math>

Let <math>\mathcal{E}</math> be the set of all possible events and <math>\mathcal{T}</math> be the set of all possible timestamps. Let <math>I = \{id_1, id_2, \ldots, id_p\}</math> be the set of all disease codes in ICD-10 (International Classification of Diseases and Related Health Problems 10th Revision, http://apps.who.int/classifications/icd10/browse/2015/en), which we call items. Each subset <math>X \subseteq I</math> is called an itemset. We define a projection function <math>\pi: (\mathcal{E} \times \mathcal{T})^N \to 2^I</math>:

<math>\pi(E(p_i)) = I(p_i) = (id_{1i}, id_{2i}, \ldots, id_{mi}),</math>

such that for each patient <math>p_i \in P</math> the projected time sequence contains only the first occurrence (onset) of each disorder recorded in <math>E(p_i)</math>. Let <math>D \subseteq P \times 2^I</math> be the set of all itemsets in our collection after projection <math>\pi</math> in the format <math>\langle pid, itemset \rangle</math>. We shall call <math>D</math> a database. We are looking for itemsets <math>X \subseteq I</math> with frequency <math>sup(X)</math> above a given <math>minsup</math>. Let <math>\mathcal{F}</math> denote the set of all frequent itemsets, i.e.

<math>\mathcal{F} = \{X \mid X \subseteq I \text{ and } sup(X) \geq minsup\}.</math>

A frequent itemset <math>X \in \mathcal{F}</math> is called maximal if it has no frequent supersets. Let <math>\mathcal{M}</math> denote the set of all maximal frequent itemsets, i.e.

<math>\mathcal{M} = \{X \mid X \in \mathcal{F} \text{ and } \nexists\, Y \in \mathcal{F} \text{ such that } X \subset Y\}.</math>

Let <math>2^X</math> denote the power set (set of all subsets) of itemset <math>X</math>. Then each subset of <math>X \in \mathcal{F}</math> is also a frequent itemset, i.e. <math>Y \in 2^X</math> implies that <math>Y \in \mathcal{F}</math>. For each item <math>id \in I</math> we define the set called pidset:

<math>p(id) = \{p_i \mid \langle p_i, I(p_i) \rangle \in D \text{ and } id \in I(p_i)\}.</math>

To study the nature of comorbidities we need to investigate the context in which they occur. Therefore we add some semantic attributes to each event: demographics of patients, age and gender, treatment, status, lab data etc.

We define a set of attributes of interest <math>A = \{a_1, a_2, \ldots, a_k\}</math>. Context <math>Q</math> for some patient <math>p_i \in P</math> is defined as the set of attribute-value pairs from patient profile information:

<math>Q(p_i) = \{\langle a_1, q_1 \rangle, \langle a_2, q_2 \rangle, \ldots, \langle a_k, q_k \rangle\}.</math>

In order to decrease the number of possible values of attributes we apply some aggregation of data. For instance the age value is categorized according to the World Health Organization (WHO) standard age groups. Data for body mass index (BMI) are also categorized according to the WHO standard classification (WHO, BMI Classification, http://apps.who.int/bmi/index.jsp?introPage=intro_3.html): underweight, normal weight, overweight, obesity.

For some data concerning demographic information, like region ID, we have a large number of distinct values. For such data we add also some additional properties concerning background information for the region, e.g. whether it is south, north, west, east, central, northwest etc., and mountain, river, sea, thermal spring, urban region etc. For status and clinical test data we take the worst value for the period, according to the risk factors definition.

Of primary interest for Diabetes Type 2 are BMI, glycated haemoglobin, blood pressure (RR, Riva-Rocci), blood glucose, HDL-cholesterol.

From <math>Q(p_i)</math> we generate a feature vector <math>v(p_i) = (v_{1i}, v_{2i}, \ldots, v_{mi})</math>, where each attribute <math>a_j \in A</math> with <math>N_j</math> possible values is represented by <math>N_j</math> consecutive positions in the vector. For the set of maximal frequent itemsets <math>\mathcal{M}</math> with cardinality <math>|\mathcal{M}| = K</math> we have <math>K</math> classes of comorbidities. We apply classification of multiple classes in order to generate rules for each comorbidity class. We use large scale multi class classification because we deal with a big database and a large group of comorbidity classes. We use Support Vector Machines (SVM) and optimization based on the block minimization method described by Yu et al. [24].

{| class="wikitable"
|+ Table 1. Data analysis results for patients in prediabetes condition
! Set
! colspan="2" | 2013
! colspan="2" | 2014
! colspan="2" | 2013-2014
|-
! Items
! ICD-10, 3 signs !! ICD-10, 4 signs !! ICD-10, 3 signs !! ICD-10, 4 signs !! ICD-10, 3 signs !! ICD-10, 4 signs
|-
| Patients || 27,082 || 27,082 || 27,902 || 27,902 || 29,205 || 29,205
|-
| Outpatient Records || 267,194 || 267,194 || 296,129 || 296,129 || 556,323 || 556,323
|-
| ICD-10 codes || 1,142 || 4,701 || 1,145 || 4,834 || 1,257 || 5,503
|-
| minsup || 0.01 || 0.01 || 0.01 || 0.01 || 0.01 || 0.01
|-
| Total MFI || 203 || 486 || 219 || 512 || 521 || 1,406
|-
| Longest MFI || 5 || 8 || 5 || 9 || 6 || 9
|-
| Frequent Itemsets || 608 || 7,452 || 689 || 8,935 || 1,909 || 32,093
|-
| Association Rules || 686 || 58,299 || 810 || 78,052 || 2,722 || 381,012
|}

====5.3 Experiments and Results====

We report results for patients with Diabetes Type 2 onset in 2015. The ORs of these patients for the period 2013-2014 were excerpted from the Diabetes Register when, as we assume, these patients were in a pre-diabetes condition. The idea of this experiment is to check whether we can successfully discover risk factors for these patients looking only at their ORs in 2013 and 2014. Then, mapping our hypotheses to the real data for 2015, we test whether our approach is reasonable. (We note that due to the relatively short period of observation and lack of data about mortality, at the moment we cannot follow diabetes development in longer periods.)

In the Register each OR, corresponding to a single visit, contains up to 4 diagnoses encoded in ICD-10. Some diagnoses are presented by 4-sign encodings, i.e. in a more specific way, while others use the more general 3-sign encoding. Due to the hierarchical organization of ICD-10 we analyse two collections individually: the original one, which is more specific (with 4-sign codes, see Example 1), and one in which all diagnoses are generalised to more general classes (with 3-sign codes, see Example 2). The examples present collections of diagnoses for a patient with ID 2196365.

Example 1: I(2196365)={I10, M10.9, M10, K76.9, K76, L94.1, L94, M06.9, G57.9, Z00.8, H53, M51.1, M33.9}

Example 2: I(2196365)={I10, M10, K76, L94, M06, M51, M33, H53, Z00, G57}

For some patients, the available ORs contain no information about certain attributes of the context information (Table 2). It is well known that missing data in medical documentation is inevitable. Thus some attribute values are replaced by the value NA, which is considered as the most general value.

{| class="wikitable"
|+ Table 2. Data for attributes in the collections
! <math>A</math> !! attribute !! 2013 !! 2014 !! 2013-2014
|-
| <math>a_1</math> || age || 27,082 || 27,902 || 29,205
|-
| <math>a_2</math> || gender || 27,082 || 27,902 || 29,205
|-
| <math>a_3</math> || region || 27,082 || 27,902 || 29,205
|-
| <math>a_4</math> || bmi || 21,659 || 22,413 || 27,928
|-
| <math>a_5</math> || HbA1c || 153 || 238 || 370
|-
| <math>a_6</math> || HDL cholesterol || 4,917 || 4,815 || 6,952
|-
| <math>a_7</math> || blood glucose || 11,925 || 12,185 || 17,016
|}

One of the generated Maximal Frequent Itemsets (comorbidity class), whose support contains the pid=2196365, is:

MFI#12: Z00 I10 M51 #SUP: 453

The following charts show the distribution of patients in the support of "MFI#12" according to their age (Figure 6), gender (Figure 7), BMI (Figure 8), and HDL Cholesterol (Figure 9) correspondingly.

We can observe that most patients in this support set have a higher risk of Diabetes Type 2, due to the presence of multiple risk factors such as obesity, medium or high levels of cholesterol and hypertension (diagnosis with ICD-10 code I10).
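The data preparation steps described in Sections 5.2 and 5.3 (generalising 4-sign ICD-10 codes to 3-sign classes, aggregating BMI into WHO classes, encoding an attribute as consecutive vector positions, and scoring an association rule by lift) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the BMI cut-offs are the standard WHO adult thresholds, lift uses the usual definition sup(X ∪ Y) / (sup(X) · sup(Y)) with relative supports, and the pidsets in the last part are toy data. Only the codes of Example 1 and the BMI value 29.32 are taken from the text.

```python
def generalize_icd10(codes):
    """Collapse 4-sign ICD-10 codes such as 'M10.9' to their 3-sign class 'M10'."""
    return {code.split(".")[0] for code in codes}


def bmi_category(bmi):
    """WHO adult BMI classes; None stands for the missing value NA."""
    if bmi is None:
        return "NA"
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    return "obesity"


def one_hot(value, categories):
    """Represent one attribute by len(categories) consecutive 0/1 positions."""
    return [1 if value == category else 0 for category in categories]


def lift(pidsets, x, y, n_patients):
    """lift(X -> Y) = sup(X u Y) / (sup(X) * sup(Y)) with relative supports,
    where the support of an itemset is obtained by intersecting pidsets."""
    def support(itemset):
        return len(set.intersection(*(pidsets[i] for i in itemset))) / n_patients
    return support(x | y) / (support(x) * support(y))


# Diagnoses of patient 2196365 as listed in Example 1.
EXAMPLE_1 = {"I10", "M10.9", "M10", "K76.9", "K76", "L94.1", "L94",
             "M06.9", "G57.9", "Z00.8", "H53", "M51.1", "M33.9"}
EXAMPLE_2 = generalize_icd10(EXAMPLE_1)

# BMI 29.32 from the patient's context falls into the WHO class "overweight".
BMI_CLASSES = ["underweight", "normal weight", "overweight", "obesity", "NA"]
BMI_VECTOR = one_hot(bmi_category(29.32), BMI_CLASSES)

# Toy pidsets (hypothetical): item -> set of patient ids having that diagnosis.
TOY_PIDSETS = {"I10": {1, 2, 3, 4}, "M51": {1, 2, 4}, "Z00": {1, 3, 4}}
TOY_LIFT = lift(TOY_PIDSETS, {"I10"}, {"M51"}, n_patients=5)
```

Applied to Example 1, the generalisation step yields exactly the ten 3-sign codes of Example 2. A lift above 1 for a rule indicates that the two sides co-occur more often than independence would predict.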
For example the context information for the patient with ID 2196365 is:

<math>Q(2196365) = \{\langle age, 58 \rangle, \langle gender, 1 \rangle, \langle region, 03 \rangle, \langle bmi, 29.32 \rangle, \langle hba1c, NA \rangle, \langle blood\_glucose, 6.39 \rangle, \langle hdl\_cholesterol, 1.15 \rangle\}</math>

Figure 6 Age of the patients in the support set of "MFI#12" (bar chart over the age groups 15-44, 45-59, 60-69, 70-89)

Figure 7 Gender of the patients in the support set of "MFI#12" (pie chart: male 35%, female 65%)

Figure 8 BMI of the patients in the support set of "MFI#12" (pie chart: underweight 0%, normal 9%, overweight 36%, obesity 55%)

Figure 9 Levels of HDL Cholesterol of the patients in the support set of "MFI#12" (pie chart: low 21%, medium 56%, high 23%)

Data about HbA1c are available only for 3 out of 453 patients, which is why we consider this attribute as a more general value ANY. But we note that the lack of HbA1c measurements is not surprising because tests for HbA1c are made when the Diabetes is diagnosed (and this has happened in 2015 for the selected patient cohort).

Data for blood glucose are available only for 30% of these patients and for 50% of them the values were high.

Deeper analyses reveal medical arguments why a higher risk exists especially for the patients in the support set of MFI#12: Z00 I10 M51 #SUP: 453. The diagnosis with ICD-10 code M51 (Thoracic, thoracolumbar, and lumbosacral intervertebral disc disorders) means that the patients have lower motor activity and a sedentary lifestyle, which causes obesity, overweight, higher values of cholesterol and blood pressure and therefore increases the risk of developing Diabetes. Actually this has happened in 2015. We note that in general the ICD-10 diagnosis M51 is not considered risky for Diabetes. But our algorithm reveals this unknown and latent interrelationship.

===6 Conclusion and Future Work===

In this paper we present a software environment for collection and processing of Big Data in medicine, a Data Intensive Domain. The Diabetes Register has been developed stepwise and its research functionality is still under construction. We believe that the integration of various technologies is the proper way to approach the challenges of large-scale information processing because the integration ensures flexible multi-functionality and enables reuse of results.

The nation-wide Diabetes Register of Bulgaria is now available on the Internet (http://usbale.com/Register_Diabetes.htm) together with some public statistical information. We plan to develop the Register further as a predictive and preventive tool using the synergy of advanced technologies which make it possible to discover risk groups of patients that have predisposition to various socially-significant diseases. We have shown here that the present software environment is mature enough to identify patients with complexes of risk factors for development of Diabetes, e.g. risks like: family history (relatives with Diabetes); obesity; arterial hypertension (RR>140/90); low physical activity; giving birth to a baby with weight more than 4 kg or gestational Diabetes; established impaired fasting glycaemia or impaired glucose tolerance; other states of insulin resistance (e.g. acanthosis nigricans, a specific hyperpigmentation of the skin that might be due to endocrine disorders); HDL-cholesterol≤0.90 mmol/l or triglycerides≥2.2 mmol/l (≥2.82 mmol/l according to ADA); diagnosed polycystic ovarian syndrome, a cardio-vascular disease, or mental disorders etc. These risk factors are explicated in the patient-related documents either by values of clinical tests or by keywords and typical phrases that describe the factor. The patients with predisposition suffer from disorders and syndromes, diagnosed by various medical specialists in various time periods, but without any chance to establish connections between the medical doctors, e.g. a connection between a Psychiatrist and a Cardiologist that have consulted the patient. Elaborating further the analytics facility of the Register will provide functionality to monitor patient status over time, in the context of all available information, and to issue alerts for coincidence of risk factors that open the door to Diabetes and other chronic diseases. In this way we believe that in the foreseeable future it will become possible to identify the Bulgarian citizens who have predisposition to develop Diabetes Mellitus.

===Acknowledgements===

The research presented here is partially supported by the grant 02/4 SpecialIZed Data MIning MethoDs Based on Semantic Attributes (IZIDA), funded by the National Science Fund in 2017–2019. The support of Medical University – Sofia, the Bulgarian Ministry of Health and the National Health Insurance Fund is acknowledged.

===References===

[1] Carstensen, B. et al.: The Danish National Diabetes Register: Trends in incidence, prevalence and mortality. Diabetologia. 51(12), 2187–2196 (2008). doi: 10.1007/s00125-008-1156-z

[2] Hallgren Elfgren, I. M., Grodzinsky, E., Törnvall, E.: The Swedish National Diabetes Register in clinical practice and evaluation in primary health care. Prim. Health Care Res. Dev. 17(6), 549–558 (2016). doi: 10.1017/S1463423616000098

[3] Cooper, J. G., Thue, G., Claudi, T., Løvaas, K., Carlsen, S., Sandberg, S.: The Norwegian Diabetes Register for Adults – an overview of the first years. Norsk Epidemiologi. 23(1), 29–34 (2013)

[4] O'Mullane, M., McHugh, S., Bradley, C. P.: Informing the development of a national diabetes register in Ireland: a literature review of the impact of patient registration on diabetes care. Inform. Primary Care. 18(3), 157–68 (2010)

[5] Hallgren Elfgren, I. M., Törnvall, E., Grodzinsky, E.: The process of implementation of the diabetes register in Primary Health Care. Int. Journal of Qual. Health Care. 24(4), 419–424 (Aug 2012)

[6] Meystre, S., Savova, G., Kipper-Schuler, K., Hurdle, J. F.: Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research.

[13] Boytcheva, S.: Shallow Medication Extraction from Hospital Patient Records. Studies in Health Technology and Informatics, vol. 166, pp. 119–128. IOS Press (2011)

[14] Tcharaktchiev, D., Angelova, G., Boytcheva, S., Angelov, Z., Zacharieva, S.: Completion of Structured Patient Descriptions by Semantic Mining. Studies in Health Technology and Informatics, vol. 166, pp. 260–269. IOS Press (2011). doi: 10.3233/978-1-60750-740-6-260

[15] Laney, D.: 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group Research Note, 6, 10 (2001). https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

[16] Top 238 Business Analytics Tools. Predictive Analytics Magazine (Feb 2012). http://www.predictiveanalyticstoday.com/top-business-intelligence-tools/

[17] Angelova, G., Nikolova, I., Angelov, Zh.: Embedding language technologies in a data analytics tool. Advances in Bulgarian Sciences, pp. 29–42. National Centre for Information and Documentation (2016). ISSN: 1314-3565

[18] Nasreen, S., Azam, M. A., Shehzad, K., Naeem, U., Ghazanfar, M. A.: Frequent Pattern Mining Algorithms for Finding Associated Frequent Patterns for Data Streams: A Survey. Procedia Computer Science, 37, 109–116 (2014)

[19] Boytcheva, S., Angelova, G., Angelov, Z., Tcharaktchiev, D.: Mining Comorbidity Patterns Using Retrospective Analysis of Big Collection of Outpatient Records. Health Inf Sci Syst. Journal, Springer (2017). ISSN: 2047-2501 (to appear)
IMIA Yearbook of [20] Tcharaktchiev, D., Zacharieva, S., Angelova, G., Medical Informatics, pp. 138-154. (2008) Boytcheva, S., et al. Building a Bulgarian National [7] UMLS, the Unified Medical Language System. Registry of Patients with Diabetes Mellitus. https://www.nlm.nih.gov/research/umls/ Journal of Social Medicine. 2, 19-21 (2015) (in Bulgarian) [8] Denny, J. C., Irani, P. R., Wehbe, F. H., Smithers, J. D., Spickard, A.: The KnowledgeMap Project: [21] Boytcheva, S., Angelova, G., Angelov, Z., Development of a Concept-Based Medical School Tcharaktchiev, D.: Text Mining and Big Data Curriculum Database. In: AMIA Annu Symp Analytics for Retrospective Analysis of Clinical Proc., pp. 195–199. (2003) Texts from Outpatient Care. Cybernetics and Information Technologies, 15(4), 58-77 (2015). [9] Chapman, W., Bridewell, W., Hanbury, P., doi: 10.1515/cait-2015-0055 Cooper, G. F., Buchanan, B.: A Simple Algorithm for Identifying Negated Findings and Diseases in [22] Rabatel, J., Bringay, S., Poncelet, P.: Mining Discharge Summaries. Univ. of Pittsburgh (2002) sequential patterns: a context-aware approach. Advances in Knowledge Discovery and [10] Gindl, S.: Negation Detection in Automated Management, pp. 23-41. Springer (2013) Medical Applications. TUW (2006) [23] Ziembiński, R. Z.: Accuracy of generalized [11] HITEx Manual: https://www.i2b2.org/software/ context patterns in the context based sequential projects/hitex/hitex_manual.html patterns mining. Control and Cybernetics. 40, 585- [12] Liao, K. P., Cai, T., Savova, G. K., Murphy, S. N., 603 (2011) Karlson, E. W., Ananthakrishnan, A. N., Gainer, [24] Yu, H. F., Hsieh, C. J., Chang, K. W., Lin, C. J.: V. S. et al.: Development of phenotype algorithms Large linear classification when data cannot fit in using electronic medical records and incorporating memory. ACM Transactions on Knowledge natural language processing. British Med. J., 350 Discovery from Data (TKDD), 5(4), 23 (2012) (1): h1885 (2015) 237