Methods for the Intelligent Analysis of Biomedical Data E.V. Geger1, A.G. Podvesovskii2, S.A. Kuzmin2, V.P. Tolstenok2 emiliya_geger@mail.ru|apodv@tu-bryansk.ru|wolv3333@mail.ru|tolstenok21@yandex.ru 1 Bryansk Clinicodiagnostic Center, Bryansk, Russia 2 Bryansk State Technical University, Bryansk, Russia The paper discusses methodology of cleaning and analysis of small semi-structured samples of biomedical data. This methodology is aimed at statistical evaluation of harmful production factor correlation with workers’ laboratory test data. As a result of the analysis and interpretation of the data, a deviation from the norm is observed according to some indicators of a clinical blood test in individuals whose occupational activity is associated with harmful factors. Conclusions are drawn about the need for further research in the group of people whose work is related to harmful production factors. It is necessary to employ intelligent methods for analyzing possible health risks and their negative consequences in order to make management decisions. The presented assessment methodology can be used to create an occupational health and safety information system. Keywords: risk assessment, data analysis, harmful working conditions, statistical methods, data cleaning, model ensembles, Kohonen self-organizing maps. New analytical technologies can not only increase the 1. Introduction efficiency of medical institutions but also make it possible to solve such health problems as identifying diagnoses, medical In the modern world, the necessity often arises for auxiliary errors, associative connection of diagnoses with results of methods of preliminary disease detection at early stages. This laboratory tests and much more. problem is especially characteristic of people whose labor The objective of the present research has been to assess activity is associated with the constant impact of harmful relationship between occupational morbidity and harmful working conditions [6, 10]. production factors. So the experimental sample consisted of In turn, increased level of exposure to harmful substances individuals whose occupational activities were associated with and related industrial hazards significantly increase the harmful and dangerous working conditions. Another objective likelihood of developing occupational diseases and the risk of was development of a new methodology focused on processing injury [8]. and analyzing small semi-structured samples of biomedical data. Thus, occupational risk management is a complex of organizational and technical measures which should be based on 2. Existing Solutions reliable results of data analysis [9]. For the tasks of occupational risk assessment, it is necessary Among many proposed methods and tools for working with to use medical data analysis methods that correspond to these small samples, it is advisable to apply ensemble data analysis tasks. Choice of methods affects the construction of theoretical methods in combination with the basic classifier – Kohonen self- biomedical models and characteristics of experimental studies [7, organizing maps. 17]. An ensemble of models is a combination of several learning However, many experts note that biomedical data are often algorithms that, working together, help to build a model more unsuitable for processing using traditional software not only efficient and accurate than any of the models built using a because of their volume but also because of the variety of data separate algorithm. That is, to find a solution for one problem or types and speed at which they must be analyzed [1, 3, 15]. to prove a hypothesis, not one but several models are used. Therefore, a system for the intelligent analysis of medical data is Besides, the overall operating result matters and not that of a required, which could aggregate and analyze heterogeneous single separate model. information coming from different sources: electronic medical Formerly, researchers faced the problem of combining records, data from monitoring sensors, ultrasound and X-ray accuracy, simplicity and ease of interpretation in one method. A apparatus and other devices [4]. simple method could be used with a relatively easy interpretation If we consider real biomedical data, they have a number of but coming short of accuracy or vice versa, a complex but specific features that make them unstructured: presence of accurate method could be chosen that would be difficult to various data corruptions, such as omissions, extreme values, interpret. Ensembles of models have become a solution to this manual entry errors, incorrect information, high dimensionality problem as a universal way of improving the accuracy of and heterogeneity, a large number of noisy and duplicate data. methods. This leads to the unsuitability of most of the data or the entire Ensemble learning refers to the training of a finite set of basic sample for existing analysis algorithms [16]. classifiers with the subsequent combination of their forecasting Analysts believe that today, to solve many tasks of the results into a single forecast of an aggregated classifier. Thus, an healthcare system, it is necessary to go in the direction of aggregated classifier will give a more accurate result. structuring information and focusing on work with small samples The goal of combining models is to improve (enhance) the [5]. solution provided by a separate model. It is assumed that a single Particular attention should be paid to the so-called small model will never be able to achieve the efficiency that the samples, the volume of which is about 100-200 records. This, in ensemble will provide. turn, is very difficult and makes the use of existing methods of Self-organizing maps are a special type of artificial neural data processing and analysis ineffective. network allowing for non-linear regression and projection of The main obstacle to the manual analysis of small samples is multidimensional data onto a two-dimensional plane with the inability of the analyst to notice hidden patterns in presented preservation of distances in their original data space [2]. This data, while special analytical algorithms detect existing patterns approach can be applied in various fields, including biomedical with much greater efficiency. The main task of the analyst, in this data processing. case, is to interpret the results of these algorithms and filter out false and trivial patterns. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 3. Proposed Theoretical Solution features of this model, in addition to the fact that it does not allow meaningless data loss, are: To solve the research problems of working environment risk 1. Correction of empty values. assessment, we formed two groups: 2. Deletion of entries containing no information. Group I included individuals whose labor activity was 3. Recalculation of deviations by scores. associated with exposure to harmful production factors. 4. Recovery of missing scores by deviations. Group II consisted of individuals whose professional activity 5. Reassignment of diagnoses by ICD codes. lacked harmful production factor. 6. Assigning a unique number to each patient. The studies were carried out in the laboratory of Bryansk 7. Increasing the number of records suitable for analysis Clinical Diagnostic Center; the results were reflected in the without generating synthetic patients. medical information system "MAIS DC". 8. Download and comparison with ICD certified directory The list of occupational diseases was determined in of codes and their descriptions. accordance with the Order of the Ministry of Health and Social 9. Uploading data that has been damaged and Development of the Russian Federation No. 417n dated April 27, subsequently corrected for additional possible analysis. 2012 “On the Approval of the List of Occupational Diseases”. 10. Converting source data into a form suitable for Those working in harmful labor conditions are at risk for diseases intelligent analysis methods. associated with exposure to occupational physical factors [12]. At the second stage, development and construction of a The studies were carried out in accordance with the Order of model ensemble takes place. To simplify the development the Ministry of Health of the Russian Federation dated April 12, process, the following algorithm must be followed: 2011 No. 302n (as amended on February 06, 2018) “On the 1. Select a base model. Approval of Lists of Harmful and (or) Hazardous Occupational An ensemble can consist of classifiers of one type (e.g., only Factors and Kinds of Work that Require Mandatory Preliminary of decision trees or only of neural networks) or of classifiers of and Periodical Medical Examinations (Surveys), and the various types (decision trees, neural networks, regression Procedure for Conducting Mandatory Preliminary and Periodical models, etc.). Medical Examinations (Surveys)” [13]. 2. Define the approach to using the learning set. To assess the possible risk of diseases caused by harmful This can be resampling (several subsamples are extracted occupational factors, the values of clinical blood test scores were from the original learning set, each of which is used to train one used as a source of information. The experiment was carried out of the ensemble models) or the use of one learning set to train all in compliance with the ethical principles of biomedical research ensemble classifiers. and in accordance with the Federal Law of the Russian 3. Select a method for combining results. Federation No. 152 "On Personal Data" [14]. Three methods are usually used: voting (the class is selected For morbidity analysis, the "International Statistical that has been produced by a simple majority of ensemble Classification of Diseases and Related Health Problems" of the models), weighted voting (the result is delivered taking into tenth revision (ICD-10) was used [11]. account weights set for ensemble models) and averaging (the To solve the problem, a methodology was proposed output of the entire ensemble is defined as the simple average consisting of the following stages: value of outputs of all models; in weighted averaging, the outputs 1. Data cleaning. of all models are multiplied by the corresponding weights). 2. Model ensemble development. As a result of applying this algorithm, a basic ensemble of 3. Result interpretation. models has been obtained (Fig. 1), which can be further modified To carry out the first step, a special model has been designed and complexified. and developed, the task of which is to clean the source data and convert unstructured data into ordered ones. The resulting model can be roughly divided into five parts. Data import. It includes setting field names and labels, excluding empty fields, setting data types. Data preprocessing. It is a submodel that generates two sets of data at the output – intact data, containing no errors, and corrupted data, which are input to the next submodel. Data cleaning. This submodel receives damaged data as input and, after appropriate processing, provides output of cleaned data that does not contain initial errors. In particular, this submodel sets correct deviation values. In addition to the cleared data, the output of the submodel contains data sets with incorrect values in deviations and in scores. Data merge. In this part, aggregation of initially complete and cleared data takes place. In addition to this, using a code written in the JavaScript programming language, patient’s Fig. 1. Structure of the developed model ensemble unique identifier is generated and all diagnoses are divided into several records, so that three records containing one diagnosis are The data that has undergone preliminary cleaning at the first obtained from one record containing three diagnoses. This, in stage will be the input data. turn, allows for data structuring. The converted data can be used At the third stage, the results obtained during the ensemble when applying transaction analysis algorithm. Due to the errors application are interpreted. In most cases, for their qualitative in the diagnosis name, it was decided to download an excel file interpretation, it is necessary to contact specialists in the field of containing codes and names of 12257 diagnoses of the medicine. international classification of diseases of the tenth revision [11]. For automatic analysis and visualization, ready-made In the resulting data set, the initial field “Diagnosis Name” analytical platform Deductor Studio was used, and, for has been replaced by the name of the diagnosis from this file. preprocessing and data cleaning, Loginom software products Data export. It creates a file with the extension .txt, suitable developed by BaseGroupe Labs [18, 19] were used. for uploading to analytical platforms and further analysis. Thanks to the preprocessing of the initial sample, it is Also, this age group is characterized by significant positive possible either to level out completely or to minimize the errors deviations from the norm in terms of hemoglobin, red blood related to the human factor, which makes it possible to speak cells, and erythrocyte sedimentation rate. about sufficient correctness of the medical research. The increase in red blood cells is especially noticeable in the Unfortunately, small semi-structured samples cannot be called group of individuals whose work is associated with harmful sufficiently representative. However, the identified patterns may occupational factors. Erythrocytosis could be caused by external be of interest to experts or for further verification. factors. Survey results may indicate the presence of pathological 4. Proposed Practical Solution processes in the body. Additional studies are needed to diagnose the disease that has caused high content of red blood cells. Laboratory test analysis has been carried out in workers The methodology for analyzing small semi-structured whose occupation is related to harmful factors (Group I) and a samples of biomedical data considered in the article can be used control group (Group II). with adequate efficiency as an integral part of the risk analysis of Two small samples have been formed: the first sample diseases related to harmful occupational factors. Application of includes 100 records; the second one includes 207 records for the proposed method is demonstrated on specific actual data, each of the scores taken into account for each group. A total of 9 collection and consolidation of which has been carried out using scores of the general blood test were selected. medical automated information system. This provides reliable Data on the primary disease incidence in workers from evaluation of harmful occupational factor effects on working Groups I and II have been analyzed. The results of diagnoses population’s health indicators. processing are presented. The research results will help to define and evaluate The results of the study made it possible to identify the occupational risk factors that increase the likelihood of disease diagnoses that most often occurred in Group I and Group II: development and to draft proposals for the prevention of harmful E78.0 – Pure hypercholesterolemia; effects of occupational factors on working citizens’ health. G90 – Disorder of autonomic nervous system, unspecified; H52.4 – Presbyopia; 5. Conclusion I10 – Essential [primary] hypertension. The cross-table as an interactive tool for data representation The article discusses a methodology tested on specific actual and analytical processing has allowed creating a pivot table data, which allows pre-processing, cleaning and analysis of small which represents data on quantitative composition of diseases for biomedical data samples. each group. As a result of the analysis, there have been revealed no The use of Kohonen self-organizing maps helped to identify statistically significant difference in blood indices in individuals differences and patterns between the studied groups. from groups I and II. Examples of Kohonen self-organizing maps used are Also, the study of the diagnoses of the primary disease presented in Fig. 2. incidence in both groups has not revealed statistically significant differences between the groups. However, the established significant deviations from the normal content of red blood cells and eosinophils in the blood of individuals from Group I may indicate the presence of certain pathological processes in the body, in particular, autoimmune processes, identification of which requires additional research. In the future work perspective, it is advisable to conduct a research on these clinical blood test results using the traditional analysis of variance method regardless of normal intervals. This will allow for comparing the results obtained. Hereupon, it will be possible to draw final conclusions about the influence of specific occupational factors on the development of pathological processes. The results obtained help to increase the efficiency of detecting patterns in biomedical databases and are of interest for the construction of intelligent systems designed to analyze and assess human health. 6. Acknowledgements The reported study was funded by RFBR, project number 19- 07-00844. Fig.2. An example of data processing results using 7. Литература Kohonen self-organizing maps [1] Bruce McCormick (2014) Update in Anaesthesia. World Federation of Societies of Anaesthesiologists. 466 p. It should be noted that in both groups at least 24% of [2] Kohonen T. The Self-Organizing Map // Proceeding of employees were people aged 54 and over. In this regard, it has the IEEE. 1990. Vol. 78. P. 1464-1480. been suggested that these diseases may be associated primarily [3] Manyika J., Chui M., Brown B. et al. (2011) Big Data: with age-related changes, for example, age-related decrease in The Next Frontier for Innovation, Competition, and Productivity the accommodative ability of the eye associated with the natural / McKinsey Global Institute. process of aging of the lens, a prolonged and persistent blood [4] Baranov A.A., Namazova-Baranova L.S., Smirnova pressure increase, blood cholesterol increase. I.V. et al. Methods and Tools for Complex Intelligent Analysis of Medical Data. Trudy ISA RAN. Vol. 65. 2.2015. pp. 81-93. [5] Barriers and Prospects for Digital Transformation: Big Data Management Issues in the Healthcare Industry [Online]. – Available: http://www.medlinks.ru/article.php?sid=83028 (Accessed: July 23, 2019). [6] Geger E.V., Fedorenko S.I. Information Support of Decision-Making when Assessing the Risk for Occupational Morbidity Based on Analysis of Binary Samples // Proceedings of Southwest State University. Control, Computer Engineering, Information Science. Medical Instruments Engineering, no. 2 (27). 2018. pp. 101-107. [7] Healthcare will Show the Largest Increase in Data Generation by 2025 [Online]. – Available: http://apcmed.ru/news/news-all/zdravookhranenie-pokazhet- naibolshiy-rost-v-generatsii-dannykh-k-2025-godu/ (Accessed: July 23, 2019). [8] Izmerov N.F., Actualization of Occupational Morbidity Issues // Health Care of the Russian Federation (Zdravookhraneniye Rossiyskoy Federatsii.), no. 2. 2013. pp. 14- 17. [9] Ismailova L.N. Effective Management of Production Risks. // Economy and Business: Theory and Practice. 2016. No. 5. pp. 77-79. [10] Kostenko N.A. Working Conditions and Occupational Morbidity as a Basis for Risk Management of Workers' Health: abstract of cand. med. sci. diss. M., 2015. 21 p. [11] International Classification of Diseases of the tenth revision (ICD-10) [Online]. – Available: https://mkb-10.com (Accessed: July 20, 2019). [12] The Order of the Ministry of Health and Social Development of the Russian Federation No. 417n dated 27/04/2012 “On the Approval of the List of Occupational Diseases”. [Online]. – Available: http://base.garant.ru/70177874/ (дата обращения 23.05.2019). [13] The Order of the Ministry of Health of the Russian Federation dated 12/04/2011 No. 302n (as amended on 06/02/2018) “On the Approval of Lists of Harmful and (or) Hazardous Occupational Factors and Kinds of Work that Require Mandatory Preliminary and Periodical Medical Examinations (Surveys), and the Procedure for Conducting Mandatory Preliminary and Periodical Medical Examinations (Surveys)”. [Online]. – Available: http://base.garant.ru/12191202/ (Accessed: May 21, 2019). [14] The Federal Law dated 27/07/2006 No. 152-FZ (as amended on 29/07/2017) "On Personal Data". [Online]. – Available: http://base.garant.ru/5635295/ (Accessed: May 21, 2019). [15] Tsvetkova L.A., Cherchenko O.V. Big Data Technology in Medicine and Healthcare in Russia and the World // Information technologies for the Physician, 2016. No. 3. pp. 60-73. [16] Tsygankova, I.A. Method of Intelligent Processing of Biomedical Data [Text] / I.A. Tsygankova // Software Products and Systems. – 2009. –no. 3. – pp. 120-123. [17] Choporov O.N., Razinkin K.A. Optimization Model of Choice of the Initial Plan of Control Actions for Medical Information Systems / Control Systems and Information Technology. 2011. Vol. 46. No. 4.1. P. 185-187. [18] BaseGroup. Data Analysis Technologies [Online] // Available: https://basegroup.ru/ (Accessed: May 20, 2019). [19] Loginom [Online] // Available: https://loginom.ru/ (Accessed: May 20, 2019).